The three stages: transcription, translation, delivery
Every AI church translation system works in three stages:
- 1. Speech recognition (STT): The audio from your pastor's microphone is converted to text in real time. One engine commonly used by church translation tools is Deepgram, a neural speech recognition system optimised for low latency.
- 2. Translation (MT): The transcribed text is passed through a neural machine translation model (typically reached through Google's or DeepL's translation APIs, or an OpenAI language model) and converted to the target language.
- 3. Delivery (WebSocket streaming): The translated text is pushed to every attendee's browser or app in real time via WebSockets, the same technology that powers live sports updates and chat apps. Good systems achieve end-to-end latency of 300–600 ms.
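The three stages above can be sketched as a small asynchronous pipeline. This is only an illustration under stated assumptions: `transcribe`, `translate`, and `deliver` are hypothetical stand-ins, the "audio" is plain text bytes, the translation is a toy lookup table, and an `asyncio.Queue` per listener takes the place of a real WebSocket connection. It is not how any particular product is implemented.

```python
import asyncio

# Toy phrase table standing in for a neural translation model (assumption).
TOY_TRANSLATIONS = {("es", "grace and peace"): "gracia y paz"}


async def transcribe(audio_chunk: bytes) -> str:
    """Stage 1 (STT): hypothetical stub that treats the bytes as UTF-8 text."""
    return audio_chunk.decode("utf-8")


async def translate(text: str, target_lang: str) -> str:
    """Stage 2 (MT): look the phrase up, falling back to the source text."""
    return TOY_TRANSLATIONS.get((target_lang, text.lower()), text)


async def deliver(text: str, listeners: list) -> None:
    """Stage 3 (delivery): push the caption to every connected listener.

    Each listener is an asyncio.Queue here; a real system would write to
    a WebSocket connection instead.
    """
    for queue in listeners:
        await queue.put(text)


async def run_pipeline(chunks, target_lang: str, listeners: list) -> None:
    """Send each audio chunk through the three stages in order."""
    for chunk in chunks:
        source_text = await transcribe(chunk)
        translated = await translate(source_text, target_lang)
        await deliver(translated, listeners)


def demo() -> list:
    """Run one chunk through the pipeline and return what a listener saw."""
    async def main():
        listener = asyncio.Queue()
        await run_pipeline([b"Grace and peace"], "es", [listener])
        return [listener.get_nowait()]

    return asyncio.run(main())


if __name__ == "__main__":
    print(demo())
```

Because each stage waits on the one before it, the end-to-end latency an attendee sees is roughly the sum of the three stage latencies, which is why low-latency speech recognition matters so much.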
What makes church AI translation different
General-purpose translation models are trained on web content, news, and books, not church sermons. Church vocabulary (theological terms like 'propitiation', Bible references, and proper nouns like 'Gethsemane') can be mistranslated by general models. Purpose-built church translation tools address this through fine-tuned models (Kaleo AI), custom vocabulary systems (Voco's glossary), or AI prompt conditioning.
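One generic way a custom vocabulary can work is placeholder masking: glossary terms are swapped for tokens before translation and replaced with the preferred rendering afterwards. The sketch below is an assumption for illustration, not a description of how Voco, Kaleo AI, or any other product works. `GLOSSARY_ES` and `fake_translate` are hypothetical, and a real translation engine may not always leave placeholder tokens untouched.

```python
import re

# Hypothetical glossary: source term -> preferred Spanish rendering.
GLOSSARY_ES = {
    "propitiation": "propiciación",
    "Gethsemane": "Getsemaní",
}


def mask_terms(text: str, glossary: dict) -> tuple:
    """Replace glossary terms with numbered placeholders before translation."""
    restored = {}
    for i, (term, rendering) in enumerate(glossary.items()):
        token = f"__TERM{i}__"
        pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
        text, count = pattern.subn(token, text)
        if count:
            restored[token] = rendering
    return text, restored


def unmask_terms(text: str, restored: dict) -> str:
    """Swap each placeholder for the glossary's target-language rendering."""
    for token, rendering in restored.items():
        text = text.replace(token, rendering)
    return text


def fake_translate(text: str) -> str:
    """Stand-in for a general-purpose translation call (an assumption)."""
    return text.replace("Jesus prayed in", "Jesús oró en")


def translate_with_glossary(text: str, glossary: dict) -> str:
    """Mask glossary terms, translate the rest, then restore the terms."""
    masked, restored = mask_terms(text, glossary)
    return unmask_terms(fake_translate(masked), restored)


if __name__ == "__main__":
    print(translate_with_glossary("Jesus prayed in Gethsemane", GLOSSARY_ES))
```

Some translation APIs, including DeepL and Google Cloud Translation, offer their own glossary features, which may be preferable to masking by hand where they are available.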
What affects accuracy
- Audio quality — the biggest single factor. Clean audio from the sound desk beats any mic in the room.
- Speaking pace — clear, moderately paced speech transcribes better than very fast or very soft speech.
- Language pair — high-resource language pairs (English→Spanish, English→French) have better accuracy than lower-resource pairs.
- Custom vocabulary — configuring theological terms and proper nouns in the system's glossary improves accuracy for church-specific language.
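To see how much any of these factors matters for your own setup, one option is to compare a machine transcript against a hand-corrected one using word error rate (WER), a standard speech recognition metric. The function below is a minimal sketch: the name `word_error_rate` and the sample sentences are illustrative, and it ignores punctuation handling and other normalisation a real evaluation would need.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # Levenshtein distance over words, computed one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    reference = "he prayed in gethsemane"
    hypothesis = "he prayed in get seminary"
    print(word_error_rate(reference, hypothesis))
```

Running the same recording through the system before and after a change (a sound-desk feed instead of a room mic, or a glossary entry added) and comparing the two scores gives a rough sense of whether the change helped.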