Speech, Translation & Voice Agents
Give your AI apps a voice. Learn how to implement speech-to-text, text-to-speech, custom speech models, multimodal audio reasoning, and integrate speech as an agent modality.
Adding voice to AI
Speech capabilities let your AI hear and speak — turning a text chatbot into a voice assistant.
Speech-to-text (STT) converts what someone says into text the AI can process. Text-to-speech (TTS) converts the AI’s text response into spoken audio. Put them together with an agent, and you have a voice-powered AI assistant that can take calls, answer questions, and even translate between languages in real time.
Speech capabilities overview
| Capability | Service | What It Does |
|---|---|---|
| Speech-to-text (STT) | Azure AI Speech | Converts spoken audio to text transcripts |
| Text-to-speech (TTS) | Azure AI Speech | Converts text responses into natural-sounding speech |
| Custom speech | Azure AI Speech (custom model) | STT trained on your vocabulary (medical terms, product names) |
| Custom voice | Azure AI Speech (custom neural voice) | TTS with a voice unique to your brand |
| Speech translation (built-in) | Azure AI Speech (TranslationRecognizer) | Real-time speech-to-translated-text or speech-to-speech, built into the Speech SDK |
| Speech translation (pipeline) | STT + Azure Translator + TTS | Alternative approach: transcribe, translate text, synthesise — broader language coverage |
| Audio reasoning | Multimodal models | Model reasons about audio content directly (tone, emotion, context) |
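The two core capabilities in the table — STT and TTS — map directly to the `SpeechRecognizer` and `SpeechSynthesizer` classes in the Azure Speech SDK. A minimal sketch of one round trip (the key, region, and voice name are placeholders; `en-US-JennyNeural` is one of the standard neural voices):

```python
def transcribe_once(key: str, region: str) -> str:
    """Capture one utterance from the default microphone and return its transcript."""
    # pip install azure-cognitiveservices-speech
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)
    result = recognizer.recognize_once()  # blocks until one phrase is recognised
    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        return result.text
    return ""


def speak(key: str, region: str, text: str, voice: str = "en-US-JennyNeural") -> None:
    """Synthesise `text` as spoken audio on the default speaker."""
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    speech_config.speech_synthesis_voice_name = voice
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
    synthesizer.speak_text_async(text).get()
```

Running `speak(key, region, transcribe_once(key, region))` would simply echo back whatever the user said — the agent's reasoning step slots in between the two calls.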
Speech in agent workflows
| Feature | Text Agent | Voice-Enabled Agent |
|---|---|---|
| Input | User types a message | User speaks — STT converts to text |
| Processing | Agent reasons on text | Agent reasons on text (same as text agent) |
| Output | Agent returns text response | TTS converts response to speech |
| Use case | Chat widgets, messaging apps | Phone systems, accessibility, hands-free |
| Added complexity | None | STT accuracy, voice quality, latency |
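The table's key point is that only the input and output layers change — the agent's reasoning is identical. That structure can be sketched as a single turn with the three stages injected as callables, so the same loop works whether `listen` and `speak_fn` are backed by the Speech SDK or by a test stub (all names here are illustrative, not part of any SDK):

```python
from typing import Callable, Optional


def voice_agent_turn(
    listen: Callable[[], str],          # STT: capture speech, return transcript
    reason: Callable[[str], str],       # agent: text in, text out (unchanged from a text agent)
    speak_fn: Callable[[str], None],    # TTS: speak the response aloud
) -> Optional[str]:
    """One turn of a voice-enabled agent: STT -> text reasoning -> TTS."""
    transcript = listen()
    if not transcript:          # nothing recognised — skip reasoning and synthesis
        return None
    reply = reason(transcript)
    speak_fn(reply)
    return reply
```

Because the reasoning stage takes and returns plain text, an existing text agent can be voice-enabled without modifying the agent itself — the added complexity (STT accuracy, voice quality, latency) lives entirely in `listen` and `speak_fn`.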
Custom speech models
| Feature | Standard Speech | Custom Speech |
|---|---|---|
| Vocabulary | General-purpose | Trained on your domain terms |
| Accuracy | Good for common language | Excellent for industry jargon |
| Setup | Ready to use | Requires training data (audio + transcripts) |
| Use case | General transcription | Medical dictation, legal transcription, product names |
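Once a custom speech model is trained and deployed, using it is a one-line change: point the `SpeechConfig` at the custom model's endpoint ID instead of the default model. A sketch (the key, region, and endpoint ID are placeholders you get from your Speech resource and custom model deployment):

```python
def make_custom_speech_config(key: str, region: str, endpoint_id: str):
    """Build a SpeechConfig that routes recognition to a deployed custom speech model."""
    # pip install azure-cognitiveservices-speech
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    # endpoint_id identifies the custom model deployment; everything else
    # (recognizer classes, result handling) works exactly as with standard speech.
    speech_config.endpoint_id = endpoint_id
    return speech_config
```

A recognizer built from this config behaves identically to the standard one — only the underlying model, and therefore the accuracy on your domain vocabulary, changes.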
Multimodal audio reasoning
Newer multimodal models can process audio directly — not just transcribe it, but reason about it:
| Capability | What It Does | Example |
|---|---|---|
| Emotion detection | Identifies speaker emotion from audio | Detect frustrated caller before routing to specialist |
| Speaker diarization | Distinguishes between speakers (Azure AI Speech feature) | Separate customer and agent in a call recording |
| Audio event understanding | Understands non-speech audio cues | Detect background noise indicating emergency |
| Tone analysis | Assesses communication tone | Flag aggressive or threatening tone for escalation |
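Capabilities like emotion detection and tone analysis work by sending the raw audio to a multimodal model rather than a transcript. With an OpenAI-style chat API, audio is attached as a base64-encoded `input_audio` content part alongside a text instruction; the helper below only builds that message payload (the instruction text and the idea of pairing it with a frustration-detection prompt are illustrative):

```python
import base64


def build_audio_messages(wav_bytes: bytes, instruction: str) -> list:
    """Build a chat-style message pairing an instruction with raw WAV audio,
    using the OpenAI-style 'input_audio' content part."""
    audio_b64 = base64.b64encode(wav_bytes).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": instruction},
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }]


# Usage sketch (requires an audio-capable model deployment, e.g. gpt-4o-audio-preview):
#   messages = build_audio_messages(wav_bytes, "Rate the caller's frustration from 1-5.")
#   response = client.chat.completions.create(model="gpt-4o-audio-preview",
#                                             messages=messages)
```

Because the model receives the audio itself, it can pick up on pace, volume, and tone — signals that are lost entirely if you transcribe first and reason over text alone.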
Real-world example: Atlas Financial's voice compliance system
Atlas Financial integrates speech into their compliance agent:
Customer service calls:
- STT — Azure Speech transcribes the call in real-time
- Agent — Compliance agent monitors the transcript for regulatory keywords
- Audio reasoning — Multimodal model detects customer frustration levels
- Translation — If the customer switches languages, real-time translation kicks in
- TTS — Agent responses are spoken back to the customer in their language
Custom speech model: Trained on financial terminology (SWIFT codes, IBAN numbers, product names) for 98%+ transcription accuracy on banking-specific terms.
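The routing logic at the heart of a pipeline like Atlas Financial's — combine the transcript's regulatory keywords with the audio model's frustration signal to decide on escalation — is plain text-side code. A hypothetical sketch (the keyword list, threshold, and function name are all invented for illustration, not part of Atlas's actual system):

```python
# Hypothetical watchlist — a real compliance system would load this from policy config.
REGULATORY_KEYWORDS = {"guaranteed return", "insider", "wire transfer"}


def should_escalate(transcript: str, frustration: float) -> bool:
    """Escalate if the transcript hits a regulatory keyword OR the
    audio-derived frustration score (0.0-1.0) crosses a threshold."""
    text = transcript.lower()
    keyword_hit = any(keyword in text for keyword in REGULATORY_KEYWORDS)
    return keyword_hit or frustration >= 0.7
```

The point of the design: STT, audio reasoning, and translation each feed simple signals (a transcript, a score, a language code) into ordinary business logic — the speech services stay at the edges of the pipeline.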
Exam tip: Speech translation methods
Three approaches to speech translation:
- Azure AI Speech built-in translation — `TranslationRecognizer` in the Speech SDK. Best for real-time speech-to-speech with the lowest latency.
- STT + Azure Translator + TTS pipeline — transcribe, translate the text, synthesise. Broader language coverage than the built-in option.
- STT + LLM + TTS — transcribe, have an LLM translate with context awareness, synthesise. Best for nuanced, contextual translation.
For real-time call translation → Speech SDK built-in (lowest latency). For many languages → Translator pipeline. For context-heavy professional content → LLM-powered.
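The first approach — built-in translation via `TranslationRecognizer` — takes a `SpeechTranslationConfig` with one source language and any number of target languages. A sketch for the real-time call scenario (key, region, and the English/Mandarin/Hindi language choices are placeholders; `zh-Hans` and `hi` are the BCP-47 codes for Simplified Chinese and Hindi):

```python
def translate_speech_once(key: str, region: str,
                          source: str = "en-US",
                          targets: tuple = ("zh-Hans", "hi")) -> dict:
    """Recognise one utterance and return {target_language: translated_text}."""
    # pip install azure-cognitiveservices-speech
    import azure.cognitiveservices.speech as speechsdk

    translation_config = speechsdk.translation.SpeechTranslationConfig(
        subscription=key, region=region)
    translation_config.speech_recognition_language = source
    for target in targets:
        translation_config.add_target_language(target)

    recognizer = speechsdk.translation.TranslationRecognizer(
        translation_config=translation_config)
    result = recognizer.recognize_once()  # one utterance from the default microphone
    if result.reason == speechsdk.ResultReason.TranslatedSpeech:
        return dict(result.translations)  # e.g. {"zh-Hans": "...", "hi": "..."}
    return {}
```

Note that one call produces translations into every target language simultaneously — which is why the built-in approach is the practical choice for a multilingual call centre, as long as the languages involved are supported.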
Key terms
Knowledge check
NeuralMed builds a phone-based symptom checker. Patients describe symptoms by speaking, and the AI responds with guidance. The system needs to accurately transcribe medical terms like 'acetaminophen' and 'tachycardia'. Which approach should they use?
Kai's logistics call centre handles calls in English, Mandarin, and Hindi. They need real-time translation between all three languages during customer calls. Which approach is most practical?
🎬 Video coming soon