Speech, Translation & Voice Agents
Give your AI apps a voice. Learn how to implement speech-to-text, text-to-speech, custom speech models, multimodal audio reasoning, and integrate speech as an agent modality.
Adding voice to AI
Speech capabilities let your AI hear and speak — turning a text chatbot into a voice assistant.
Speech-to-text (STT) converts what someone says into text the AI can process. Text-to-speech (TTS) converts the AI’s text response into spoken audio. Put them together with an agent, and you have a voice-powered AI assistant that can take calls, answer questions, and even translate between languages in real time.
Speech capabilities overview
| Capability | Service | What It Does |
|---|---|---|
| Speech-to-text (STT) | Azure AI Speech | Converts spoken audio to text transcripts |
| Text-to-speech (TTS) | Azure AI Speech | Converts text responses into natural-sounding speech |
| Custom speech | Azure AI Speech (custom model) | STT trained on your vocabulary (medical terms, product names) |
| Custom voice | Azure AI Speech (custom neural voice) | TTS with a voice unique to your brand |
| Speech translation (built-in) | Azure AI Speech (TranslationRecognizer) | Real-time speech-to-translated-text or speech-to-speech, built into the Speech SDK |
| Speech translation (pipeline) | STT + Azure Translator + TTS | Alternative approach: transcribe, translate text, synthesise — broader language coverage |
| Audio reasoning | Multimodal models | Model reasons about audio content directly (tone, emotion, context) |
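The two core capabilities in the table — STT and TTS — map directly to the `SpeechRecognizer` and `SpeechSynthesizer` classes in the Azure Speech SDK. A minimal sketch of one round trip (the key, region, and voice name are placeholders; `en-US-JennyNeural` is one of the standard neural voices):

```python
def transcribe_once(key: str, region: str) -> str:
    """Capture one utterance from the default microphone and return its transcript."""
    # pip install azure-cognitiveservices-speech
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)
    result = recognizer.recognize_once()  # blocks until one phrase is recognised
    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        return result.text
    return ""


def speak(key: str, region: str, text: str, voice: str = "en-US-JennyNeural") -> None:
    """Synthesise `text` as spoken audio on the default speaker."""
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    speech_config.speech_synthesis_voice_name = voice
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
    synthesizer.speak_text_async(text).get()
```

Running `speak(key, region, transcribe_once(key, region))` would simply echo back whatever the user said — the agent's reasoning step slots in between the two calls.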
Speech in agent workflows
| Feature | Text Agent | Voice-Enabled Agent |
|---|---|---|
| Input | User types a message | User speaks — STT converts to text |
| Processing | Agent reasons on text | Agent reasons on text (same as text agent) |
| Output | Agent returns text response | TTS converts response to speech |
| Use case | Chat widgets, messaging apps | Phone systems, accessibility, hands-free |
| Added complexity | None | STT accuracy, voice quality, latency |
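The table's key point is that only the input and output layers change — the agent's reasoning is identical. That structure can be sketched as a single turn with the three stages injected as callables, so the same loop works whether `listen` and `speak_fn` are backed by the Speech SDK or by a test stub (all names here are illustrative, not part of any SDK):

```python
from typing import Callable, Optional


def voice_agent_turn(
    listen: Callable[[], str],          # STT: capture speech, return transcript
    reason: Callable[[str], str],       # agent: text in, text out (unchanged from a text agent)
    speak_fn: Callable[[str], None],    # TTS: speak the response aloud
) -> Optional[str]:
    """One turn of a voice-enabled agent: STT -> text reasoning -> TTS."""
    transcript = listen()
    if not transcript:          # nothing recognised — skip reasoning and synthesis
        return None
    reply = reason(transcript)
    speak_fn(reply)
    return reply
```

Because the reasoning stage takes and returns plain text, an existing text agent can be voice-enabled without modifying the agent itself — the added complexity (STT accuracy, voice quality, latency) lives entirely in `listen` and `speak_fn`.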
Custom speech models
| Feature | Standard Speech | Custom Speech |
|---|---|---|
| Vocabulary | General-purpose | Trained on your domain terms |
| Accuracy | Good for common language | Excellent for industry jargon |
| Setup | Ready to use | Requires training data (audio + transcripts) |
| Use case | General transcription | Medical dictation, legal transcription, product names |
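Once a custom speech model is trained and deployed, using it is a one-line change: point the `SpeechConfig` at the custom model's endpoint ID instead of the default model. A sketch (the key, region, and endpoint ID are placeholders you get from your Speech resource and custom model deployment):

```python
def make_custom_speech_config(key: str, region: str, endpoint_id: str):
    """Build a SpeechConfig that routes recognition to a deployed custom speech model."""
    # pip install azure-cognitiveservices-speech
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    # endpoint_id identifies the custom model deployment; everything else
    # (recognizer classes, result handling) works exactly as with standard speech.
    speech_config.endpoint_id = endpoint_id
    return speech_config
```

A recognizer built from this config behaves identically to the standard one — only the underlying model, and therefore the accuracy on your domain vocabulary, changes.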
Multimodal audio reasoning
Newer multimodal models can process audio directly — not just transcribe it, but reason about it:
| Capability | What It Does | Example |
|---|---|---|
| Emotion detection | Identifies speaker emotion from audio | Detect frustrated caller before routing to specialist |
| Speaker diarization | Distinguishes between speakers (Azure AI Speech feature) | Separate customer and agent in a call recording |
| Audio event understanding | Understands non-speech audio cues | Detect background noise indicating emergency |
| Tone analysis | Assesses communication tone | Flag aggressive or threatening tone for escalation |
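Capabilities like emotion detection and tone analysis work by sending the raw audio to a multimodal model rather than a transcript. With an OpenAI-style chat API, audio is attached as a base64-encoded `input_audio` content part alongside a text instruction; the helper below only builds that message payload (the instruction text and the idea of pairing it with a frustration-detection prompt are illustrative):

```python
import base64


def build_audio_messages(wav_bytes: bytes, instruction: str) -> list:
    """Build a chat-style message pairing an instruction with raw WAV audio,
    using the OpenAI-style 'input_audio' content part."""
    audio_b64 = base64.b64encode(wav_bytes).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": instruction},
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }]


# Usage sketch (requires an audio-capable model deployment, e.g. gpt-4o-audio-preview):
#   messages = build_audio_messages(wav_bytes, "Rate the caller's frustration from 1-5.")
#   response = client.chat.completions.create(model="gpt-4o-audio-preview",
#                                             messages=messages)
```

Because the model receives the audio itself, it can pick up on pace, volume, and tone — signals that are lost entirely if you transcribe first and reason over text alone.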
Real-world example: Atlas Financial's voice compliance system
Atlas Financial integrates speech into their compliance agent:
Customer service calls:
- STT — Azure Speech transcribes the call in real-time
- Agent — Compliance agent monitors the transcript for regulatory keywords
- Audio reasoning — Multimodal model detects customer frustration levels
- Translation — If the customer switches languages, real-time translation kicks in
- TTS — Agent responses are spoken back to the customer in their language
Custom speech model: Trained on financial terminology (SWIFT codes, IBAN numbers, product names) for 98%+ transcription accuracy on banking-specific terms.
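The routing logic at the heart of a pipeline like Atlas Financial's — combine the transcript's regulatory keywords with the audio model's frustration signal to decide on escalation — is plain text-side code. A hypothetical sketch (the keyword list, threshold, and function name are all invented for illustration, not part of Atlas's actual system):

```python
# Hypothetical watchlist — a real compliance system would load this from policy config.
REGULATORY_KEYWORDS = {"guaranteed return", "insider", "wire transfer"}


def should_escalate(transcript: str, frustration: float) -> bool:
    """Escalate if the transcript hits a regulatory keyword OR the
    audio-derived frustration score (0.0-1.0) crosses a threshold."""
    text = transcript.lower()
    keyword_hit = any(keyword in text for keyword in REGULATORY_KEYWORDS)
    return keyword_hit or frustration >= 0.7
```

The point of the design: STT, audio reasoning, and translation each feed simple signals (a transcript, a score, a language code) into ordinary business logic — the speech services stay at the edges of the pipeline.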
Exam tip: Speech translation methods
Three approaches to speech translation:
- Azure AI Speech built-in translation — `TranslationRecognizer` in the Speech SDK. Best for real-time speech-to-speech with the lowest latency.
- STT + Azure Translator + TTS pipeline — transcribe, translate the text, synthesise. Broader language coverage than the built-in option.
- STT + LLM + TTS — transcribe, have an LLM translate with context awareness, synthesise. Best for nuanced, contextual translation.
For real-time call translation → Speech SDK built-in (lowest latency). For many languages → Translator pipeline. For context-heavy professional content → LLM-powered.
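The first approach — built-in translation via `TranslationRecognizer` — takes a `SpeechTranslationConfig` with one source language and any number of target languages. A sketch for the real-time call scenario (key, region, and the English/Mandarin/Hindi language choices are placeholders; `zh-Hans` and `hi` are the BCP-47 codes for Simplified Chinese and Hindi):

```python
def translate_speech_once(key: str, region: str,
                          source: str = "en-US",
                          targets: tuple = ("zh-Hans", "hi")) -> dict:
    """Recognise one utterance and return {target_language: translated_text}."""
    # pip install azure-cognitiveservices-speech
    import azure.cognitiveservices.speech as speechsdk

    translation_config = speechsdk.translation.SpeechTranslationConfig(
        subscription=key, region=region)
    translation_config.speech_recognition_language = source
    for target in targets:
        translation_config.add_target_language(target)

    recognizer = speechsdk.translation.TranslationRecognizer(
        translation_config=translation_config)
    result = recognizer.recognize_once()  # one utterance from the default microphone
    if result.reason == speechsdk.ResultReason.TranslatedSpeech:
        return dict(result.translations)  # e.g. {"zh-Hans": "...", "hi": "..."}
    return {}
```

Note that one call produces translations into every target language simultaneously — which is why the built-in approach is the practical choice for a multilingual call centre, as long as the languages involved are supported.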
Key terms
Knowledge check
NeuralMed builds a phone-based symptom checker. Patients describe symptoms by speaking, and the AI responds with guidance. The system needs to accurately transcribe medical terms like 'acetaminophen' and 'tachycardia'. Which approach should they use?
Kai's logistics call centre handles calls in English, Mandarin, and Hindi. They need real-time translation between all three languages during customer calls. Which approach is most practical?
🎬 Video coming soon