AI-103 Study Guide

Domain 1: Plan and Manage an Azure AI Solution

  • Choosing the Right AI Model Free
  • Foundry Services: Your AI Toolkit Free
  • Retrieval, Indexing & Agent Memory
  • Designing AI Infrastructure
  • Deploying Models & CI/CD
  • Quotas, Scaling & Cost
  • Monitoring & Security
  • Responsible AI: Filters, Auditing & Governance

Domain 2: Implement Generative AI and Agentic Solutions

  • Connecting Your App to Foundry Free
  • Building RAG Applications
  • Workflows & Reasoning Pipelines
  • Evaluating AI Models & Apps
  • Agent Fundamentals: Roles, Goals & Tools Free
  • Building Agents with Retrieval & Memory
  • Agent Tools & Knowledge Integration
  • Multi-Agent Orchestration & Safeguards
  • Agent Monitoring & Error Analysis
  • Prompt Engineering & Model Tuning
  • Observability & Production Operations

Domain 3: Implement Computer Vision Solutions

  • Image & Video Generation
  • Multimodal Visual Understanding
  • Responsible AI for Visual Content

Domain 4: Implement Text Analysis Solutions

  • Text Analysis with Language Models
  • Speech, Translation & Voice Agents

Domain 5: Implement Information Extraction Solutions

  • Ingestion, Indexing & Grounding Pipelines
  • Extracting Content with Content Understanding
  • Exam Prep: Putting It All Together

Domain 4: Implement Text Analysis Solutions (~12 min read)

Speech, Translation & Voice Agents

Give your AI apps a voice. Learn how to implement speech-to-text, text-to-speech, custom speech models, and multimodal audio reasoning, and how to integrate speech as an agent modality.

Adding voice to AI

☕ Simple explanation

Speech capabilities let your AI hear and speak — turning a text chatbot into a voice assistant.

Speech-to-text (STT) converts what someone says into text the AI can process. Text-to-speech (TTS) converts the AI’s text response into spoken audio. Put them together with an agent, and you have a voice-powered AI assistant that can take calls, answer questions, and even translate between languages in real time.

Speech integration for AI-103 covers four areas:

  • Speech-to-text and text-to-speech for agentic interactions using Azure AI Speech
  • Custom speech models for domain-specific vocabulary and accents
  • Multimodal audio reasoning — models that can directly reason about audio content
  • Speech translation using Azure Translator or LLM-powered flows

Speech capabilities overview

| Capability | Service | What It Does |
| --- | --- | --- |
| Speech-to-text (STT) | Azure AI Speech | Converts spoken audio to text transcripts |
| Text-to-speech (TTS) | Azure AI Speech | Converts text responses into natural-sounding speech |
| Custom speech | Azure AI Speech (custom model) | STT trained on your vocabulary (medical terms, product names) |
| Custom voice | Azure AI Speech (custom neural voice) | TTS with a voice unique to your brand |
| Speech translation (built-in) | Azure AI Speech (TranslationRecognizer) | Real-time speech-to-translated-text or speech-to-speech, built into the Speech SDK |
| Speech translation (pipeline) | STT + Azure Translator + TTS | Alternative approach: transcribe, translate text, synthesise; broader language coverage |
| Audio reasoning | Multimodal models | Model reasons about audio content directly (tone, emotion, context) |
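
The STT and TTS rows above map to a few lines of the Speech SDK. A minimal setup sketch, assuming the `azure-cognitiveservices-speech` Python package and a provisioned Azure AI Speech resource; the key, region, and voice name are placeholder values, and error handling is omitted:

```python
# Minimal Azure AI Speech round trip. The subscription key, region,
# and voice name below are placeholders for your own resource values.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription="<YOUR_SPEECH_KEY>", region="<YOUR_REGION>"
)
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

# Speech-to-text: capture one utterance from the default microphone
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)
result = recognizer.recognize_once()
if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print("Heard:", result.text)

# Text-to-speech: speak a response through the default speaker
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
synthesizer.speak_text_async("Hello! How can I help?").get()
```

Note that `recognize_once` handles a single utterance; for ongoing calls you would switch to continuous recognition (`start_continuous_recognition`) with event callbacks.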

Speech in agent workflows

Text vs voice-enabled agents
| Feature | Text Agent | Voice-Enabled Agent |
| --- | --- | --- |
| Input | User types a message | User speaks; STT converts to text |
| Processing | Agent reasons on text | Agent reasons on text (same as text agent) |
| Output | Agent returns text response | TTS converts response to speech |
| Use case | Chat widgets, messaging apps | Phone systems, accessibility, hands-free |
| Added complexity | None | STT accuracy, voice quality, latency |
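
The voice-agent pipeline can be sketched as a single composition. This is an illustrative sketch, not SDK code: `stt`, `agent`, and `tts` are hypothetical stand-ins for real service calls (e.g. Azure AI Speech for the speech stages and a Foundry-hosted agent for the reasoning step).

```python
from typing import Callable

def voice_turn(
    audio: bytes,
    stt: Callable[[bytes], str],
    agent: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> bytes:
    """One conversational turn: speech in, speech out."""
    transcript = stt(audio)          # 1. speech-to-text
    reply_text = agent(transcript)   # 2. same reasoning as a text agent
    return tts(reply_text)           # 3. text-to-speech

# Wiring it with stubs shows the agent logic is untouched by the modality:
reply = voice_turn(
    b"<audio bytes>",
    stt=lambda a: "what are your opening hours?",
    agent=lambda t: "We are open 9 to 5.",
    tts=lambda t: t.encode(),
)
# reply is now b"We are open 9 to 5."
```

The point of the sketch is the middle line: the agent only ever sees text, which is why a text agent can be voice-enabled without changing its reasoning logic.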

Custom speech models

| Feature | Standard Speech | Custom Speech |
| --- | --- | --- |
| Vocabulary | General-purpose | Trained on your domain terms |
| Accuracy | Good for common language | Excellent for industry jargon |
| Setup | Ready to use | Requires training data (audio + transcripts) |
| Use case | General transcription | Medical dictation, legal transcription, product names |
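
Once a custom model is trained and deployed, pointing STT at it is a configuration change rather than new code. A sketch, assuming the `azure-cognitiveservices-speech` package; the key, region, and endpoint ID are placeholders for your own deployment values:

```python
# Using a deployed custom speech model: set its endpoint ID on the
# SpeechConfig. All values below are placeholders.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription="<YOUR_SPEECH_KEY>", region="<YOUR_REGION>"
)
speech_config.endpoint_id = "<CUSTOM_MODEL_ENDPOINT_ID>"  # from your deployment
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)
# recognizer now transcribes with your domain-trained model
```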

Multimodal audio reasoning

Newer multimodal models can process audio directly — not just transcribe it, but reason about it:

| Capability | What It Does | Example |
| --- | --- | --- |
| Emotion detection | Identifies speaker emotion from audio | Detect frustrated caller before routing to specialist |
| Speaker diarization | Distinguishes between speakers (Azure AI Speech feature) | Separate customer and agent in a call recording |
| Audio event understanding | Understands non-speech audio cues | Detect background noise indicating emergency |
| Tone analysis | Assesses communication tone | Flag aggressive or threatening tone for escalation |

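
As an illustration of the emotion-detection row, here is a hypothetical routing rule. The signal names (`emotion`, `confidence`) and the 0.7 threshold are assumptions for the sketch, not a real model's output contract; a real system would read these from the multimodal model's structured response.

```python
def route_call(emotion: str, confidence: float) -> str:
    """Escalate frustrated or angry callers to a human specialist."""
    if emotion in {"frustrated", "angry"} and confidence >= 0.7:
        return "specialist"
    return "ai_agent"

print(route_call("frustrated", 0.9))  # -> specialist
print(route_call("calm", 0.9))       # -> ai_agent
```
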
ℹ️ Real-world example: Atlas Financial's voice compliance system

Atlas Financial integrates speech into their compliance agent:

Customer service calls:

  1. STT — Azure Speech transcribes the call in real-time
  2. Agent — Compliance agent monitors the transcript for regulatory keywords
  3. Audio reasoning — Multimodal model detects customer frustration levels
  4. Translation — If the customer switches languages, real-time translation kicks in
  5. TTS — Agent responses are spoken back to the customer in their language

Custom speech model: Trained on financial terminology (SWIFT codes, IBAN numbers, product names) for 98%+ transcription accuracy on banking-specific terms.

💡 Exam tip: Speech translation methods

Three approaches to speech translation:

  • Azure AI Speech built-in translation — TranslationRecognizer in the Speech SDK. Best for real-time speech-to-speech with lowest latency.
  • STT + Azure Translator + TTS pipeline — Transcribe, translate text, synthesise. Broader language coverage than built-in.
  • STT + LLM + TTS — Transcribe, LLM translates with context awareness, synthesise. Best for nuanced/contextual translation.

For real-time call translation → Speech SDK built-in (lowest latency). For many languages → Translator pipeline. For context-heavy professional content → LLM-powered.
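
The pipeline approach above can be sketched as a three-stage composition. The stage functions are hypothetical stand-ins: in Azure they would map to the Speech SDK (STT/TTS) and the Translator text API.

```python
from typing import Callable

def translate_speech(
    audio: bytes,
    target_lang: str,
    stt: Callable[[bytes], str],
    translate: Callable[[str, str], str],
    tts: Callable[[str], bytes],
) -> bytes:
    """STT + Translator + TTS pipeline: speech in, translated speech out."""
    transcript = stt(audio)                          # transcribe source speech
    translated = translate(transcript, target_lang)  # translate the text
    return tts(translated)                           # synthesise in target language
```

Swapping the `translate` stage for an LLM call gives the third, context-aware approach without touching the rest of the pipeline; the Speech SDK's built-in TranslationRecognizer instead collapses the first two stages into one recognizer, which is where its latency advantage comes from.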

Key terms

Question

What is custom speech in Azure AI?

Answer

A feature that lets you train speech-to-text models on your domain-specific vocabulary and audio samples. Improves transcription accuracy for industry jargon, product names, and specialised terminology.

Question

What is multimodal audio reasoning?

Answer

The ability of AI models to process and reason about audio content directly — not just transcribe it. Includes detecting emotion, identifying speakers, understanding tone, and analysing non-speech audio cues.

Question

What is the speech integration pattern for agents?

Answer

STT converts user speech to text → agent processes text and generates text response → TTS converts response to speech. The agent's reasoning logic is the same as a text agent — speech is the I/O modality.

Knowledge check

NeuralMed builds a phone-based symptom checker. Patients describe symptoms by speaking, and the AI responds with guidance. The system needs to accurately transcribe medical terms like 'acetaminophen' and 'tachycardia'. Which approach should they use?

Kai's logistics call centre handles calls in English, Mandarin, and Hindi. They need real-time translation between all three languages during customer calls. Which approach is most practical?


© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.