
AI-901 Study Guide

Domain 1: AI Concepts and Capabilities

  • What is AI? Your First 10 Minutes Free
  • Responsible AI: The Six Principles Free
  • How Generative AI Actually Works Free
  • Choosing the Right AI Model Free
  • Deploying AI Models: Options & Settings
  • AI Workloads at a Glance
  • Text Analysis: Keywords, Entities & Sentiment
  • Speech: Recognition & Synthesis
  • Computer Vision: Seeing the World
  • Image Generation: Creating with AI
  • Information Extraction: From Chaos to Structure

Domain 2: Implement AI Solutions Using Foundry

  • Prompting Fundamentals: System & User Prompts
  • Microsoft Foundry: Your AI Command Center Free
  • Building a Chat App with the Foundry SDK
  • Agents in Foundry: Create & Test
  • Building an Agent Client App
  • Building a Text Analysis App
  • Multimodal: Responding to Speech
  • Azure Speech in Foundry Tools
  • Visual Prompts: Images as Input
  • Generating Images with AI
  • Building a Vision App
  • Content Understanding: Documents & Forms
  • Multimodal Extraction: Images, Audio & Video
  • Building an Extraction App
  • Exam Prep: Putting It All Together

Domain 1: AI Concepts and Capabilities — ~12 min read

Speech: Recognition & Synthesis

AI can listen and talk. Speech recognition converts voice to text; speech synthesis converts text to voice. Learn how both work, when to use each, and what Azure offers.

How does AI handle speech?

☕ Simple explanation

Speech AI works in two directions: listening and talking.

Listening (speech recognition) is like a court reporter — it hears what you say and types it out. Your phone’s voice typing, meeting transcription, and voice assistants all use this.

Talking (speech synthesis) is like a narrator — it reads text and speaks it aloud in a natural voice. GPS directions, audiobook readers, and accessibility tools all use this.

Modern speech AI is remarkably accurate — it handles accents, background noise, and even multiple speakers.

Speech AI encompasses two complementary capabilities. Speech-to-text (STT), also called automatic speech recognition (ASR), uses acoustic and language models to convert spoken audio into written text. Text-to-speech (TTS), also called speech synthesis, uses neural voice models to generate natural-sounding audio from written text.

Azure provides these through Azure AI Speech (part of Foundry Tools), which supports 100+ languages, custom voice models, real-time transcription, and conversation transcription with speaker diarisation.
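Both directions can be exercised from the Azure Speech SDK for Python. The sketch below shows the minimal shape of each call, assuming the `azure-cognitiveservices-speech` package is installed and you have a Speech resource key and region (the key/region parameters are placeholders, not real credentials):

```python
# Minimal sketch of both speech directions with the Azure Speech SDK.
# Assumes: pip install azure-cognitiveservices-speech, plus a Speech
# resource key and region. Imports are done lazily inside the functions
# so this module loads even where the SDK is not installed.

def transcribe_once(key: str, region: str) -> str:
    """Speech-to-text: capture one utterance from the default microphone."""
    import azure.cognitiveservices.speech as speechsdk  # lazy import
    config = speechsdk.SpeechConfig(subscription=key, region=region)
    recognizer = speechsdk.SpeechRecognizer(speech_config=config)
    result = recognizer.recognize_once()  # blocks until one utterance ends
    return result.text


def speak(key: str, region: str, text: str) -> None:
    """Text-to-speech: speak the given text through the default speaker."""
    import azure.cognitiveservices.speech as speechsdk  # lazy import
    config = speechsdk.SpeechConfig(subscription=key, region=region)
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=config)
    synthesizer.speak_text_async(text).get()  # wait for synthesis to finish
```

Note the symmetry: both directions start from the same `SpeechConfig`; only the client class (`SpeechRecognizer` vs `SpeechSynthesizer`) changes.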

Speech recognition (speech-to-text)

Converts spoken language into written text.

Key speech recognition features in Azure AI Speech

  • Real-time transcription — converts speech to text as it happens; ideal for live meetings and captions
  • Batch transcription — processes pre-recorded audio files; ideal for call centre recordings and podcasts
  • Speaker diarisation — identifies who is speaking: “Speaker 1 said... Speaker 2 replied...”
  • Custom speech models — fine-tune recognition for industry terms, accents, or noisy environments
  • Multi-language support — 100+ languages and regional variants supported
  • Pronunciation assessment — evaluates pronunciation accuracy; useful for language learning apps

DataFlow Corp scenario: DataFlow transcribes 10,000 customer support calls per day. They use:

  • Batch transcription to process call recordings overnight
  • Speaker diarisation to separate agent and customer speech
  • Custom speech model trained on their product names and technical terms
ℹ️ What is speaker diarisation?

Diarisation (also spelled “diarization”) is the process of identifying who spoke when in an audio recording with multiple speakers.

Without diarisation:

“Hello, I need help with my account. Sure, I can help you with that.”

With diarisation:

Speaker 1: “Hello, I need help with my account.”
Speaker 2: “Sure, I can help you with that.”

This is essential for meeting transcription, interview analysis, and customer support quality review.
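To make the before/after concrete, here is a small self-contained sketch (plain Python, not the Azure SDK) that turns diarised segments into a labelled transcript. It assumes a simplified list of (speaker, text) pairs, a stand-in for the richer result objects a real diarisation service returns:

```python
# Self-contained sketch: format diarised segments into a labelled transcript,
# assuming simplified (speaker, text) pairs as a diarisation service might
# return. Consecutive segments from the same speaker are merged.

def format_diarised(segments):
    """Merge consecutive same-speaker segments and label each line."""
    merged = []
    for speaker, text in segments:
        if merged and merged[-1][0] == speaker:
            merged[-1] = (speaker, merged[-1][1] + " " + text)
        else:
            merged.append((speaker, text))
    return "\n".join(f'{speaker}: "{text}"' for speaker, text in merged)


segments = [
    ("Speaker 1", "Hello, I need help"),
    ("Speaker 1", "with my account."),
    ("Speaker 2", "Sure, I can help you with that."),
]
print(format_diarised(segments))
# Speaker 1: "Hello, I need help with my account."
# Speaker 2: "Sure, I can help you with that."
```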

Speech synthesis (text-to-speech)

Converts written text into natural-sounding spoken audio.

Key speech synthesis features in Azure AI Speech

  • Neural voices — AI-generated voices that sound remarkably human, not robotic
  • Custom neural voice — create a unique brand voice from training data
  • SSML control — fine-tune pronunciation, speed, pitch, and emphasis using Speech Synthesis Markup Language
  • Multi-language — generate speech in 100+ languages
  • Speaking styles — adjust tone: cheerful, empathetic, newscast, customer service
  • Audio output formats — MP3, WAV, OGG, and streaming output

MediSpark scenario: MediSpark builds an accessibility feature for visually impaired patients. Their app uses speech synthesis to read appointment details, medication instructions, and lab results aloud — using a calm, empathetic speaking style.
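An SSML document for a scenario like this looks roughly as follows. The voice name and `style` value here are illustrative assumptions — each neural voice supports its own set of styles, so check the Azure AI Speech voice gallery before relying on a particular combination:

```python
# Illustrative SSML document built as a Python string. The voice name
# ("en-GB-LibbyNeural") and style ("empathetic") are assumptions for this
# sketch; consult the Azure AI Speech voice list for supported styles.
text = "Your appointment is on Tuesday at 10 a.m."
ssml = f"""\
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-GB">
  <voice name="en-GB-LibbyNeural">
    <mstts:express-as style="empathetic">
      <prosody rate="-10%">{text}</prosody>
    </mstts:express-as>
  </voice>
</speak>"""
print(ssml)
```

The `<prosody>` element slows delivery slightly; `<mstts:express-as>` is the Microsoft extension element that selects a speaking style.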

Speech translation

Azure AI Speech can also translate spoken language in real time:

  1. Input: spoken English
  2. Speech-to-text: English audio → English text
  3. Translation: English text → Spanish text
  4. Text-to-speech: Spanish text → Spanish audio
  5. Output: spoken Spanish
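The pipeline above can be sketched as plain function composition. The three functions below are hypothetical stubs standing in for the actual service calls — the point is only the order of operations, not the real APIs:

```python
# Sketch of the speech-translation pipeline. The three stage functions are
# hypothetical stubs for real STT / translation / TTS service calls.

def speech_to_text(audio):
    # Stub: a real call would send audio to the STT service.
    return audio["transcript"]


def translate(text, target):
    # Stub: stands in for the text-translation step.
    return f"[{target}] {text}"


def text_to_speech(text):
    # Stub: a real call would return synthesised audio bytes.
    return {"audio_of": text}


def translate_speech(audio, target="es"):
    text = speech_to_text(audio)           # Step 1: spoken English -> English text
    translated = translate(text, target)   # Step 2: English text -> Spanish text
    return text_to_speech(translated)      # Step 3: Spanish text -> Spanish audio


result = translate_speech({"transcript": "Hello team"}, target="es")
print(result)  # {'audio_of': '[es] Hello team'}
```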

GreenLeaf scenario: GreenLeaf’s field workers speak multiple languages. During team meetings, Azure Speech Translation provides real-time translation — each participant hears the meeting in their preferred language.

🎬 Video walkthrough

Speech Recognition & Synthesis — AI-901 Module 8 (~12 min) — video coming soon.

Flashcards

Question

What is the difference between speech recognition and speech synthesis?

Answer

Speech recognition (STT) converts spoken audio into written text. Speech synthesis (TTS) converts written text into spoken audio. They're opposite directions of the same technology.

Question

What is speaker diarisation?

Answer

The process of identifying who spoke when in a multi-speaker audio recording. The output labels each segment with the speaker: 'Speaker 1 said X, Speaker 2 replied Y.'

Question

What are neural voices?

Answer

AI-generated voices that sound remarkably human and natural. Unlike older robotic TTS, neural voices can express emotion, vary pace, and handle pronunciation naturally. Available in Azure AI Speech.

Question

What is SSML?

Answer

Speech Synthesis Markup Language — an XML-based language that lets you control pronunciation, speed, pitch, pauses, and emphasis in text-to-speech output.

Knowledge Check

1. DataFlow Corp needs to transcribe 10,000 recorded customer support calls each night and identify which parts were spoken by the agent vs the customer. Which combination of features do they need?

2. MediSpark builds an accessibility feature that reads appointment details aloud to visually impaired patients in a calm, empathetic tone. Which Azure AI Speech capability is this?


Next up: Computer Vision — how AI sees and understands images.



© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.