Speech: Recognition & Synthesis
AI can listen and talk. Speech recognition converts voice to text; speech synthesis converts text to voice. Learn how both work, when to use each, and what Azure offers.
How does AI handle speech?
Speech AI works in two directions: listening and talking.
Listening (speech recognition) is like a court reporter: it hears what you say and types it out. Your phone's voice typing, meeting transcription, and voice assistants all use this.
Talking (speech synthesis) is like a narrator: it reads text and speaks it aloud in a natural voice. GPS directions, audiobook readers, and accessibility tools all use this.
Modern speech AI is remarkably accurate: it handles accents, background noise, and even multiple speakers.
Speech recognition (speech-to-text)
Converts spoken language into written text.
| Feature | How It Works |
|---|---|
| Real-time transcription | Converts speech to text as it happens; ideal for live meetings and captions |
| Batch transcription | Processes pre-recorded audio files; ideal for call centre recordings and podcasts |
| Speaker diarisation | Identifies who is speaking: 'Speaker 1 said... Speaker 2 replied...' |
| Custom speech models | Fine-tune recognition for industry terms, accents, or noisy environments |
| Multi-language support | 100+ languages and regional variants supported |
| Pronunciation assessment | Evaluates pronunciation accuracy; useful for language-learning apps |
DataFlow Corp scenario: DataFlow transcribes 10,000 customer support calls per day. They use:
- Batch transcription to process call recordings overnight
- Speaker diarisation to separate agent and customer speech
- Custom speech model trained on their product names and technical terms
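The shape of DataFlow's overnight job can be sketched as a small batch pipeline. Everything below is illustrative: `transcribe_batch`, the `Segment` structure, and the custom model name are stand-ins for a real batch transcription service, not the actual Azure API.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str   # filled in by diarisation, e.g. "Speaker 1"
    text: str      # recognised text for this stretch of audio

def transcribe_batch(recording: str, custom_model: str,
                     diarisation: bool = True) -> list[Segment]:
    # Hypothetical stand-in for a batch speech-to-text call. A custom
    # speech model would be referenced by ID, and the service would
    # return diarised segments. Stubbed so the sketch runs offline.
    return [
        Segment("Speaker 1", "Hello, I need help with my account."),
        Segment("Speaker 2", "Sure, I can help you with that."),
    ]

def overnight_job(recordings: list[str]) -> dict[str, list[Segment]]:
    # Batch transcription: process every recording from the day in one pass.
    return {
        rec: transcribe_batch(rec, custom_model="dataflow-product-terms")
        for rec in recordings
    }

transcripts = overnight_job(["call_0001.wav", "call_0002.wav"])
for rec, segments in transcripts.items():
    print(rec)
    for seg in segments:
        print(f"  {seg.speaker}: {seg.text}")
```

In production this would be an asynchronous job: recordings are uploaded, the service processes them, and results are collected later rather than returned inline.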
What is speaker diarisation?
Diarisation (also spelled "diarization") is the process of identifying who spoke when in an audio recording with multiple speakers.
Without diarisation:
"Hello, I need help with my account. Sure, I can help you with that."
With diarisation:
Speaker 1: "Hello, I need help with my account." Speaker 2: "Sure, I can help you with that."
This is essential for meeting transcription, interview analysis, and customer support quality review.
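To make the before/after concrete, here is a tiny sketch that turns diarised segments into a labelled transcript. The `(speaker, text)` tuple format is an assumption for illustration; real diarisation output also carries timestamps.

```python
def format_transcript(segments: list[tuple[str, str]]) -> str:
    """Render diarised (speaker, text) segments as a labelled transcript."""
    return "\n".join(f'{speaker}: "{text}"' for speaker, text in segments)

segments = [
    ("Speaker 1", "Hello, I need help with my account."),
    ("Speaker 2", "Sure, I can help you with that."),
]
print(format_transcript(segments))
# Speaker 1: "Hello, I need help with my account."
# Speaker 2: "Sure, I can help you with that."
```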
Speech synthesis (text-to-speech)
Converts written text into natural-sounding spoken audio.
| Feature | Description |
|---|---|
| Neural voices | AI-generated voices that sound remarkably human (not robotic) |
| Custom neural voice | Create a unique brand voice from training data |
| SSML control | Fine-tune pronunciation, speed, pitch, and emphasis using Speech Synthesis Markup Language |
| Multi-language | Generate speech in 100+ languages |
| Speaking styles | Adjust tone: cheerful, empathetic, newscast, customer service |
| Audio output formats | MP3, WAV, OGG, and streaming output |
MediSpark scenario: MediSpark builds an accessibility feature for visually impaired patients. Their app uses speech synthesis to read appointment details, medication instructions, and lab results aloud, using a calm, empathetic speaking style.
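A sketch of how MediSpark might build the SSML for that empathetic read-out. The voice name (`en-US-AriaNeural`) and the `empathetic` style are examples from Azure's neural voice catalogue; check which styles your chosen voice actually supports.

```python
def build_ssml(text: str, voice: str = "en-US-AriaNeural",
               style: str = "empathetic", rate: str = "-10%") -> str:
    """Wrap text in Azure-flavoured SSML: a neural voice, a speaking
    style, and a slightly slower prosody rate for clarity."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">'
        f'<voice name="{voice}">'
        f'<mstts:express-as style="{style}">'
        f'<prosody rate="{rate}">{text}</prosody>'
        '</mstts:express-as></voice></speak>'
    )

print(build_ssml("Your appointment is at 2 pm on Thursday."))
```

The resulting string would be handed to a synthesiser's speak-SSML call (in the Azure Speech SDK for Python, `SpeechSynthesizer.speak_ssml_async`), which returns the audio in the configured output format.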
Speech translation
Azure AI Speech can also translate spoken language in real time:
- Input: spoken English
- Step 1: Speech-to-text (English text)
- Step 2: Translation (English text → Spanish text)
- Step 3: Text-to-speech (Spanish audio)
- Output: spoken Spanish
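The three steps above compose naturally into a pipeline. The functions below are stubs standing in for the speech-to-text, translation, and text-to-speech services, just to show the shape of the data flow from audio in to audio out.

```python
def speech_to_text(audio: bytes) -> str:
    # Step 1 stub: a real service would recognise the English audio.
    return "Where is the irrigation valve?"

def translate(text: str, source: str = "en", target: str = "es") -> str:
    # Step 2 stub: text-to-text translation (hard-coded for the demo).
    return "¿Dónde está la válvula de riego?"

def text_to_speech(text: str) -> bytes:
    # Step 3 stub: a real service would return synthesised Spanish audio.
    return text.encode("utf-8")

def translate_speech(audio: bytes) -> bytes:
    # Speech translation chains all three steps: audio in, audio out.
    return text_to_speech(translate(speech_to_text(audio)))

spanish_audio = translate_speech(b"<english audio>")
print(spanish_audio.decode("utf-8"))
```

In the Azure Speech SDK the first two steps are typically fused: a `TranslationRecognizer` performs recognition and translation in a single call, and synthesis is applied to the translated text.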
GreenLeaf scenario: GreenLeaf's field workers speak multiple languages. During team meetings, Azure Speech Translation provides real-time translation; each participant hears the meeting in their preferred language.
🎬 Video walkthrough
🎬 Video coming soon
Speech Recognition & Synthesis – AI-901 Module 8
~12 min
Knowledge Check
DataFlow Corp needs to transcribe 10,000 recorded customer support calls each night and identify which parts were spoken by the agent vs the customer. Which combination of features do they need?
MediSpark builds an accessibility feature that reads appointment details aloud to visually impaired patients in a calm, empathetic tone. Which Azure AI Speech capability is this?
Next up: Computer Vision β how AI sees and understands images.