Azure Speech in Foundry Tools
Build a lightweight speech app using Azure AI Speech, the dedicated service for speech recognition, synthesis, and translation within Foundry Tools.
Building with Azure AI Speech
Azure AI Speech is like giving your app ears and a voice.
In the last module, you used GPT-4o to process audio directly. This module uses Azure AI Speech, a dedicated service that's optimised specifically for speech tasks. It's faster for pure transcription, supports 100+ languages, and gives you fine-grained control over voice output.
Think of it as the specialist vs the generalist: GPT-4o can do everything, but Azure Speech does speech tasks better and cheaper.
Building a speech-to-text app
```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription="your-speech-key",
    region="your-region"
)
speech_config.speech_recognition_language = "en-NZ"

# Recognise a single utterance from the default microphone
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)

print("Speak now...")
result = recognizer.recognize_once()

if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(f"You said: {result.text}")
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("Speech not recognised")
```
Building a text-to-speech app
```python
speech_config = speechsdk.SpeechConfig(
    subscription="your-speech-key",
    region="your-region"
)

# Choose a neural voice
speech_config.speech_synthesis_voice_name = "en-NZ-MollyNeural"

synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text(
    "Welcome to MediSpark. Your appointment is confirmed for Tuesday at 2 PM."
)
```
Using SSML for fine-grained control
SSML (Speech Synthesis Markup Language) lets you control how the AI speaks:
```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-NZ">
  <voice name="en-NZ-MollyNeural">
    <prosody rate="slow" pitch="+5%">
      Welcome to MediSpark.
    </prosody>
    <break time="500ms"/>
    Your appointment is confirmed for
    <emphasis level="strong">Tuesday at 2 PM</emphasis>.
  </voice>
</speak>
```
| SSML Element | What It Controls |
|---|---|
| `prosody` | Speed (rate), pitch, and volume |
| `break` | Pauses between phrases |
| `emphasis` | Stress on specific words |
| `voice` | Which neural voice to use |
| `say-as` | How to pronounce dates, numbers, and addresses |
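`say-as` is especially useful for appointment details, where "14-05-2025" should be read as a date rather than digit by digit. A minimal sketch of building such an SSML string in Python, using the same voice as the sample above (the `appointment_ssml` helper is a hypothetical convenience function, not part of the Speech SDK):

```python
import html

def appointment_ssml(date_text: str, time_text: str) -> str:
    """Build an SSML string that reads an appointment slowly, with the
    date and time interpreted via say-as. (Illustrative helper, not an
    SDK API; voice and structure mirror the sample above.)"""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-NZ">'
        '<voice name="en-NZ-MollyNeural">'
        '<prosody rate="slow">Your appointment is confirmed for '
        f'<say-as interpret-as="date" format="dmy">{html.escape(date_text)}</say-as>'
        ' at '
        f'<say-as interpret-as="time" format="hms12">{html.escape(time_text)}</say-as>.'
        '</prosody></voice></speak>'
    )

ssml = appointment_ssml("14-05-2025", "2:00 PM")
# With a configured synthesizer, play it with the SDK's speak_ssml method:
# synthesizer.speak_ssml(ssml)
```

Escaping the interpolated text with `html.escape` keeps the SSML well-formed even if the date or time string contains characters like `&`.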
Combining speech with AI
The most powerful pattern combines Azure Speech with an LLM:
User speaks → Azure Speech (STT) → GPT-4o (reasoning) → Azure Speech (TTS) → AI responds aloud
MediSpark scenario: MediSpark builds a voice-enabled patient assistant:
- Patient speaks: "When is my next appointment?"
- Azure Speech transcribes the question
- GPT-4o queries the appointment system and generates a response
- Azure Speech reads the response aloud in a warm, empathetic neural voice
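The four steps above can be sketched as a small pipeline. The Azure Speech and GPT-4o calls are passed in as functions here, since wiring them up requires keys and endpoints; the names `transcribe`, `reason`, `speak`, and `voice_assistant_turn` are illustrative, not SDK APIs:

```python
from typing import Callable

def voice_assistant_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],   # Azure Speech STT, e.g. recognize_once()
    reason: Callable[[str], str],         # GPT-4o call that generates a reply
    speak: Callable[[str], None],         # Azure Speech TTS, e.g. speak_text()
) -> str:
    """One turn of the voice assistant: STT -> LLM -> TTS."""
    question = transcribe(audio)   # steps 1-2: capture and transcribe the speech
    reply = reason(question)       # step 3: GPT-4o reasons over the question
    speak(reply)                   # step 4: read the reply aloud
    return reply
```

Keeping each stage behind a plain function makes the pipeline easy to test with stubs before connecting real services.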
Continuous recognition for long conversations
`recognize_once()` listens for a single phrase. For ongoing conversations, use continuous recognition:
```python
# Subscribe to the 'recognized' event so each phrase is handled as it arrives
recognizer.recognized.connect(lambda evt: print(evt.result.text))

recognizer.start_continuous_recognition()
# ... recognizer fires events as speech is detected
recognizer.stop_continuous_recognition()
```

This is essential for meeting transcription, live captioning, and voice-controlled applications where the user speaks continuously.
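For meeting transcription you typically accumulate each recognised phrase rather than printing it. A sketch of that event-handling pattern, assuming a recognizer configured as above (the `make_transcript_collector` helper is illustrative, not part of the SDK):

```python
def make_transcript_collector():
    """Return (handler, segments): the handler appends each recognised
    phrase to segments, and is meant to be connected to the SDK's
    'recognized' event. (Illustrative helper, not an SDK API.)"""
    segments: list[str] = []

    def on_recognized(evt) -> None:
        # evt.result.text holds the recognised phrase for this event;
        # skip empty results (e.g. silence)
        if evt.result.text:
            segments.append(evt.result.text)

    return on_recognized, segments

# With a real recognizer (assumed configured as in the snippets above):
# handler, segments = make_transcript_collector()
# recognizer.recognized.connect(handler)
# recognizer.start_continuous_recognition()
# ... later ...
# recognizer.stop_continuous_recognition()
# print(" ".join(segments))
```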
🎬 Video walkthrough
Video coming soon: Azure Speech in Foundry – AI-901 Module 19 (~14 min)
Knowledge Check
MediSpark wants their patient assistant to speak appointment confirmations in a calm, slow pace with emphasis on the date and time. Which Azure Speech feature enables this level of control?
DataFlow Corp needs to transcribe a 2-hour recorded meeting, identifying who said what. Which Azure Speech features do they need?
Next up: Visual Prompts – sending images to AI and getting intelligent responses.