Multimodal: Responding to Speech
Modern AI models can hear you. Learn how to send spoken prompts to a multimodal model and get intelligent responses, combining speech recognition with AI reasoning.
AI that listens and responds
Instead of typing your question, you can speak it, and the AI responds intelligently.
Think of it like talking to a voice assistant, but much smarter. You say: "Look at this chart and tell me what the trend is." The AI hears your voice, understands the question, looks at the chart, and gives you a thoughtful answer, all in one step.
This is possible because multimodal models like GPT-4o can process audio input alongside text and images.
Two approaches to speech + AI
| Feature | Traditional Pipeline | Multimodal (GPT-4o) |
|---|---|---|
| How it works | Speech → Text → LLM → Text → Speech (3 separate steps) | Speech → GPT-4o → Response (direct audio processing) |
| Services needed | Azure Speech + Azure OpenAI (separate services) | GPT-4o multimodal (one model) |
| Latency | Higher (multiple API calls) | Lower (single call) |
| Nuance | Loses tone, emphasis, and emotion in transcription | Preserves audio nuance, so it can pick up tone and intent |
| Best for | When you need the transcript AND the AI response | When you want natural, conversational AI interaction |
Using GPT-4o with audio input
import base64
from openai import AzureOpenAI  # pip install openai

# Placeholder credentials: substitute your own endpoint, key, and API version
client = AzureOpenAI(
    azure_endpoint="<your-endpoint>",
    api_key="<your-key>",
    api_version="<api-version>",
)

# Read and base64-encode the audio file
with open("question.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt4o-deployment",  # your audio-capable deployment name
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Respond naturally to spoken questions."},
        {"role": "user", "content": [
            {"type": "input_audio", "input_audio": {"data": audio_data, "format": "wav"}},
        ]},
    ],
)
print(response.choices[0].message.content)
What's happening:
- The audio file is encoded as base64 and sent directly to GPT-4o
- The model processes the audio natively, with no separate speech-to-text step
- The response is text (or can be audio in supported configurations)
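The "can be audio in supported configurations" case can be sketched as follows. This is a minimal sketch, assuming an audio-capable deployment and the `openai` Python SDK: the `modalities` and `audio` parameters follow the Chat Completions audio preview, and `save_audio_reply` is a hypothetical helper, not part of any SDK.

```python
import base64


def save_audio_reply(b64_audio: str, path: str) -> int:
    """Decode the base64 audio the model returns and write it to disk.
    Returns the number of bytes written."""
    audio_bytes = base64.b64decode(b64_audio)
    with open(path, "wb") as f:
        f.write(audio_bytes)
    return len(audio_bytes)


def request_spoken_reply(client, deployment: str, audio_b64: str):
    """Ask for a spoken (audio) reply instead of plain text.
    `client` is an already-configured AzureOpenAI client; `deployment`
    must point at an audio-capable model deployment (an assumption)."""
    return client.chat.completions.create(
        model=deployment,
        modalities=["text", "audio"],          # request audio output too
        audio={"voice": "alloy", "format": "wav"},
        messages=[{
            "role": "user",
            "content": [{
                "type": "input_audio",
                "input_audio": {"data": audio_b64, "format": "wav"},
            }],
        }],
    )
```

In this configuration the spoken reply arrives as base64 audio on the response message, which `save_audio_reply` can then write out as a playable WAV file.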
DataFlow Corp scenario: DataFlow builds a voice-enabled analytics dashboard. Managers speak queries like "What were our top-selling products last quarter?" GPT-4o understands the spoken question, queries the data, and responds with the answer.
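One common way to wire up the "queries the data" step is tool calling: pass a function schema in the `tools` parameter so the model can ask for data before answering. A minimal sketch; the function name `get_top_products` and its parameters are illustrative, not a real DataFlow API.

```python
def sales_tool_schema() -> dict:
    """Chat Completions tool definition the model can invoke when a spoken
    query needs live data. Hypothetical function name and parameters."""
    return {
        "type": "function",
        "function": {
            "name": "get_top_products",
            "description": "Return the top-selling products for a time period.",
            "parameters": {
                "type": "object",
                "properties": {
                    "period": {
                        "type": "string",
                        "description": "e.g. 'last quarter'",
                    },
                    "limit": {
                        "type": "integer",
                        "description": "How many products to return.",
                    },
                },
                "required": ["period"],
            },
        },
    }
```

The schema would be passed as `tools=[sales_tool_schema()]` alongside the audio message; if the model decides it needs data, it returns a tool call for your code to execute before the final spoken or written answer.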
When to use traditional pipeline vs multimodal
Use the traditional pipeline (Speech + LLM) when:
- You need the transcript for records or compliance
- You need custom speech recognition (industry terms, accents)
- You need speech translation between languages
- Budget is tight (dedicated speech service can be cheaper)
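When the transcript itself matters, the two-service pipeline looks roughly like this. A minimal sketch assuming the `azure-cognitiveservices-speech` package; `AZURE_SPEECH_KEY` and `AZURE_REGION` are placeholders, not real values.

```python
def transcribe(wav_path: str) -> str:
    """Step 1: speech-to-text with the Azure Speech SDK
    (pip install azure-cognitiveservices-speech)."""
    import azure.cognitiveservices.speech as speechsdk

    config = speechsdk.SpeechConfig(
        subscription="AZURE_SPEECH_KEY",  # placeholder
        region="AZURE_REGION",            # placeholder
    )
    audio = speechsdk.audio.AudioConfig(filename=wav_path)
    recognizer = speechsdk.SpeechRecognizer(speech_config=config, audio_config=audio)
    # recognize_once() handles a single utterance; the transcript can be
    # saved to records before the LLM ever sees it
    return recognizer.recognize_once().text


def build_messages(transcript: str) -> list:
    """Step 2: the transcript becomes an ordinary text prompt for the LLM."""
    return [
        {"role": "system", "content": "Summarise the dictated note."},
        {"role": "user", "content": transcript},
    ]
```

Because the transcript exists as plain text between the two steps, it can be logged for compliance, corrected, or translated before the LLM call, which is exactly the flexibility the multimodal path gives up.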
Use multimodal (GPT-4o) when:
- You want the simplest possible architecture
- Tone and emotional context matter for the response
- Low latency is critical
- Youβre already using GPT-4o for other modalities (text, images)
🎬 Video walkthrough
🎬 Video coming soon
Multimodal Speech – AI-901 Module 18
~12 min
Knowledge Check
MediSpark's doctors want to dictate clinical notes and have AI summarise them. They also need the raw transcript saved to patient records. Which approach should they use?
DataFlow Corp wants a voice-enabled dashboard where managers ask spoken questions and get instant answers. Tone of voice should influence the response style. What's the best approach?
Next up: Azure Speech in Foundry Tools – building apps with dedicated speech services.