
AI-901 Study Guide

Domain 1: AI Concepts and Capabilities

  • What is AI? Your First 10 Minutes Free
  • Responsible AI: The Six Principles Free
  • How Generative AI Actually Works Free
  • Choosing the Right AI Model Free
  • Deploying AI Models: Options & Settings
  • AI Workloads at a Glance
  • Text Analysis: Keywords, Entities & Sentiment
  • Speech: Recognition & Synthesis
  • Computer Vision: Seeing the World
  • Image Generation: Creating with AI
  • Information Extraction: From Chaos to Structure

Domain 2: Implement AI Solutions Using Foundry

  • Prompting Fundamentals: System & User Prompts
  • Microsoft Foundry: Your AI Command Center Free
  • Building a Chat App with the Foundry SDK
  • Agents in Foundry: Create & Test
  • Building an Agent Client App
  • Building a Text Analysis App
  • Multimodal: Responding to Speech
  • Azure Speech in Foundry Tools
  • Visual Prompts: Images as Input
  • Generating Images with AI
  • Building a Vision App
  • Content Understanding: Documents & Forms
  • Multimodal Extraction: Images, Audio & Video
  • Building an Extraction App
  • Exam Prep: Putting It All Together

Domain 2: Implement AI Solutions Using Foundry Β· Premium Β· ~12 min read

Multimodal: Responding to Speech

Modern AI models can hear you. Learn how to send spoken prompts to a multimodal model and get intelligent responses β€” combining speech recognition with AI reasoning.

AI that listens and responds

β˜• Simple explanation

Instead of typing your question, you can speak it β€” and the AI responds intelligently.

Think of it like talking to a voice assistant, but much smarter. You say: β€œLook at this chart and tell me what the trend is.” The AI hears your voice, understands the question, looks at the chart, and gives you a thoughtful answer β€” all in one step.

This is possible because multimodal models like GPT-4o can process audio input alongside text and images.

Multimodal models like GPT-4o natively support audio input, enabling direct speech-to-AI interaction without a separate speech-to-text step. The model processes the audio waveform directly, understanding not just the words but also tone, emphasis, and speaking patterns.

This differs from the traditional pipeline (Speech-to-text β†’ LLM β†’ Text-to-speech) by processing audio as a first-class input modality.
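To make the contrast concrete, here is a minimal sketch of the two request shapes. The helper names and the example transcript are illustrative; the multimodal shape uses the same `input_audio` content item that appears in the full example later in this module.

```python
import base64

def traditional_message(transcript: str) -> dict:
    # Pipeline style: a separate speech-to-text service has already
    # transcribed the audio, so the LLM only ever sees plain text.
    return {"role": "user", "content": transcript}

def multimodal_message(audio_bytes: bytes) -> dict:
    # Multimodal style: the raw audio is base64-encoded and sent as an
    # input_audio content item, so the model hears tone and emphasis.
    encoded = base64.b64encode(audio_bytes).decode()
    return {
        "role": "user",
        "content": [
            {"type": "input_audio",
             "input_audio": {"data": encoded, "format": "wav"}}
        ],
    }

print(traditional_message("What is the trend in this chart?"))
print(multimodal_message(b"...wav bytes...")["content"][0]["type"])
```

The traditional message has lost everything except the words; the multimodal message carries the original waveform through to the model.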

Two approaches to speech + AI

Traditional speech pipeline vs multimodal audio
Feature         | Traditional Pipeline                                    | Multimodal (GPT-4o)
How it works    | Speech β†’ Text β†’ LLM β†’ Text β†’ Speech (3 separate steps)  | Speech β†’ GPT-4o β†’ Response (direct audio processing)
Services needed | Azure Speech + Azure OpenAI (separate services)         | GPT-4o multimodal (one model)
Latency         | Higher (multiple API calls)                             | Lower (single call)
Nuance          | Loses tone, emphasis, emotion in transcription          | Preserves audio nuance β€” can understand tone and intent
Best for        | When you need the transcript AND the AI response        | When you want natural, conversational AI interaction

Using GPT-4o with audio input

import base64
import os

from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential

# Create the chat client β€” endpoint and key come from your Foundry project
# (environment variable names here are illustrative)
chat = ChatCompletionsClient(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_INFERENCE_KEY"]),
)

# Read the audio file and base64-encode it for transport
with open("question.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode()

response = chat.complete(
    model="gpt4o-deployment",  # your GPT-4o deployment name
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Respond naturally to spoken questions."},
        {"role": "user", "content": [
            {"type": "input_audio", "input_audio": {"data": audio_data, "format": "wav"}}
        ]}
    ]
)

print(response.choices[0].message.content)

What’s happening:

  • The audio file is encoded as base64 and sent directly to GPT-4o
  • The model processes the audio natively β€” no separate speech-to-text step
  • The response is text (or can be audio in supported configurations)
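As a standalone sanity check on the encoding step, the snippet below builds a short silent WAV clip in memory with Python's `wave` module, encodes it the same way as above, and verifies the roundtrip. The clip length and sample rate are arbitrary.

```python
import base64
import io
import wave

# Build a 0.1-second silent mono WAV clip in memory (16 kHz, 16-bit PCM)
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)       # 2 bytes per sample
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 1600)  # 1600 frames of silence
wav_bytes = buf.getvalue()

# Encode for the request payload, then verify it decodes back losslessly
audio_data = base64.b64encode(wav_bytes).decode()
assert base64.b64decode(audio_data) == wav_bytes

print(f"{len(wav_bytes)} WAV bytes -> {len(audio_data)} base64 characters")
```

Note that base64 inflates the payload by about a third β€” worth remembering when sending long recordings.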

DataFlow Corp scenario: DataFlow builds a voice-enabled analytics dashboard. Managers speak queries like β€œWhat were our top-selling products last quarter?” GPT-4o understands the spoken question, queries the data, and responds with the answer.

πŸ’‘ When to use traditional pipeline vs multimodal

Use the traditional pipeline (Speech + LLM) when:

  • You need the transcript for records or compliance
  • You need custom speech recognition (industry terms, accents)
  • You need speech translation between languages
  • Budget is tight (dedicated speech service can be cheaper)

Use multimodal (GPT-4o) when:

  • You want the simplest possible architecture
  • Tone and emotional context matter for the response
  • Low latency is critical
  • You’re already using GPT-4o for other modalities (text, images)
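The decision points above can be condensed into a small helper. The flag names are made up for illustration and simply mirror the bullets: any requirement that only the dedicated speech service satisfies tips the choice toward the traditional pipeline.

```python
def choose_speech_approach(need_transcript: bool = False,
                           custom_vocabulary: bool = False,
                           need_translation: bool = False,
                           tight_budget: bool = False) -> str:
    # Any pipeline-only requirement wins; otherwise prefer the
    # simpler, lower-latency single-call multimodal route.
    if need_transcript or custom_vocabulary or need_translation or tight_budget:
        return "traditional pipeline (Azure Speech + LLM)"
    return "multimodal (GPT-4o direct audio)"

# MediSpark must save the raw transcript to patient records:
print(choose_speech_approach(need_transcript=True))
# DataFlow just wants low-latency voice queries where tone matters:
print(choose_speech_approach())
```

This is the same reasoning the knowledge-check scenarios below exercise.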

🎬 Video walkthrough

Video coming soon: Multimodal Speech β€” AI-901 Module 18 (~12 min)

Flashcards

Question

How can GPT-4o process spoken questions?


Answer

GPT-4o is multimodal β€” it can accept audio as a direct input modality. The audio is encoded as base64 and sent in the messages array. The model processes the audio waveform directly without needing a separate speech-to-text service.


Question

What is the advantage of multimodal audio over a traditional speech pipeline?


Answer

Lower latency (single API call vs three), preserves audio nuance (tone, emphasis), and simpler architecture. The traditional pipeline requires separate Speech-to-text β†’ LLM β†’ Text-to-speech services.


Question

When should you use the traditional speech pipeline instead of multimodal?


Answer

When you need the transcript for records, need custom speech recognition for industry terms, need speech translation, or when budget is tight (dedicated speech service can be cheaper).


Knowledge Check

  • MediSpark’s doctors want to dictate clinical notes and have AI summarise them. They also need the raw transcript saved to patient records. Which approach should they use?
  • DataFlow Corp wants a voice-enabled dashboard where managers ask spoken questions and get instant answers. Tone of voice should influence the response style. What’s the best approach?


Next up: Azure Speech in Foundry Tools β€” building apps with dedicated speech services.



© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.