Guided by A Guide to Cloud
AI-901 Study Guide

Domain 1: AI Concepts and Capabilities

  • What is AI? Your First 10 Minutes Free
  • Responsible AI: The Six Principles Free
  • How Generative AI Actually Works Free
  • Choosing the Right AI Model Free
  • Deploying AI Models: Options & Settings
  • AI Workloads at a Glance
  • Text Analysis: Keywords, Entities & Sentiment
  • Speech: Recognition & Synthesis
  • Computer Vision: Seeing the World
  • Image Generation: Creating with AI
  • Information Extraction: From Chaos to Structure

Domain 2: Implement AI Solutions Using Foundry

  • Prompting Fundamentals: System & User Prompts
  • Microsoft Foundry: Your AI Command Center Free
  • Building a Chat App with the Foundry SDK
  • Agents in Foundry: Create & Test
  • Building an Agent Client App
  • Building a Text Analysis App
  • Multimodal: Responding to Speech
  • Azure Speech in Foundry Tools
  • Visual Prompts: Images as Input
  • Generating Images with AI
  • Building a Vision App
  • Content Understanding: Documents & Forms
  • Multimodal Extraction: Images, Audio & Video
  • Building an Extraction App
  • Exam Prep: Putting It All Together

Domain 2: Implement AI Solutions Using Foundry Premium ⏱ ~12 min read

Multimodal Extraction: Images, Audio & Video

Content Understanding doesn't stop at documents. It can extract structured data from images, audio recordings, and video, turning any media into searchable, structured information.

Beyond documents

☕ Simple explanation

Content Understanding can extract data from many kinds of media, not just paper documents.

Images: Photograph a product label → extract ingredients, nutrition info, expiry date. Photograph a whiteboard → extract the diagram and text.

Audio: Record a meeting → extract action items, decisions, speaker names. Record a customer call → extract account number, issue type, resolution.

Video: Film a training session → extract slide content, key topics, timestamps. Record a presentation → extract each slide's text and speaker notes.

Azure Content Understanding provides multimodal extraction across images, audio, and video. For images, it combines OCR and object recognition to extract structured fields. For audio, it uses speech recognition with semantic understanding. For video, it combines visual analysis, OCR, speech transcription, and scene detection to extract comprehensive structured data.

Image extraction

Beyond documents, Content Understanding processes photographs and images:

Image Type | What's Extracted
Product labels | Brand, ingredients, nutrition facts, warnings, barcodes
Whiteboards | Handwritten text, diagrams, sketches
Screenshots | UI text, form data, error messages
Signs and posters | Title, body text, contact info
Retail shelves | Product names, prices, positions

GreenLeaf scenario: GreenLeaf photographs product labels on incoming seed packages. Content Understanding extracts the seed variety, planting instructions, expiry date, and lot number, automatically populating their inventory system.
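
The last step of the GreenLeaf scenario, turning extracted fields into an inventory entry, can be sketched in plain Python. This is a minimal illustration and not the Content Understanding API: the field names, the `InventoryRecord` class, and the `to_inventory_record` helper are all hypothetical, and the input dictionary stands in for whatever structured result an image analyzer returns.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical field names; a real analyzer schema would define its own.
REQUIRED_FIELDS = ("seed_variety", "expiry_date", "lot_number")


@dataclass
class InventoryRecord:
    seed_variety: str
    expiry_date: str
    lot_number: str
    planting_instructions: Optional[str] = None


def to_inventory_record(extracted: dict) -> InventoryRecord:
    """Validate extracted label fields and build an inventory record."""
    missing = [name for name in REQUIRED_FIELDS if not extracted.get(name)]
    if missing:
        raise ValueError(f"label is missing fields: {', '.join(missing)}")
    return InventoryRecord(
        seed_variety=extracted["seed_variety"].strip(),
        expiry_date=extracted["expiry_date"].strip(),
        lot_number=extracted["lot_number"].strip().upper(),  # normalise lot codes
        planting_instructions=extracted.get("planting_instructions"),
    )


# Illustrative values for one photographed seed packet.
sample = {"seed_variety": "Tomato Seeds", "expiry_date": "Dec 2026", "lot_number": "a4521"}
record = to_inventory_record(sample)
```

Rejecting a label with missing required fields, rather than storing a partial record, keeps unreadable photos out of the inventory so they can be re-photographed or checked by hand.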

Audio extraction

Content Understanding processes audio recordings to extract structured information:

Audio Source | What's Extracted
Meetings | Key topics, action items, decisions, speakers
Customer calls | Account info, issue category, sentiment, resolution
Interviews | Questions asked, responses, key quotes
Voicemails | Caller name, callback number, purpose

The process:

  1. Speech recognition: transcribes the audio
  2. Speaker diarisation: identifies who said what
  3. Semantic extraction: pulls out structured fields (topics, actions, entities)
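
To make the three steps concrete, here is a toy sketch of the final one. It assumes steps 1 and 2 have already produced a diarised transcript, represented by a hypothetical `Segment` class, and it uses a simple keyword rule (`ACTION_CUES`) as a stand-in for the model-based semantic extraction the service performs. None of these names come from the Content Understanding SDK.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One diarised transcript segment (the output of steps 1 and 2)."""
    speaker: str
    start_seconds: float
    text: str


# Toy cue phrases; a real service infers actions with a language model.
ACTION_CUES = ("will ", "needs to ", "action:")


def extract_action_items(segments):
    """Return segments that look like commitments, with speaker and time."""
    items = []
    for seg in segments:
        lowered = seg.text.lower()
        if any(cue in lowered for cue in ACTION_CUES):
            items.append(
                {"speaker": seg.speaker, "at": seg.start_seconds, "action": seg.text}
            )
    return items


def summarise(segments):
    """Step 3: turn a transcript into structured fields."""
    return {
        "speakers": sorted({seg.speaker for seg in segments}),
        "action_items": extract_action_items(segments),
    }


meeting = [
    Segment("Speaker 1", 0.0, "The customer reported a billing issue."),
    Segment("Speaker 2", 6.5, "I will process the refund today."),
    Segment("Speaker 1", 12.0, "Thanks, that resolves it."),
]
result = summarise(meeting)
```

In this sample, only the second segment is flagged as an action item, and it carries its speaker label and timestamp. Without diarisation there would be no speaker to attach, so the structured output depends on it.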

DataFlow Corp scenario: DataFlow records 10,000 customer support calls daily. Content Understanding extracts: customer account number (spoken), issue category, steps the agent took, resolution status, and customer satisfaction (inferred from tone).

Video extraction

Video combines visual AND audio extraction:

Video Source | What's Extracted
Training videos | Slide text, spoken content, key topics, timestamps
Security footage | Events, movements, anomalies, timestamps
Presentations | Slide content, speaker narrative, Q&A sections
Product demos | Feature descriptions, UI text, spoken explanations

The process:

  1. Scene detection: identifies key moments and transitions
  2. Slide extraction: captures on-screen text and slides
  3. Speech transcription: transcribes spoken content
  4. Semantic synthesis: combines visual and audio into structured output
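
The synthesis step can be pictured as aligning two timelines. The sketch below assumes the earlier steps have produced a list of slides with start times and a list of transcribed utterances with start times, then attaches each utterance to the slide that was on screen when it was spoken. The `synthesise` function and the tuple layout are illustrative assumptions, not the service's actual output format.

```python
from bisect import bisect_right


def synthesise(slides, speech):
    """Attach each spoken segment to the slide on screen at that moment.

    slides: list of (start_seconds, slide_text), sorted by start time.
    speech: list of (start_seconds, spoken_text).
    """
    starts = [start for start, _ in slides]
    output = [
        {"start": start, "slide_text": text, "narration": []}
        for start, text in slides
    ]
    for spoken_at, spoken_text in speech:
        # Index of the last slide that started at or before this utterance.
        index = bisect_right(starts, spoken_at) - 1
        if index >= 0:  # speech before the first slide is skipped
            output[index]["narration"].append(spoken_text)
    return output


slides = [(0.0, "Agenda"), (45.0, "Q2 Revenue: $4.2M")]
speech = [
    (3.0, "Welcome everyone."),
    (47.5, "We exceeded targets by 15%."),
    (60.0, "Any questions?"),
]
timeline = synthesise(slides, speech)
```

Each entry in `timeline` pairs on-screen text with the narration spoken over it, which is the kind of combined record that lets a search land on a specific moment in a video.
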

ℹ️ Multimodal extraction = RAG gold mine

Multimodal extraction is a strong foundation for RAG (Retrieval-Augmented Generation) systems because it turns each kind of media into text that can be indexed and searched:

  • Extract text from all company documents → searchable
  • Transcribe all meeting recordings → searchable
  • Extract slides from all training videos → searchable
  • Extract data from product images → searchable

Now your AI agent can search across all company knowledge (documents, meetings, videos, images) from a single query.

Exam relevance: Understanding how Content Understanding feeds into RAG systems connects information extraction to generative AI and agents.
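
A tiny in-memory index shows the retrieval half of this idea. Once each modality has been reduced to text, one query can rank results from all of them. This is a sketch under loud assumptions: the `KnowledgeIndex` class, the file names, and the term-count scoring are all hypothetical, and a production RAG system would typically use a search service with vector or hybrid ranking instead.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split text into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class KnowledgeIndex:
    """Toy keyword index over text extracted from any modality."""

    def __init__(self):
        self.entries = []  # (source, modality, token counts, original text)

    def add(self, source, modality, text):
        self.entries.append((source, modality, Counter(tokenize(text)), text))

    def search(self, query, top_k=3):
        terms = tokenize(query)
        scored = []
        for source, modality, counts, text in self.entries:
            score = sum(counts[term] for term in terms)
            if score > 0:
                scored.append((score, source, modality, text))
        # Highest score first; ties broken by source name for stable output.
        scored.sort(key=lambda row: (-row[0], row[1]))
        return [(source, modality, text) for _, source, modality, text in scored[:top_k]]


index = KnowledgeIndex()
index.add("invoice-001.pdf", "document", "Vendor GreenLeaf total 3400 dated 15 May 2026")
index.add("support-call-17.wav", "audio", "Customer reported billing issue, refund processed")
index.add("q2-review.mp4", "video", "Slide 3 Q2 revenue 4.2M, we exceeded targets by 15%")
index.add("label-a4521.jpg", "image", "Tomato Seeds expiry Dec 2026 lot A4521")

hits = index.search("refund for billing issue")
```

The query matches only the call recording's extracted text, even though the index also holds a document, a video, and an image. The retrieved text would then be passed to a generative model as grounding context.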

Comparing extraction across modalities

Content Understanding across four modalities

Modality | Key Technique | Output Example
📄 Documents | OCR + layout analysis + field mapping | Vendor: GreenLeaf, Total: $3,400, Date: 15 May 2026
🖼️ Images | OCR + object recognition + field mapping | Product: Tomato Seeds, Expiry: Dec 2026, Lot: A4521
🎙️ Audio | Speech recognition + diarisation + semantic extraction | Speaker 1: reported billing issue, Action: refund processed
🎬 Video | Scene detection + OCR + speech + semantic synthesis | Slide 3: 'Q2 Revenue: $4.2M', Speaker: 'We exceeded targets by 15%'

Flashcards

Question

What three types of media can Content Understanding extract data from (beyond documents)?

Answer

Images (product labels, whiteboards, screenshots), Audio (meetings, calls, voicemails), and Video (training, presentations, demos). Each uses different techniques but produces structured data.

Question

How does Content Understanding process video?

Answer

Four steps: 1) Scene detection (key moments), 2) Slide extraction (on-screen text), 3) Speech transcription (spoken content), 4) Semantic synthesis (combines visual + audio into structured output).

Question

How does multimodal extraction support RAG systems?

Answer

By making all company knowledge searchable (documents, meeting recordings, training videos, product images), multimodal extraction lets a single query find relevant information across all modalities. This powers comprehensive AI agents and chatbots.

Knowledge Check

DataFlow Corp wants to extract action items and decisions from their weekly team meeting recordings. Which Content Understanding capability handles this?

MediSpark wants to make their entire training video library searchable. Doctors should be able to type a question and find the exact video moment that answers it. What's the best approach?


Next up: Building an Extraction App, which puts Content Understanding into a complete application.

© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.