AI-103 Study Guide

Domain 1: Plan and Manage an Azure AI Solution

  • Choosing the Right AI Model Free
  • Foundry Services: Your AI Toolkit Free
  • Retrieval, Indexing & Agent Memory
  • Designing AI Infrastructure
  • Deploying Models & CI/CD
  • Quotas, Scaling & Cost
  • Monitoring & Security
  • Responsible AI: Filters, Auditing & Governance

Domain 2: Implement Generative AI and Agentic Solutions

  • Connecting Your App to Foundry Free
  • Building RAG Applications
  • Workflows & Reasoning Pipelines
  • Evaluating AI Models & Apps
  • Agent Fundamentals: Roles, Goals & Tools Free
  • Building Agents with Retrieval & Memory
  • Agent Tools & Knowledge Integration
  • Multi-Agent Orchestration & Safeguards
  • Agent Monitoring & Error Analysis
  • Prompt Engineering & Model Tuning
  • Observability & Production Operations

Domain 3: Implement Computer Vision Solutions

  • Image & Video Generation
  • Multimodal Visual Understanding
  • Responsible AI for Visual Content

Domain 4: Implement Text Analysis Solutions

  • Text Analysis with Language Models
  • Speech, Translation & Voice Agents

Domain 5: Implement Information Extraction Solutions

  • Ingestion, Indexing & Grounding Pipelines
  • Extracting Content with Content Understanding
  • Exam Prep: Putting It All Together

Domain 3: Implement Computer Vision Solutions (~14 min read)

Multimodal Visual Understanding

AI that can see. Learn how to build solutions that analyse images, generate captions, answer questions grounded in visual evidence, and process video using Content Understanding and multimodal models.

Teaching AI to see

☕ Simple explanation

Visual understanding is like giving AI eyes — it can look at photos and videos and tell you what’s in them, answer questions about what it sees, and describe images for people who can’t see them.

Two approaches: (1) Multimodal models like GPT-4o that can “look” at images alongside text, and (2) Content Understanding pipelines that extract structured data from visual content.

Visual understanding in the AI-103 exam covers two complementary technologies:

  • Multimodal models (GPT-4o, Llama 4) — general-purpose image understanding, captioning, visual Q&A, and reasoning about visual content
  • Content Understanding pipelines — structured extraction of visual characteristics using configurable single-task or pro-mode pipelines

Use multimodal models for open-ended reasoning about images. Use Content Understanding for structured, repeatable extraction tasks.

Visual understanding capabilities

| Capability | Approach | Use case |
| --- | --- | --- |
| Image captioning | Multimodal model | "A team of doctors reviewing patient charts in a modern hospital" |
| Visual Q&A | Multimodal model | "How many people are in this image?" → "Five" |
| Alt-text generation | Multimodal model + accessibility guidelines | Screen reader descriptions for web images |
| Object detection | Multimodal model or Azure AI Vision | Identify and locate specific objects within images |
| Video analysis | Content Understanding + multimodal model | Process video segments for events, objects, and actions |
| Visual characteristic extraction | Content Understanding | Extract colours, textures, and dimensions from product photos |

Multimodal models for visual understanding

Caption types and visual Q&A

| Feature | Concise caption | Detailed caption | Visual Q&A |
| --- | --- | --- | --- |
| Output | One sentence describing the image | Multi-sentence rich description | Direct answer to a specific question |
| Prompt | "Describe this image briefly" | "Provide a detailed description of everything in this image" | "What colour is the car in the foreground?" |
| Use case | Social media alt-text | Accessibility, detailed documentation | Grounded question answering in visual context |
| Token cost | Low | Medium | Low (per question) |

Content Understanding for vision

Content Understanding provides structured visual extraction using analyzers with custom schemas:

| Pipeline mode | What it does | Best for |
| --- | --- | --- |
| Single-task | One extraction task per pipeline | Simple, high-volume tasks (OCR, field extraction) |
| Pro mode | Multiple capabilities combined | Complex analysis (OCR + layout + field extraction in one pass) |

| Capability | What it extracts |
| --- | --- |
| OCR | Printed and handwritten text from images |
| Layout analysis | Tables, headings, and structure within visual documents |
| Field extraction | Named fields from visual content via custom analyzer schemas |
| Visual characteristics | Colours, textures, and dimensions from images via custom schema |
| Text in images | Any text visible in the image (signs, labels, documents) |

Note: Object detection (bounding boxes), scene classification, and face detection are capabilities of multimodal models (GPT-4o) or legacy Azure AI Vision — not Content Understanding. The exam tests this distinction.
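
A custom schema is what turns Content Understanding from a generic extractor into a structured pipeline. The sketch below builds a request for creating an image analyzer with a visual-characteristics field schema; the endpoint path, `api-version`, base analyzer id, and field names are illustrative assumptions, so check the current Content Understanding REST reference before relying on them:

```python
# Sketch: defining a Content Understanding analyzer with a custom field
# schema for visual characteristics. Path, api-version, baseAnalyzerId,
# and field names are assumptions for illustration only.

API_VERSION = "2024-12-01-preview"  # assumed preview version

def build_analyzer_request(endpoint: str, analyzer_id: str) -> tuple[str, dict]:
    """Return the (PUT url, JSON body) for creating a custom image analyzer."""
    url = (f"{endpoint}/contentunderstanding/analyzers/{analyzer_id}"
           f"?api-version={API_VERSION}")
    body = {
        "description": "Extract visual characteristics from product photos",
        "baseAnalyzerId": "prebuilt-imageAnalyzer",  # assumed prebuilt base
        "fieldSchema": {
            "fields": {
                "primaryColour": {
                    "type": "string",
                    "description": "Dominant colour of the product",
                },
                "texture": {
                    "type": "string",
                    "description": "Surface texture, e.g. matte or glossy",
                },
            }
        },
    }
    return url, body
```

The PUT itself (authenticated with your resource key) can be sent with any HTTP client; each image analysed against the analyzer then comes back with `primaryColour` and `texture` populated, which is the "structured, repeatable extraction" role described above.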

ℹ️ Real-world example: NeuralMed's medical image analysis

NeuralMed uses both approaches for different visual tasks:

Multimodal model (GPT-4o):

  • Generates patient-friendly descriptions of medical diagrams
  • Answers doctor questions about X-ray images: “Is there any abnormality in the left lung?”
  • Creates accessibility alt-text for the patient portal

Content Understanding (pro mode pipeline):

  • Extracts structured data from lab report photos: test name, value, normal range, flag
  • Processes insurance card photos: member ID, group number, provider name
  • Detects and reads text in uploaded prescription images

Accessibility: alt-text and image descriptions

| Standard | Requirement | Implementation |
| --- | --- | --- |
| Concise alt-text | 1-2 sentences describing the image's purpose | "A bar chart showing quarterly revenue growth from Q1 to Q4 2025" |
| Extended description | Detailed account for complex images | Full description of chart data, trends, and key takeaways |
| WCAG compliance | Web Content Accessibility Guidelines | Alt-text for all informational images; decorative images marked as such |

💡 Exam tip: Alt-text vs caption

The exam distinguishes:

  • Alt-text = describes the image’s purpose for accessibility (screen readers)
  • Caption = describes what’s visible in the image for understanding

For a chart: alt-text says “Bar chart showing Q4 revenue is highest at $2.3M.” A caption might say “A blue and grey bar chart with four bars of increasing height.”
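
The alt-text/caption distinction comes down to the prompt you send the model. A small sketch with two illustrative prompt templates (the wording is mine, not an official guideline):

```python
# Sketch: prompt templates reflecting the alt-text vs caption distinction.
# Template wording is illustrative, not an official WCAG or exam phrasing.

ALT_TEXT_PROMPT = (
    "Write alt-text for this image for screen reader users. In 1-2 "
    "sentences, convey the image's purpose and key information (for a "
    "chart, state the main takeaway), not its visual styling."
)

CAPTION_PROMPT = (
    "Write a caption describing what is visible in this image: the "
    "objects, people, colours, and layout a sighted reader would see."
)

def pick_prompt(purpose: str) -> str:
    """Select a prompt by purpose: 'alt_text' or 'caption'."""
    prompts = {"alt_text": ALT_TEXT_PROMPT, "caption": CAPTION_PROMPT}
    if purpose not in prompts:
        raise ValueError(f"unknown purpose: {purpose!r}")
    return prompts[purpose]
```

Pairing either prompt with the image in a multimodal chat request yields the two different outputs from the chart example: the alt-text states the takeaway, the caption describes the appearance.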

Key terms

Question

What is visual Q&A (Visual Question Answering)?

Answer

Using a multimodal model to answer specific questions about an image. The model sees the image and the question, then generates an answer grounded in visual evidence. Example: 'How many people are in this photo?' → 'Three.'

Question

What is single-task vs pro-mode in Content Understanding?

Answer

Single-task mode runs one extraction capability per pipeline (e.g., just OCR). Pro mode combines multiple capabilities in one pipeline (e.g., OCR + layout analysis + field extraction in one pass). Pro mode is more powerful but uses more compute.

Question

What is alt-text in accessibility?

Answer

A text description of an image for screen readers and users who cannot see the image. Should describe the image's purpose, not just its appearance. Required by WCAG accessibility guidelines for all informational images.

Knowledge check

1. MediaForge needs to automatically generate alt-text for 10,000 marketing images on their client's website to meet WCAG accessibility standards. Which approach is most appropriate?

2. NeuralMed needs to process uploaded lab report photos to extract test names, values, and normal ranges into a structured database. Which approach is best?


© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.