Multimodal Visual Understanding
AI that can see. Learn how to build solutions that analyse images, generate captions, answer questions grounded in visual evidence, and process video using Content Understanding and multimodal models.
Teaching AI to see
Visual understanding is like giving AI eyes — it can look at photos and videos and tell you what’s in them, answer questions about what it sees, and describe images for people who can’t see them.
Two approaches: (1) Multimodal models like GPT-4o that can “look” at images alongside text, and (2) Content Understanding pipelines that extract structured data from visual content.
Visual understanding capabilities
| Capability | Approach | Use Case |
|---|---|---|
| Image captioning | Multimodal model | “A team of doctors reviewing patient charts in a modern hospital” |
| Visual Q&A | Multimodal model | “How many people are in this image?” → “Five” |
| Alt-text generation | Multimodal model + accessibility guidelines | Screen reader descriptions for web images |
| Object detection | Content Understanding | Identify and locate specific objects within images |
| Video analysis | Content Understanding + multimodal model | Process video segments for events, objects, actions |
| Visual characteristic extraction | Content Understanding | Extract colours, textures, dimensions from product photos |
Multimodal models for visual understanding
| Feature | Concise Caption | Detailed Caption | Visual Q&A |
|---|---|---|---|
| Output | One sentence describing the image | Multi-sentence rich description | Direct answer to a specific question |
| Prompt | 'Describe this image briefly' | 'Provide a detailed description of everything in this image' | 'What colour is the car in the foreground?' |
| Use case | Social media alt-text | Accessibility, detailed documentation | Grounded question-answering in visual context |
| Token cost | Low | Medium | Low (per question) |
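The three prompt styles in the table above differ only in the text part of the request. As a minimal sketch, the snippet below builds a user message in the multi-part content format used by OpenAI-style chat APIs, pairing a prompt with an image URL. The helper name and the example URLs are illustrative, not part of any SDK.

```python
def build_vision_message(prompt: str, image_url: str) -> dict:
    """Build one user message pairing a text prompt with an image,
    in the multi-part content format used by OpenAI-style chat APIs."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

# Concise caption vs. visual Q&A differ only in the prompt text:
caption_msg = build_vision_message(
    "Describe this image briefly", "https://example.com/team.jpg")
vqa_msg = build_vision_message(
    "What colour is the car in the foreground?", "https://example.com/street.jpg")
```

The resulting message would then be passed in the `messages` list of a chat completion call against a GPT-4o deployment (e.g. `client.chat.completions.create(model="gpt-4o", messages=[caption_msg])`); deployment names and client setup vary by environment.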
Content Understanding for vision
Content Understanding provides structured visual extraction using analyzers with custom schemas:
| Pipeline Mode | What It Does | Best For |
|---|---|---|
| Single-task | One extraction task per pipeline | Simple, high-volume tasks (OCR, field extraction) |
| Pro mode | Multiple capabilities combined | Complex analysis (OCR + layout + field extraction in one pass) |
Within a pipeline, an analyzer can apply these capabilities:
| Capability | What It Extracts |
|---|---|
| OCR | Printed and handwritten text from images |
| Layout analysis | Tables, headings, structure within visual documents |
| Field extraction | Named fields from visual content via custom analyzer schemas |
| Visual characteristics | Colours, textures, dimensions from images via custom schema |
| Text in images | Any text visible in the image (signs, labels, documents) |
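Field extraction and visual characteristics both rely on a custom analyzer schema that names the fields to pull out. The sketch below shows what such a schema might look like for product photos; the field names and the exact schema shape are assumptions for illustration, so check the current Content Understanding reference for the real analyzer definition format.

```python
import json

# Illustrative custom analyzer definition for extracting visual
# characteristics from product photos (hypothetical field names).
product_photo_analyzer = {
    "description": "Extract visual characteristics from product photos",
    "fieldSchema": {
        "fields": {
            "PrimaryColour": {
                "type": "string",
                "description": "Dominant colour of the product"},
            "Texture": {
                "type": "string",
                "description": "Surface texture, e.g. matte or glossy"},
            "VisibleText": {
                "type": "string",
                "description": "Any text printed on the product or packaging"},
        }
    },
}

print(json.dumps(product_photo_analyzer, indent=2))
```

Each field carries a natural-language description, which is what guides the analyzer's extraction rather than any hand-written parsing rules.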
Note: Object detection (bounding boxes), scene classification, and face detection are capabilities of multimodal models (GPT-4o) or legacy Azure AI Vision — not Content Understanding. The exam tests this distinction.
Real-world example: NeuralMed's medical image analysis
NeuralMed uses both approaches for different visual tasks:
Multimodal model (GPT-4o):
- Generates patient-friendly descriptions of medical diagrams
- Answers doctor questions about X-ray images: “Is there any abnormality in the left lung?”
- Creates accessibility alt-text for the patient portal
Content Understanding (pro mode pipeline):
- Extracts structured data from lab report photos: test name, value, normal range, flag
- Processes insurance card photos: member ID, group number, provider name
- Detects and reads text in uploaded prescription images
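A pipeline like NeuralMed's is driven over REST: the client submits an image to a named analyzer, then polls for the structured result. The sketch below only composes the request URL and headers; the path segments, API version, and header names are assumptions modelled on common Azure REST conventions, not a verified Content Understanding contract.

```python
def analyze_request(endpoint: str, analyzer_id: str, api_version: str) -> tuple[str, dict]:
    """Compose the URL and headers for submitting an image to a custom
    analyzer. Path and header names are illustrative assumptions; consult
    the current Content Understanding REST reference for the real shapes."""
    url = (f"{endpoint.rstrip('/')}/contentunderstanding/analyzers/"
           f"{analyzer_id}:analyze?api-version={api_version}")
    headers = {
        "Ocp-Apim-Subscription-Key": "<your-key>",  # placeholder credential
        "Content-Type": "application/json",
    }
    return url, headers

url, headers = analyze_request(
    "https://neuralmed.cognitiveservices.azure.com",  # hypothetical resource
    "lab-report-analyzer",                            # hypothetical analyzer ID
    "2024-12-01-preview")                             # assumed API version
```

In a real client, a POST to this URL would typically return an operation location to poll until the extracted fields (test name, value, normal range, flag) are ready.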
Accessibility: alt-text and image descriptions
| Standard | Requirement | Implementation |
|---|---|---|
| Concise alt-text | 1-2 sentences, describes the image purpose | “A bar chart showing quarterly revenue growth from Q1 to Q4 2025” |
| Extended description | Detailed account for complex images | Full description of chart data, trends, and key takeaways |
| WCAG compliance | Web Content Accessibility Guidelines | Alt-text for all informational images, decorative images marked as such |
Exam tip: Alt-text vs caption
The exam distinguishes:
- Alt-text = describes the image’s purpose for accessibility (screen readers)
- Caption = describes what’s visible in the image for understanding
For a chart: alt-text says “Bar chart showing Q4 revenue is highest at $2.3M.” A caption might say “A blue and grey bar chart with four bars of increasing height.”
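The distinction comes down to what you ask the model for. As a small sketch, the prompts below (wording is illustrative, not prescribed) steer a multimodal model toward alt-text (the image's purpose) versus a caption (its visible appearance):

```python
# Two prompts for the same chart image, illustrating the
# alt-text vs caption distinction. Wording is illustrative.
PROMPTS = {
    # Alt-text: convey the image's *purpose* (the information it carries)
    "alt_text": ("Write 1-2 sentences of alt-text conveying the key "
                 "information this chart communicates, for a screen reader."),
    # Caption: describe what is *visible* in the image
    "caption": ("Describe the visual appearance of this image: layout, "
                "colours, and shapes."),
}

def pick_prompt(for_accessibility: bool) -> str:
    """Choose the prompt matching the intended output."""
    return PROMPTS["alt_text" if for_accessibility else "caption"]
```

For the chart example above, the first prompt should yield something like "Bar chart showing Q4 revenue is highest at $2.3M", the second something like "A blue and grey bar chart with four bars of increasing height".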
Knowledge check
MediaForge needs to automatically generate alt-text for 10,000 marketing images on their client's website to meet WCAG accessibility standards. Which approach is most appropriate?
NeuralMed needs to process uploaded lab report photos to extract test names, values, and normal ranges into a structured database. Which approach is best?