Multimodal Visual Understanding
AI that can see. Learn how to build solutions that analyse images, generate captions, answer questions grounded in visual evidence, and process video using Content Understanding and multimodal models.
Teaching AI to see
Visual understanding is like giving AI eyes — it can look at photos and videos and tell you what’s in them, answer questions about what it sees, and describe images for people who can’t see them.
Two approaches: (1) Multimodal models like GPT-4o that can “look” at images alongside text, and (2) Content Understanding pipelines that extract structured data from visual content.
Visual understanding capabilities
| Capability | Approach | Use Case |
|---|---|---|
| Image captioning | Multimodal model | “A team of doctors reviewing patient charts in a modern hospital” |
| Visual Q&A | Multimodal model | “How many people are in this image?” → “Five” |
| Alt-text generation | Multimodal model + accessibility guidelines | Screen reader descriptions for web images |
| Object detection | Content Understanding | Identify and locate specific objects within images |
| Video analysis | Content Understanding + multimodal model | Process video segments for events, objects, actions |
| Visual characteristic extraction | Content Understanding | Extract colours, textures, dimensions from product photos |
Multimodal models for visual understanding
| Feature | Concise Caption | Detailed Caption | Visual Q&A |
|---|---|---|---|
| Output | One sentence describing the image | Multi-sentence rich description | Direct answer to a specific question |
| Prompt | 'Describe this image briefly' | 'Provide a detailed description of everything in this image' | 'What colour is the car in the foreground?' |
| Use case | Social media alt-text | Accessibility, detailed documentation | Grounded question-answering in visual context |
| Token cost | Low | Medium | Low (per question) |
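The three prompt styles in the table above differ only in the text part of the request. As a minimal sketch, the snippet below builds a user message in the multi-part content format used by OpenAI-style chat APIs, pairing a prompt with an image URL. The helper name and the example URLs are illustrative, not part of any SDK.

```python
def build_vision_message(prompt: str, image_url: str) -> dict:
    """Build one user message pairing a text prompt with an image,
    in the multi-part content format used by OpenAI-style chat APIs."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

# Concise caption vs. visual Q&A differ only in the prompt text:
caption_msg = build_vision_message(
    "Describe this image briefly", "https://example.com/team.jpg")
vqa_msg = build_vision_message(
    "What colour is the car in the foreground?", "https://example.com/street.jpg")
```

The resulting message would then be passed in the `messages` list of a chat completion call against a GPT-4o deployment (e.g. `client.chat.completions.create(model="gpt-4o", messages=[caption_msg])`); deployment names and client setup vary by environment.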
Content Understanding for vision
Content Understanding provides structured visual extraction using analyzers with custom schemas:
| Pipeline Mode | What It Does | Best For |
|---|---|---|
| Single-task | One extraction task per pipeline | Simple, high-volume tasks (OCR, field extraction) |
| Pro mode | Multiple capabilities combined | Complex analysis (OCR + layout + field extraction in one pass) |
Within a pipeline, an analyzer can apply these capabilities:
| Capability | What It Extracts |
|---|---|
| OCR | Printed and handwritten text from images |
| Layout analysis | Tables, headings, structure within visual documents |
| Field extraction | Named fields from visual content via custom analyzer schemas |
| Visual characteristics | Colours, textures, dimensions from images via custom schema |
| Text in images | Any text visible in the image (signs, labels, documents) |
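Field extraction and visual characteristics both rely on a custom analyzer schema that names the fields to pull out. The sketch below shows what such a schema might look like for product photos; the field names and the exact schema shape are assumptions for illustration, so check the current Content Understanding reference for the real analyzer definition format.

```python
import json

# Illustrative custom analyzer definition for extracting visual
# characteristics from product photos (hypothetical field names).
product_photo_analyzer = {
    "description": "Extract visual characteristics from product photos",
    "fieldSchema": {
        "fields": {
            "PrimaryColour": {
                "type": "string",
                "description": "Dominant colour of the product"},
            "Texture": {
                "type": "string",
                "description": "Surface texture, e.g. matte or glossy"},
            "VisibleText": {
                "type": "string",
                "description": "Any text printed on the product or packaging"},
        }
    },
}

print(json.dumps(product_photo_analyzer, indent=2))
```

Each field carries a natural-language description, which is what guides the analyzer's extraction rather than any hand-written parsing rules.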
Note: Object detection (bounding boxes), scene classification, and face detection are capabilities of multimodal models (GPT-4o) or legacy Azure AI Vision — not Content Understanding. The exam tests this distinction.
Real-world example: NeuralMed's medical image analysis
NeuralMed uses both approaches for different visual tasks:
Multimodal model (GPT-4o):
- Generates patient-friendly descriptions of medical diagrams
- Answers doctor questions about X-ray images: “Is there any abnormality in the left lung?”
- Creates accessibility alt-text for the patient portal
Content Understanding (pro mode pipeline):
- Extracts structured data from lab report photos: test name, value, normal range, flag
- Processes insurance card photos: member ID, group number, provider name
- Detects and reads text in uploaded prescription images
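A pipeline like NeuralMed's is driven over REST: the client submits an image to a named analyzer, then polls for the structured result. The sketch below only composes the request URL and headers; the path segments, API version, and header names are assumptions modelled on common Azure REST conventions, not a verified Content Understanding contract.

```python
def analyze_request(endpoint: str, analyzer_id: str, api_version: str) -> tuple[str, dict]:
    """Compose the URL and headers for submitting an image to a custom
    analyzer. Path and header names are illustrative assumptions; consult
    the current Content Understanding REST reference for the real shapes."""
    url = (f"{endpoint.rstrip('/')}/contentunderstanding/analyzers/"
           f"{analyzer_id}:analyze?api-version={api_version}")
    headers = {
        "Ocp-Apim-Subscription-Key": "<your-key>",  # placeholder credential
        "Content-Type": "application/json",
    }
    return url, headers

url, headers = analyze_request(
    "https://neuralmed.cognitiveservices.azure.com",  # hypothetical resource
    "lab-report-analyzer",                            # hypothetical analyzer ID
    "2024-12-01-preview")                             # assumed API version
```

In a real client, a POST to this URL would typically return an operation location to poll until the extracted fields (test name, value, normal range, flag) are ready.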
Accessibility: alt-text and image descriptions
| Standard | Requirement | Implementation |
|---|---|---|
| Concise alt-text | 1-2 sentences, describes the image purpose | “A bar chart showing quarterly revenue growth from Q1 to Q4 2025” |
| Extended description | Detailed account for complex images | Full description of chart data, trends, and key takeaways |
| WCAG compliance | Web Content Accessibility Guidelines | Alt-text for all informational images, decorative images marked as such |
Exam tip: Alt-text vs caption
The exam distinguishes:
- Alt-text = describes the image’s purpose for accessibility (screen readers)
- Caption = describes what’s visible in the image for understanding
For a chart: alt-text says “Bar chart showing Q4 revenue is highest at $2.3M.” A caption might say “A blue and grey bar chart with four bars of increasing height.”
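The distinction comes down to what you ask the model for. As a small sketch, the prompts below (wording is illustrative, not prescribed) steer a multimodal model toward alt-text (the image's purpose) versus a caption (its visible appearance):

```python
# Two prompts for the same chart image, illustrating the
# alt-text vs caption distinction. Wording is illustrative.
PROMPTS = {
    # Alt-text: convey the image's *purpose* (the information it carries)
    "alt_text": ("Write 1-2 sentences of alt-text conveying the key "
                 "information this chart communicates, for a screen reader."),
    # Caption: describe what is *visible* in the image
    "caption": ("Describe the visual appearance of this image: layout, "
                "colours, and shapes."),
}

def pick_prompt(for_accessibility: bool) -> str:
    """Choose the prompt matching the intended output."""
    return PROMPTS["alt_text" if for_accessibility else "caption"]
```

For the chart example above, the first prompt should yield something like "Bar chart showing Q4 revenue is highest at $2.3M", the second something like "A blue and grey bar chart with four bars of increasing height".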
Knowledge check
MediaForge needs to automatically generate alt-text for 10,000 marketing images on their client's website to meet WCAG accessibility standards. Which approach is most appropriate?
NeuralMed needs to process uploaded lab report photos to extract test names, values, and normal ranges into a structured database. Which approach is best?