
AI-901 Study Guide

Domain 1: AI Concepts and Capabilities

  • What is AI? Your First 10 Minutes Free
  • Responsible AI: The Six Principles Free
  • How Generative AI Actually Works Free
  • Choosing the Right AI Model Free
  • Deploying AI Models: Options & Settings
  • AI Workloads at a Glance
  • Text Analysis: Keywords, Entities & Sentiment
  • Speech: Recognition & Synthesis
  • Computer Vision: Seeing the World
  • Image Generation: Creating with AI
  • Information Extraction: From Chaos to Structure

Domain 2: Implement AI Solutions Using Foundry

  • Prompting Fundamentals: System & User Prompts
  • Microsoft Foundry: Your AI Command Center Free
  • Building a Chat App with the Foundry SDK
  • Agents in Foundry: Create & Test
  • Building an Agent Client App
  • Building a Text Analysis App
  • Multimodal: Responding to Speech
  • Azure Speech in Foundry Tools
  • Visual Prompts: Images as Input
  • Generating Images with AI
  • Building a Vision App
  • Content Understanding: Documents & Forms
  • Multimodal Extraction: Images, Audio & Video
  • Building an Extraction App
  • Exam Prep: Putting It All Together

Domain 1: AI Concepts and Capabilities · ⏱ ~14 min read

Computer Vision: Seeing the World

AI can look at a photo and tell you what's in it, read text from images, detect objects, and classify scenes. This module covers all the vision capabilities the exam tests.

How does AI see?

☕ Simple explanation

Computer vision lets AI look at images and understand what’s in them — just like you do, but at scale.

When you look at a photo, your brain instantly recognises faces, reads signs, notices objects. Computer vision does the same thing using AI models trained on millions of labelled images.

The difference? AI can process thousands of images per second. A human quality inspector checks maybe 60 items per hour. A vision AI checks 60 per second.

Computer vision is a field of AI that enables systems to interpret and understand visual content from images and video. Modern computer vision uses deep learning models — particularly Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) — trained on millions of labelled images to recognise patterns, objects, faces, text, and scenes.
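The convolution at the heart of a CNN is simple enough to sketch in a few lines. This is not part of the module's required material, just an illustration: a small kernel slides across the image, and each output value is the weighted sum of the patch beneath it. (Like most deep learning libraries, this computes cross-correlation, i.e. convolution without kernel flipping.)

```python
# Minimal 2D "convolution" over a 2D list of numbers, the core
# operation inside a CNN layer. Stacks of such layers with learned
# kernel weights are what let CNNs detect edges, textures, and
# eventually whole objects.

def conv2d(image, kernel):
    """Valid (no-padding) sliding-window weighted sum."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(ih - kh + 1):
        row = []
        for x in range(iw - kw + 1):
            acc = 0.0
            for ky in range(kh):
                for kx in range(kw):
                    acc += image[y + ky][x + kx] * kernel[ky][kx]
            row.append(acc)
        out.append(row)
    return out

# A classic vertical-edge kernel: responds where brightness
# changes from left to right.
edge_kernel = [[1, 0, -1],
               [1, 0, -1],
               [1, 0, -1]]

# 4x4 toy image: dark left half (0), bright right half (9).
image = [[0, 0, 9, 9]] * 4

print(conv2d(image, edge_kernel))  # strong response everywhere along the edge
```

A trained CNN learns thousands of such kernels automatically rather than using hand-written ones like `edge_kernel`.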

Azure provides computer vision through Azure AI Vision (part of Foundry Tools) and through multimodal models like GPT-4o that accept images as input.
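To make "accept images as input" concrete, here is a sketch of the widely used OpenAI-style multimodal chat message, where an image URL travels alongside text in a single user turn. The exact request shape can vary by SDK and API version, and the URL below is a placeholder, so verify against your SDK's documentation.

```python
# Sketch of an OpenAI-style multimodal chat message: one user turn
# carrying both a text prompt and an image reference. Models such as
# GPT-4o accept this shape; client setup (endpoint, deployment name)
# is deployment-specific and omitted here.

def build_image_message(prompt: str, image_url: str) -> dict:
    """Return a single user message combining text and an image URL."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

msg = build_image_message(
    "Describe what is in this photo.",
    "https://example.com/xray.png",  # placeholder URL
)
print(msg["content"][0]["type"], msg["content"][1]["type"])
```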

Computer vision capabilities

Key computer vision capabilities

| Feature | What it does | Example |
| --- | --- | --- |
| Image classification | Assigns labels/categories to an image | 'This is a photo of a cat' or 'This X-ray shows pneumonia' |
| Object detection | Finds and locates specific objects within an image with bounding boxes | Counting people in a room, detecting products on a shelf |
| Image description | Generates a natural language description of what's in the image | 'A woman in a white coat examining an X-ray on a lightbox' |
| OCR (Optical Character Recognition) | Reads and extracts text from images | Reading a licence plate, extracting text from a scanned document |
| Face detection | Detects human faces and attributes (head pose, glasses, blur) | Security cameras, photo organisation, accessibility |
| Spatial analysis | Analyses movement and positioning in video | Counting foot traffic in a store, social distancing monitoring |

Image classification: is this a cat or a dog?

The simplest vision task — the model looks at an image and assigns it to a category.

| Use case | Input | Output |
| --- | --- | --- |
| Medical imaging | X-ray of a lung | "Pneumonia detected" or "Normal" |
| Quality control | Photo of a product | "Pass" or "Defect detected" |
| Content moderation | Uploaded image | "Safe" or "Contains violence" |

MediSpark scenario: MediSpark trains a classification model to sort dermatology images into categories: benign, monitor, urgent referral. Each category triggers a different workflow in their patient management system.
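A scenario like MediSpark's implies some routing logic downstream of the classifier. This sketch is hypothetical (the workflow names and the 0.7 threshold are illustrative, not from any real system), but it shows the common pattern: act on the top class only when the model is confident, and escalate to a human otherwise.

```python
# Hypothetical routing for a three-class dermatology classifier:
# take the highest-scoring category, but fall back to human review
# when the model is not confident enough.

def route(scores: dict, threshold: float = 0.7) -> str:
    """Map class probabilities to a workflow name."""
    label = max(scores, key=scores.get)
    if scores[label] < threshold:
        return "human_review"  # low confidence: escalate to a person
    return {
        "benign": "routine_followup",
        "monitor": "schedule_recheck",
        "urgent referral": "fast_track",
    }[label]

print(route({"benign": 0.91, "monitor": 0.07, "urgent referral": 0.02}))
# → routine_followup
print(route({"benign": 0.45, "monitor": 0.40, "urgent referral": 0.15}))
# → human_review (no class is confident enough)
```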

Object detection: what’s in the picture and where?

Object detection goes beyond classification: it identifies specific objects and marks their location with bounding boxes.

GreenLeaf scenario: GreenLeaf uses object detection on drone photos of their fields:

  • Detects individual plants
  • Identifies weeds vs crops
  • Counts healthy vs diseased plants
  • Maps problem areas for targeted treatment
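Detection output is typically a list of (label, confidence, bounding box) tuples, one per object found. The post-processing below is a sketch with made-up data, but it shows how a scenario like GreenLeaf's turns raw detections into counts: keep confident boxes, discard uncertain ones, and tally by label.

```python
# Hypothetical post-processing of object-detection output.
# Each detection: (label, confidence, (x, y, width, height)).

def summarise(detections, min_conf=0.5):
    """Count confident detections per label."""
    counts = {}
    for label, conf, box in detections:
        if conf >= min_conf:  # drop low-confidence boxes
            counts[label] = counts.get(label, 0) + 1
    return counts

# Illustrative detections from one drone frame (values invented):
drone_frame = [
    ("crop", 0.94, (10, 12, 30, 40)),
    ("crop", 0.88, (55, 14, 28, 42)),
    ("weed", 0.76, (90, 20, 15, 18)),
    ("weed", 0.31, (120, 25, 14, 16)),  # too uncertain: ignored
]

print(summarise(drone_frame))  # → {'crop': 2, 'weed': 1}
```

The bounding boxes the counter ignores here are exactly what a "map problem areas" step would consume next.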

OCR: reading text from images

Optical Character Recognition (OCR) extracts text from images — printed text, handwriting, signs, documents.

| Source | What OCR reads |
| --- | --- |
| Scanned documents | Full page text, tables, headers |
| Business cards | Name, phone, email, company |
| Street signs | Road names, directions |
| Handwritten notes | Handwriting (with varying accuracy) |
| Receipts | Items, prices, totals |

Key exam concept: OCR is the bridge between the physical and digital world. It’s a computer vision capability, but its output feeds into text analysis and information extraction workflows.

💡 OCR vs Content Understanding

OCR and Content Understanding (Module 11) are related but different:

| OCR | Content Understanding |
| --- | --- |
| Extracts raw text from images | Extracts structured fields from documents |
| Output: "Dr. Sarah Chen, DOB 15/03/1985" | Output: structured JSON with named fields like name, dob |
| Doesn't understand what the text means | Understands document structure and field meanings |
| General-purpose | Trained for specific document types |

The exam may test whether you know when to use simple OCR vs full Content Understanding.
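The contrast is easy to see in code. With raw OCR output you have to pattern-match fields out yourself, which breaks the moment a document deviates from the pattern; structured extraction returns named fields directly. The data and the regex here are illustrative only.

```python
import re

# Raw OCR output: one undifferentiated string.
ocr_output = "Dr. Sarah Chen, DOB 15/03/1985"

# Fragile DIY parsing of raw OCR text: works only while every
# document happens to match this exact pattern.
match = re.search(r"DOB\s+(\d{2}/\d{2}/\d{4})", ocr_output)
dob_from_ocr = match.group(1) if match else None

# What a Content Understanding-style extractor returns instead
# (illustrative shape): named fields, no parsing needed.
structured = {"name": "Dr. Sarah Chen", "dob": "15/03/1985"}

print(dob_from_ocr)       # → 15/03/1985 (only because the regex matched)
print(structured["dob"])  # → 15/03/1985 (a named field, read directly)
```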

Azure AI Vision capabilities

Azure AI Vision (Foundry Tools) provides:

| Capability | API |
| --- | --- |
| Image analysis (tags, description, objects, people) | Image Analysis 4.0 |
| OCR | Read API |
| Face detection | Face API |
| Custom models | Custom Vision (train your own classifier) |
| Spatial analysis | Video analysis for movement patterns |
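For orientation, here is a sketch of calling the Image Analysis 4.0 API from Python. It assumes the `azure-ai-vision-imageanalysis` package and an existing Vision resource; package names, method signatures, and result shapes can differ across SDK versions, so treat this as a shape to verify against the current Microsoft Learn docs rather than a definitive implementation. The image URL is a placeholder, and the network call only runs if the endpoint/key environment variables are set.

```python
import os

def confident_tags(tags, threshold=0.8):
    """Keep tag names whose confidence meets the threshold."""
    return [name for name, conf in tags if conf >= threshold]

# Guard: only attempt the real API call when credentials are configured.
if os.environ.get("VISION_ENDPOINT") and os.environ.get("VISION_KEY"):
    # Assumed SDK (azure-ai-vision-imageanalysis); check current docs.
    from azure.ai.vision.imageanalysis import ImageAnalysisClient
    from azure.ai.vision.imageanalysis.models import VisualFeatures
    from azure.core.credentials import AzureKeyCredential

    client = ImageAnalysisClient(
        endpoint=os.environ["VISION_ENDPOINT"],
        credential=AzureKeyCredential(os.environ["VISION_KEY"]),
    )
    result = client.analyze_from_url(
        image_url="https://example.com/sample.jpg",  # placeholder URL
        visual_features=[VisualFeatures.CAPTION, VisualFeatures.TAGS],
    )
    print(result.caption.text)  # natural-language image description
    print(confident_tags([(t.name, t.confidence) for t in result.tags.list]))
```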

🎬 Video walkthrough

Video coming soon: Computer Vision — AI-901 Module 9 (~14 min)

Flashcards

Question

What is the difference between image classification and object detection?

Answer

Image classification assigns a single label to an entire image ('this is a cat'). Object detection finds and locates specific objects within an image with bounding boxes ('there's a cat at position X and a dog at position Y').

Question

What is OCR in computer vision?

Answer

Optical Character Recognition — extracting readable text from images. It can read printed text, handwriting, signs, documents, and receipts. In Azure, it's available through the Read API.

Question

What is spatial analysis in computer vision?

Answer

Analysing movement and positioning of people in video feeds. Used for foot traffic counting, social distancing monitoring, and zone-based occupancy tracking.

Question

What are the two main architectures used in modern computer vision?

Answer

Convolutional Neural Networks (CNNs) — the traditional approach for image tasks, and Vision Transformers (ViTs) — newer architecture applying transformer attention to image patches.

Knowledge Check

1. GreenLeaf uses a drone to photograph their fields. They need AI to count individual plants, identify which are weeds, and mark their exact location in the image. Which computer vision capability is this?

2. DataFlow Corp receives thousands of business cards at conferences. They want to read all the text from photos of each card so they can search and filter it later. Which computer vision capability is most appropriate?


Next up: Image Generation — how AI creates entirely new images from text descriptions.


Guided

I learn, I simplify, I share.


© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.