
AI-901 Study Guide

Domain 1: AI Concepts and Capabilities

  • What is AI? Your First 10 Minutes Free
  • Responsible AI: The Six Principles Free
  • How Generative AI Actually Works Free
  • Choosing the Right AI Model Free
  • Deploying AI Models: Options & Settings
  • AI Workloads at a Glance
  • Text Analysis: Keywords, Entities & Sentiment
  • Speech: Recognition & Synthesis
  • Computer Vision: Seeing the World
  • Image Generation: Creating with AI
  • Information Extraction: From Chaos to Structure

Domain 2: Implement AI Solutions Using Foundry

  • Prompting Fundamentals: System & User Prompts
  • Microsoft Foundry: Your AI Command Center Free
  • Building a Chat App with the Foundry SDK
  • Agents in Foundry: Create & Test
  • Building an Agent Client App
  • Building a Text Analysis App
  • Multimodal: Responding to Speech
  • Azure Speech in Foundry Tools
  • Visual Prompts: Images as Input
  • Generating Images with AI
  • Building a Vision App
  • Content Understanding: Documents & Forms
  • Multimodal Extraction: Images, Audio & Video
  • Building an Extraction App
  • Exam Prep: Putting It All Together

Domain 1: AI Concepts and Capabilities · ⏱ ~14 min read

Computer Vision: Seeing the World

AI can look at a photo and tell you what's in it, read text from images, detect objects, and classify scenes. This module covers all the vision capabilities the exam tests.

How does AI see?

☕ Simple explanation

Computer vision lets AI look at images and understand what’s in them — just like you do, but at scale.

When you look at a photo, your brain instantly recognises faces, reads signs, notices objects. Computer vision does the same thing using AI models trained on millions of labelled images.

The difference? AI can process thousands of images per second. A human quality inspector checks maybe 60 items per hour. A vision AI checks 60 per second.

Computer vision is a field of AI that enables systems to interpret and understand visual content from images and video. Modern computer vision uses deep learning models — particularly Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) — trained on millions of labelled images to recognise patterns, objects, faces, text, and scenes.
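The convolution at the heart of a CNN is simple enough to sketch in a few lines. This is not part of the module's required material, just an illustration: a small kernel slides across the image, and each output value is the weighted sum of the patch beneath it. (Like most deep learning libraries, this computes cross-correlation, i.e. convolution without kernel flipping.)

```python
# Minimal 2D "convolution" over a 2D list of numbers, the core
# operation inside a CNN layer. Stacks of such layers with learned
# kernel weights are what let CNNs detect edges, textures, and
# eventually whole objects.

def conv2d(image, kernel):
    """Valid (no-padding) sliding-window weighted sum."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(ih - kh + 1):
        row = []
        for x in range(iw - kw + 1):
            acc = 0.0
            for ky in range(kh):
                for kx in range(kw):
                    acc += image[y + ky][x + kx] * kernel[ky][kx]
            row.append(acc)
        out.append(row)
    return out

# A classic vertical-edge kernel: responds where brightness
# changes from left to right.
edge_kernel = [[1, 0, -1],
               [1, 0, -1],
               [1, 0, -1]]

# 4x4 toy image: dark left half (0), bright right half (9).
image = [[0, 0, 9, 9]] * 4

print(conv2d(image, edge_kernel))  # strong response everywhere along the edge
```

A trained CNN learns thousands of such kernels automatically rather than using hand-written ones like `edge_kernel`.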

Azure provides computer vision through Azure AI Vision (part of Foundry Tools) and through multimodal models like GPT-4o that accept images as input.
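To make "accept images as input" concrete, here is a sketch of the widely used OpenAI-style multimodal chat message, where an image URL travels alongside text in a single user turn. The exact request shape can vary by SDK and API version, and the URL below is a placeholder, so verify against your SDK's documentation.

```python
# Sketch of an OpenAI-style multimodal chat message: one user turn
# carrying both a text prompt and an image reference. Models such as
# GPT-4o accept this shape; client setup (endpoint, deployment name)
# is deployment-specific and omitted here.

def build_image_message(prompt: str, image_url: str) -> dict:
    """Return a single user message combining text and an image URL."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

msg = build_image_message(
    "Describe what is in this photo.",
    "https://example.com/xray.png",  # placeholder URL
)
print(msg["content"][0]["type"], msg["content"][1]["type"])
```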

Computer vision capabilities

Key computer vision capabilities

| Feature | What it does | Example |
| --- | --- | --- |
| Image classification | Assigns labels/categories to an image | 'This is a photo of a cat' or 'This X-ray shows pneumonia' |
| Object detection | Finds and locates specific objects within an image with bounding boxes | Counting people in a room, detecting products on a shelf |
| Image description | Generates a natural language description of what's in the image | 'A woman in a white coat examining an X-ray on a lightbox' |
| OCR (Optical Character Recognition) | Reads and extracts text from images | Reading a licence plate, extracting text from a scanned document |
| Face detection | Detects human faces and attributes (head pose, glasses, blur) | Security cameras, photo organisation, accessibility |
| Spatial analysis | Analyses movement and positioning in video | Counting foot traffic in a store, social distancing monitoring |

Image classification: is this a cat or a dog?

The simplest vision task — the model looks at an image and assigns it to a category.

| Use case | Input | Output |
| --- | --- | --- |
| Medical imaging | X-ray of a lung | "Pneumonia detected" or "Normal" |
| Quality control | Photo of a product | "Pass" or "Defect detected" |
| Content moderation | Uploaded image | "Safe" or "Contains violence" |

MediSpark scenario: MediSpark trains a classification model to sort dermatology images into categories: benign, monitor, urgent referral. Each category triggers a different workflow in their patient management system.
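A scenario like MediSpark's implies some routing logic downstream of the classifier. This sketch is hypothetical (the workflow names and the 0.7 threshold are illustrative, not from any real system), but it shows the common pattern: act on the top class only when the model is confident, and escalate to a human otherwise.

```python
# Hypothetical routing for a three-class dermatology classifier:
# take the highest-scoring category, but fall back to human review
# when the model is not confident enough.

def route(scores: dict, threshold: float = 0.7) -> str:
    """Map class probabilities to a workflow name."""
    label = max(scores, key=scores.get)
    if scores[label] < threshold:
        return "human_review"  # low confidence: escalate to a person
    return {
        "benign": "routine_followup",
        "monitor": "schedule_recheck",
        "urgent referral": "fast_track",
    }[label]

print(route({"benign": 0.91, "monitor": 0.07, "urgent referral": 0.02}))
# → routine_followup
print(route({"benign": 0.45, "monitor": 0.40, "urgent referral": 0.15}))
# → human_review (no class is confident enough)
```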

Object detection: what’s in the picture and where?

Object detection goes beyond classification: it identifies specific objects and marks their location with bounding boxes.

GreenLeaf scenario: GreenLeaf uses object detection on drone photos of their fields:

  • Detects individual plants
  • Identifies weeds vs crops
  • Counts healthy vs diseased plants
  • Maps problem areas for targeted treatment
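Detection output is typically a list of (label, confidence, bounding box) tuples, one per object found. The post-processing below is a sketch with made-up data, but it shows how a scenario like GreenLeaf's turns raw detections into counts: keep confident boxes, discard uncertain ones, and tally by label.

```python
# Hypothetical post-processing of object-detection output.
# Each detection: (label, confidence, (x, y, width, height)).

def summarise(detections, min_conf=0.5):
    """Count confident detections per label."""
    counts = {}
    for label, conf, box in detections:
        if conf >= min_conf:  # drop low-confidence boxes
            counts[label] = counts.get(label, 0) + 1
    return counts

# Illustrative detections from one drone frame (values invented):
drone_frame = [
    ("crop", 0.94, (10, 12, 30, 40)),
    ("crop", 0.88, (55, 14, 28, 42)),
    ("weed", 0.76, (90, 20, 15, 18)),
    ("weed", 0.31, (120, 25, 14, 16)),  # too uncertain: ignored
]

print(summarise(drone_frame))  # → {'crop': 2, 'weed': 1}
```

The bounding boxes the counter ignores here are exactly what a "map problem areas" step would consume next.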

OCR: reading text from images

Optical Character Recognition (OCR) extracts text from images — printed text, handwriting, signs, documents.

| Source | What OCR reads |
| --- | --- |
| Scanned documents | Full page text, tables, headers |
| Business cards | Name, phone, email, company |
| Street signs | Road names, directions |
| Handwritten notes | Handwriting (with varying accuracy) |
| Receipts | Items, prices, totals |

Key exam concept: OCR is the bridge between the physical and digital world. It’s a computer vision capability, but its output feeds into text analysis and information extraction workflows.

💡 OCR vs Content Understanding

OCR and Content Understanding (Module 11) are related but different:

| OCR | Content Understanding |
| --- | --- |
| Extracts raw text from images | Extracts structured fields from documents |
| Output: "Dr. Sarah Chen, DOB 15/03/1985" | Output: structured JSON with named fields like name, dob |
| Doesn't understand what the text means | Understands document structure and field meanings |
| General-purpose | Trained for specific document types |

The exam may test whether you know when to use simple OCR vs full Content Understanding.
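The contrast is easy to see in code. With raw OCR output you have to pattern-match fields out yourself, which breaks the moment a document deviates from the pattern; structured extraction returns named fields directly. The data and the regex here are illustrative only.

```python
import re

# Raw OCR output: one undifferentiated string.
ocr_output = "Dr. Sarah Chen, DOB 15/03/1985"

# Fragile DIY parsing of raw OCR text: works only while every
# document happens to match this exact pattern.
match = re.search(r"DOB\s+(\d{2}/\d{2}/\d{4})", ocr_output)
dob_from_ocr = match.group(1) if match else None

# What a Content Understanding-style extractor returns instead
# (illustrative shape): named fields, no parsing needed.
structured = {"name": "Dr. Sarah Chen", "dob": "15/03/1985"}

print(dob_from_ocr)       # → 15/03/1985 (only because the regex matched)
print(structured["dob"])  # → 15/03/1985 (a named field, read directly)
```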

Azure AI Vision capabilities

Azure AI Vision (Foundry Tools) provides:

| Capability | API |
| --- | --- |
| Image analysis (tags, description, objects, people) | Image Analysis 4.0 |
| OCR | Read API |
| Face detection | Face API |
| Custom models | Custom Vision (train your own classifier) |
| Spatial analysis | Video analysis for movement patterns |
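For orientation, here is a sketch of calling the Image Analysis 4.0 API from Python. It assumes the `azure-ai-vision-imageanalysis` package and an existing Vision resource; package names, method signatures, and result shapes can differ across SDK versions, so treat this as a shape to verify against the current Microsoft Learn docs rather than a definitive implementation. The image URL is a placeholder, and the network call only runs if the endpoint/key environment variables are set.

```python
import os

def confident_tags(tags, threshold=0.8):
    """Keep tag names whose confidence meets the threshold."""
    return [name for name, conf in tags if conf >= threshold]

# Guard: only attempt the real API call when credentials are configured.
if os.environ.get("VISION_ENDPOINT") and os.environ.get("VISION_KEY"):
    # Assumed SDK (azure-ai-vision-imageanalysis); check current docs.
    from azure.ai.vision.imageanalysis import ImageAnalysisClient
    from azure.ai.vision.imageanalysis.models import VisualFeatures
    from azure.core.credentials import AzureKeyCredential

    client = ImageAnalysisClient(
        endpoint=os.environ["VISION_ENDPOINT"],
        credential=AzureKeyCredential(os.environ["VISION_KEY"]),
    )
    result = client.analyze_from_url(
        image_url="https://example.com/sample.jpg",  # placeholder URL
        visual_features=[VisualFeatures.CAPTION, VisualFeatures.TAGS],
    )
    print(result.caption.text)  # natural-language image description
    print(confident_tags([(t.name, t.confidence) for t in result.tags.list]))
```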

🎬 Video walkthrough

Video coming soon: Computer Vision — AI-901 Module 9 (~14 min)

Flashcards

Question

What is the difference between image classification and object detection?

Answer

Image classification assigns a single label to an entire image ('this is a cat'). Object detection finds and locates specific objects within an image with bounding boxes ('there's a cat at position X and a dog at position Y').

Question

What is OCR in computer vision?

Answer

Optical Character Recognition — extracting readable text from images. It can read printed text, handwriting, signs, documents, and receipts. In Azure, it's available through the Read API.

Question

What is spatial analysis in computer vision?

Answer

Analysing movement and positioning of people in video feeds. Used for foot traffic counting, social distancing monitoring, and zone-based occupancy tracking.

Question

What are the two main architectures used in modern computer vision?

Answer

Convolutional Neural Networks (CNNs) — the traditional approach for image tasks, and Vision Transformers (ViTs) — newer architecture applying transformer attention to image patches.

Knowledge Check

1. GreenLeaf uses a drone to photograph their fields. They need AI to count individual plants, identify which are weeds, and mark their exact location in the image. Which computer vision capability is this?

2. DataFlow Corp receives thousands of business cards at conferences. They want to read all the text from photos of each card so they can search and filter it later. Which computer vision capability is most appropriate?


Next up: Image Generation — how AI creates entirely new images from text descriptions.


Guided

I learn, I simplify, I share.


© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.