Building a Vision App
Combine image analysis capabilities into a complete application. Use Azure AI Vision to classify images, detect objects, and read text, all from Python.
Building with Azure AI Vision
Module 20 used GPT-4o to answer questions about images. This module uses Azure AI Vision, a dedicated service that's faster and cheaper for specific vision tasks.
Think of the difference: GPT-4o is like a brilliant friend who can discuss anything about an image. Azure AI Vision is like a specialist tool, optimised for reading text (OCR), detecting objects, and classifying images with high speed and accuracy.
Azure AI Vision vs GPT-4o for vision
| Feature | Azure AI Vision | GPT-4o Visual Prompts |
|---|---|---|
| Best for | High-volume classification, OCR, object detection | Complex visual reasoning, open-ended questions |
| Output | Structured JSON (tags, objects, text) | Natural language response |
| Cost | Lower per-transaction | Higher per-token |
| Custom models | Yes – Custom Vision service for your own classifiers | No – uses general knowledge |
| Speed | Fast – optimised for vision | Slower – processes the full LLM pipeline |
Image analysis with the SDK
```python
from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.ai.vision.imageanalysis.models import VisualFeatures
from azure.core.credentials import AzureKeyCredential

client = ImageAnalysisClient(
    endpoint="https://your-vision-resource.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("your-key"),
)

# Analyse an image hosted at a URL
# (for a URL, the SDK method is analyze_from_url; analyze takes raw image bytes)
result = client.analyze_from_url(
    image_url="https://example.com/field-photo.jpg",
    visual_features=[
        VisualFeatures.CAPTION,
        VisualFeatures.TAGS,
        VisualFeatures.OBJECTS,
        VisualFeatures.READ,
    ],
)

# Caption
if result.caption is not None:
    print(f"Caption: {result.caption.text} (confidence: {result.caption.confidence:.2f})")

# Tags
for tag in result.tags.list:
    print(f"Tag: {tag.name} ({tag.confidence:.2f})")

# Objects detected (each has a bounding box and one or more tags)
for obj in result.objects.list:
    print(f"Object: {obj.tags[0].name} at {obj.bounding_box}")

# Text (OCR): blocks contain lines of recognised text
for block in result.read.blocks:
    for line in block.lines:
        print(f"Text: {line.text}")
```
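Those confidence scores are worth acting on: low-confidence tags are often noise. A minimal sketch of threshold filtering, written over plain (name, confidence) pairs so it mirrors the `tag.name` / `tag.confidence` fields on the SDK's tag objects without needing a live service; the threshold value is illustrative, not an SDK default:

```python
MIN_CONFIDENCE = 0.7  # illustrative threshold; tune per workload


def filter_tags(tags, threshold=MIN_CONFIDENCE):
    """Keep only tag names at or above the confidence threshold.

    Each tag is a (name, confidence) pair, matching the shape of the
    SDK's detected tags (tag.name, tag.confidence).
    """
    return [name for name, confidence in tags if confidence >= threshold]


# Made-up scores for illustration:
tags = [("wheat", 0.95), ("field", 0.88), ("tractor", 0.41)]
print(filter_tags(tags))  # → ['wheat', 'field']
```

The same pattern applies to detected objects, where a cut-off keeps spurious detections out of downstream counts.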
Visual features explained
| Feature | What It Returns | Use Case |
|---|---|---|
| CAPTION | A natural language description of the image | Accessibility, image cataloguing |
| TAGS | List of keywords describing the content | Search indexing, content tagging |
| OBJECTS | Detected objects with bounding boxes | Quality control, inventory counting |
| READ | Extracted text (OCR) | Document processing, sign reading |
| PEOPLE | Detected people with positions | Crowd analysis, security |
| SMART_CROPS | Suggested crop regions for thumbnails | Social media, responsive images |
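Because each requested feature adds processing, it pays to request only what a given use case needs. A small sketch of that idea as a lookup table; the mapping and use-case names are illustrative (the strings correspond to `VisualFeatures` enum members, but this table is not an SDK construct):

```python
# Keys are illustrative use cases; values name VisualFeatures members
FEATURES_BY_USE_CASE = {
    "accessibility": ["CAPTION"],
    "search_indexing": ["TAGS"],
    "quality_control": ["OBJECTS"],
    "document_processing": ["READ"],
}


def features_for(use_case):
    """Return the minimal feature list for a known use case."""
    return FEATURES_BY_USE_CASE[use_case]


print(features_for("quality_control"))  # → ['OBJECTS']
```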
GreenLeaf scenario: GreenLeaf builds a crop health monitoring app:
- Farmer uploads a field photo via the mobile app
- TAGS → identifies plant types and soil conditions
- OBJECTS → counts individual plants, locates problem areas
- CAPTION → generates a description for the report
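The steps above can be sketched as post-processing on one analysis result. This uses plain dicts mirroring the SDK's result shapes rather than a live call, and the sample values (the "plant" tag, the caption text) are made-up assumptions about what the service might return for a field photo:

```python
def crop_health_summary(result):
    """Turn a vision result (dicts mirroring the SDK shapes) into a report row."""
    # OBJECTS → count detections tagged as plants
    plant_count = sum(
        1
        for obj in result["objects"]
        if any(tag["name"] == "plant" for tag in obj["tags"])
    )
    return {
        "caption": result["caption"]["text"],   # CAPTION → report description
        "tags": [t["name"] for t in result["tags"]],  # TAGS → plant/soil keywords
        "plant_count": plant_count,
    }


# Illustrative sample result for one field photo:
sample = {
    "caption": {"text": "rows of young maize in a field"},
    "tags": [{"name": "plant"}, {"name": "soil"}],
    "objects": [{"tags": [{"name": "plant"}]}, {"tags": [{"name": "plant"}]}],
}
print(crop_health_summary(sample))
```

In the real app the `result` dict would come from `client.analyze_from_url(...)` on the uploaded photo, with the same field names accessed as attributes.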
🎬 Video walkthrough
Video coming soon: Building a Vision App – AI-901 Module 22 (~14 min)
Knowledge Check
GreenLeaf wants to build an app that processes 5,000 field photos daily, tagging each with the type of crop visible. Which approach is most cost-effective?
DataFlow Corp receives scanned business documents. They need to: 1) extract all text, 2) identify what objects appear in any embedded photos, and 3) generate a description of each page. Which visual features do they request?
Next up: Content Understanding: Documents & Forms – extracting structured data from invoices, receipts, and forms.