Content Understanding: Documents & Forms
Turn messy documents into clean, structured data. Azure Content Understanding reads invoices, receipts, ID cards, and forms, extracting exactly the fields you need.
Extracting data from documents
Content Understanding is like a super-powered data entry clerk: it reads any document and pulls out exactly the data you need.
Hand it a stack of invoices and it returns a spreadsheet of vendor names, amounts, and dates. Hand it receipts and it returns items and totals. Hand it ID cards and it returns names and ID numbers. All automatically, and in seconds.
The magic is that it doesn't just read text (that's OCR). It understands the document's structure: it knows that the number next to "Total:" is the total amount, not just a random number.
What Content Understanding extracts
Prebuilt document models
Content Understanding includes prebuilt analyzers for common document types:
| Document Type | Fields Extracted |
|---|---|
| Invoices | Vendor name, invoice number, date, line items, subtotal, tax, total, payment terms |
| Receipts | Merchant name, date, items, prices, total, tax, tip |
| ID documents | Name, date of birth, document number, nationality, expiry date |
| Business cards | Name, title, company, phone, email, address |
| Tax forms (W-2, 1099) | Employee/employer info, wages, tax withheld |
| Health insurance cards | Member name, ID, group number, plan type |
Custom analyzers
For documents specific to your business, you can train custom analyzers:
- Upload sample documents
- Label the fields you want to extract
- Train the model
- Deploy and use in your application
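Conceptually, a custom analyzer boils down to a named field schema: the fields you labeled, each with a type and a description that guides extraction. The sketch below is illustrative only; the field names and schema keys are assumptions for this example, not the service's exact wire format.

```python
import json

# Illustrative custom analyzer definition for GreenLeaf's crop
# inspection reports. Keys and field names are assumptions made
# for illustration, not the service's exact request schema.
crop_report_analyzer = {
    "description": "Extracts data from GreenLeaf crop inspection reports",
    "fieldSchema": {
        "fields": {
            "FieldLocation": {"type": "string", "description": "Location of the inspected field"},
            "CropType": {"type": "string", "description": "Crop grown in the field"},
            "HealthRating": {"type": "number", "description": "Health rating from 1 to 10"},
        }
    },
}

print(json.dumps(crop_report_analyzer, indent=2))
```

Once defined and trained on labeled samples, the analyzer is called the same way as a prebuilt one, returning the same field-plus-confidence structure.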
GreenLeaf scenario: GreenLeaf receives supplier invoices in 20 different formats. They use the prebuilt invoice model to extract vendor, amount, and due date, with no training needed. For their custom crop inspection reports, they train a custom model to extract field location, crop type, and health rating.
Building a document extraction app
```python
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential

client = DocumentIntelligenceClient(
    endpoint="https://your-resource.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("your-key"),
)

# Analyse an invoice with the prebuilt invoice model
with open("invoice.pdf", "rb") as f:
    result = client.begin_analyze_document(
        model_id="prebuilt-invoice",
        body=f.read(),
    ).result()

# Extract the fields we care about; each field may be absent
for document in result.documents:
    vendor = document.fields.get("VendorName")
    total = document.fields.get("InvoiceTotal")
    date = document.fields.get("InvoiceDate")
    print(f"Vendor: {vendor.content if vendor else 'N/A'}")
    print(f"Total: {total.content if total else 'N/A'}")
    print(f"Date: {date.content if date else 'N/A'}")
```
How it works under the hood
Content Understanding processes documents in layers:
| Layer | What Happens |
|---|---|
| 1. OCR | Reads all text from the document (printed and handwritten) |
| 2. Layout analysis | Identifies tables, headers, paragraphs, sections, and page structure |
| 3. Field mapping | Maps specific text regions to named fields based on the model |
| 4. Confidence scoring | Each extracted field includes a confidence score (0.0 to 1.0) |
| 5. Validation | Checks formats: dates look like dates, amounts look like amounts |
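The validation layer can be approximated on the client side as well: after extraction, re-check that a field's content matches its expected format before trusting it downstream. The checks below are a minimal local sketch, not the service's actual validation logic.

```python
import re
from datetime import datetime

def looks_like_amount(text: str) -> bool:
    """Check that an extracted total looks like a currency amount, e.g. '$1,234.56'."""
    return bool(re.fullmatch(r"[$€£]?\s?\d{1,3}(,\d{3})*(\.\d{2})?", text.strip()))

def looks_like_date(text: str) -> bool:
    """Check that an extracted date parses in a few common formats."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d %B %Y"):
        try:
            datetime.strptime(text.strip(), fmt)
            return True
        except ValueError:
            pass
    return False

print(looks_like_amount("$1,234.56"))     # True
print(looks_like_date("2025-03-14"))      # True
print(looks_like_amount("see attached"))  # False
```

A failed format check is a good signal to route the document to human review, regardless of the model's confidence score.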
Confidence scores and handling uncertainty
Every extracted field includes a confidence score:
- 0.90-1.00: high confidence, likely correct
- 0.70-0.89: medium confidence, may need review
- Below 0.70: low confidence, likely needs human verification
Best practice: Set a threshold (e.g., 0.85) and flag documents below it for human review. This gives you automation speed with human accuracy.
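That best practice can be sketched as a simple routing function. The threshold value and the shape of the `fields` dict (field name mapped to a value/confidence pair) are assumptions made for illustration.

```python
REVIEW_THRESHOLD = 0.85  # tune per document type and business risk

def route_document(fields: dict[str, tuple[str, float]]) -> str:
    """Route a document to straight-through processing or human review.

    `fields` maps a field name to (extracted value, confidence score),
    mirroring the per-field confidence the service returns.
    """
    low = [name for name, (_, conf) in fields.items() if conf < REVIEW_THRESHOLD]
    if low:
        return f"human-review (low confidence: {', '.join(low)})"
    return "auto-process"

invoice = {
    "VendorName": ("Contoso Ltd", 0.97),
    "InvoiceTotal": ("$1,250.00", 0.65),  # below threshold, so flagged
}
print(route_document(invoice))  # human-review (low confidence: InvoiceTotal)
```

Flagging at the field level, rather than rejecting the whole document, lets reviewers correct only the uncertain values.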
Exam relevance: The exam may test your understanding of confidence thresholds and when to involve human review; this connects to the reliability and safety responsible AI principle.
🎬 Video walkthrough
🎬 Video coming soon
Content Understanding: Documents (AI-901 Module 23)
~14 min
Knowledge Check
GreenLeaf processes invoices from 20 different suppliers, each with a different format. They want to extract vendor name, total amount, and due date from each. What's the best approach?
Content Understanding extracts an invoice total with a confidence score of 0.65. What should the application do?
Next up: Multimodal Extraction, pulling data from images, audio, and video using Content Understanding.