Content Understanding: Documents & Forms
Turn messy documents into clean, structured data. Azure Content Understanding reads invoices, receipts, ID cards, and forms, extracting exactly the fields you need.
Extracting data from documents
Content Understanding is like a super-powered data entry clerk: it reads any document and pulls out exactly the data you need.
Hand it a stack of invoices and it returns a spreadsheet of vendor names, amounts, and dates. Hand it receipts and it returns items and totals. Hand it ID cards and it returns names and ID numbers. All automatically, and in seconds.
The magic is that it doesn't just read text (that's OCR). It understands the document's structure: it knows that the number next to "Total:" is the total amount, not just a random number.
What Content Understanding extracts
Prebuilt document models
Content Understanding includes prebuilt analyzers for common document types:
| Document Type | Fields Extracted |
|---|---|
| Invoices | Vendor name, invoice number, date, line items, subtotal, tax, total, payment terms |
| Receipts | Merchant name, date, items, prices, total, tax, tip |
| ID documents | Name, date of birth, document number, nationality, expiry date |
| Business cards | Name, title, company, phone, email, address |
| Tax forms (W-2, 1099) | Employee/employer info, wages, tax withheld |
| Health insurance cards | Member name, ID, group number, plan type |
Custom analyzers
For documents specific to your business, you can train custom analyzers:
- Upload sample documents
- Label the fields you want to extract
- Train the model
- Deploy and use in your application
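Conceptually, a custom analyzer boils down to a named field schema: the fields you labeled, each with a type and a description that guides extraction. The sketch below is illustrative only; the field names and schema keys are assumptions for this example, not the service's exact wire format.

```python
import json

# Illustrative custom analyzer definition for GreenLeaf's crop
# inspection reports. Keys and field names are assumptions made
# for illustration, not the service's exact request schema.
crop_report_analyzer = {
    "description": "Extracts data from GreenLeaf crop inspection reports",
    "fieldSchema": {
        "fields": {
            "FieldLocation": {"type": "string", "description": "Location of the inspected field"},
            "CropType": {"type": "string", "description": "Crop grown in the field"},
            "HealthRating": {"type": "number", "description": "Health rating from 1 to 10"},
        }
    },
}

print(json.dumps(crop_report_analyzer, indent=2))
```

Once defined and trained on labeled samples, the analyzer is called the same way as a prebuilt one, returning the same field-plus-confidence structure.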
GreenLeaf scenario: GreenLeaf receives supplier invoices in 20 different formats. They use the prebuilt invoice model to extract vendor, amount, and due date, with no training needed. For their custom crop inspection reports, they train a custom model to extract field location, crop type, and health rating.
Building a document extraction app
```python
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential

client = DocumentIntelligenceClient(
    endpoint="https://your-resource.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("your-key"),
)

# Analyse an invoice with the prebuilt invoice model
with open("invoice.pdf", "rb") as f:
    result = client.begin_analyze_document(
        model_id="prebuilt-invoice",
        body=f.read(),
    ).result()

# Extract the fields we care about; each field may be absent
for document in result.documents:
    vendor = document.fields.get("VendorName")
    total = document.fields.get("InvoiceTotal")
    date = document.fields.get("InvoiceDate")
    print(f"Vendor: {vendor.content if vendor else 'N/A'}")
    print(f"Total: {total.content if total else 'N/A'}")
    print(f"Date: {date.content if date else 'N/A'}")
```
How it works under the hood
Content Understanding processes documents in layers:
| Layer | What Happens |
|---|---|
| 1. OCR | Reads all text from the document (printed and handwritten) |
| 2. Layout analysis | Identifies tables, headers, paragraphs, sections, and page structure |
| 3. Field mapping | Maps specific text regions to named fields based on the model |
| 4. Confidence scoring | Each extracted field includes a confidence score (0.0 to 1.0) |
| 5. Validation | Checks formats: dates look like dates, amounts look like amounts |
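The validation layer can be approximated on the client side as well: after extraction, re-check that a field's content matches its expected format before trusting it downstream. The checks below are a minimal local sketch, not the service's actual validation logic.

```python
import re
from datetime import datetime

def looks_like_amount(text: str) -> bool:
    """Check that an extracted total looks like a currency amount, e.g. '$1,234.56'."""
    return bool(re.fullmatch(r"[$€£]?\s?\d{1,3}(,\d{3})*(\.\d{2})?", text.strip()))

def looks_like_date(text: str) -> bool:
    """Check that an extracted date parses in a few common formats."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d %B %Y"):
        try:
            datetime.strptime(text.strip(), fmt)
            return True
        except ValueError:
            pass
    return False

print(looks_like_amount("$1,234.56"))     # True
print(looks_like_date("2025-03-14"))      # True
print(looks_like_amount("see attached"))  # False
```

A failed format check is a good signal to route the document to human review, regardless of the model's confidence score.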
Confidence scores and handling uncertainty
Every extracted field includes a confidence score:
- 0.90-1.00: high confidence, likely correct
- 0.70-0.89: medium confidence, may need review
- Below 0.70: low confidence, likely needs human verification
Best practice: Set a threshold (e.g., 0.85) and flag documents below it for human review. This gives you automation speed with human accuracy.
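That best practice can be sketched as a simple routing function. The threshold value and the shape of the `fields` dict (field name mapped to a value/confidence pair) are assumptions made for illustration.

```python
REVIEW_THRESHOLD = 0.85  # tune per document type and business risk

def route_document(fields: dict[str, tuple[str, float]]) -> str:
    """Route a document to straight-through processing or human review.

    `fields` maps a field name to (extracted value, confidence score),
    mirroring the per-field confidence the service returns.
    """
    low = [name for name, (_, conf) in fields.items() if conf < REVIEW_THRESHOLD]
    if low:
        return f"human-review (low confidence: {', '.join(low)})"
    return "auto-process"

invoice = {
    "VendorName": ("Contoso Ltd", 0.97),
    "InvoiceTotal": ("$1,250.00", 0.65),  # below threshold, so flagged
}
print(route_document(invoice))  # human-review (low confidence: InvoiceTotal)
```

Flagging at the field level, rather than rejecting the whole document, lets reviewers correct only the uncertain values.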
Exam relevance: The exam may test your understanding of confidence thresholds and when to involve human review; this connects to the reliability and safety responsible AI principle.
🎬 Video walkthrough
🎬 Video coming soon
Content Understanding: Documents (AI-901 Module 23)
~14 min
Knowledge Check
GreenLeaf processes invoices from 20 different suppliers, each with a different format. They want to extract vendor name, total amount, and due date from each. What's the best approach?
Content Understanding extracts an invoice total with a confidence score of 0.65. What should the application do?
Next up: Multimodal Extraction, pulling data from images, audio, and video using Content Understanding.