Extracting Content with Content Understanding

From chaos to structure

Simple explanation

Content Understanding takes messy real-world documents (scanned invoices, handwritten forms, photographed receipts) and converts them into clean, structured data that your applications and AI models can work with.

It’s like a super-efficient data entry clerk who can read any document, understand its layout, and extract the specific fields you need — all in seconds, not hours.

Content Understanding pipeline

Stage	What It Does	Output
OCR	Reads all text from the document	Raw text with positions
Layout analysis	Identifies structure: tables, headings, sections	Structured layout map
Field extraction	Pulls specific values based on the document type	Named fields with values and confidence scores
Output formatting	Converts to structured JSON or clean Markdown	Ready for storage, APIs, or LLM consumption
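
The four stages in the table can be pictured as a chain of functions. The sketch below is a toy stand-in, not the service itself: the OCR step is simulated with hard-coded lines, the layout and extraction rules are simple string checks, and every name (`ocr_lines`, `analyze_layout`, `extract_fields`, `format_output`) is hypothetical. The real service relies on trained models at each stage.

```python
import json
import re

# Stage 1 (OCR) is simulated: pretend these lines were read from a scan,
# each with a hypothetical vertical position on the page.
ocr_lines = [
    {"text": "INVOICE", "y": 10},
    {"text": "Invoice number: 12345", "y": 40},
    {"text": "Total: 99.50", "y": 70},
]

def analyze_layout(lines):
    """Stage 2: order lines top to bottom and tag each one (toy rule)."""
    layout = []
    for line in sorted(lines, key=lambda item: item["y"]):
        kind = "key_value" if ":" in line["text"] else "heading"
        layout.append({"kind": kind, "text": line["text"]})
    return layout

def extract_fields(layout, wanted):
    """Stage 3: pull the requested fields out of key-value lines."""
    fields = {}
    for item in layout:
        if item["kind"] != "key_value":
            continue
        key, value = item["text"].split(":", 1)
        name = re.sub(r"\s+", "_", key.strip().lower())
        if name in wanted:
            fields[name] = value.strip()
    return fields

def format_output(fields):
    """Stage 4: serialize the named fields as JSON."""
    return json.dumps(fields, sort_keys=True)

layout = analyze_layout(ocr_lines)
fields = extract_fields(layout, wanted={"invoice_number", "total"})
result = format_output(fields)
print(result)
```

Each stage consumes only the previous stage's output, which is why a weak scan (stage 1) tends to degrade everything downstream.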

Two output modes

Structured JSON vs Markdown extraction
Feature	Structured JSON Output	Markdown Output
Format	JSON with named fields and values	Clean Markdown preserving document structure
Best for	Database storage, API consumption, form processing	LLM reasoning in RAG, agent knowledge, downstream AI
Example use	Invoice processing: extract 'invoice_number: 12345'	Convert contract to Markdown for compliance agent to reason about
Precision	High — specific fields with confidence scores	Comprehensive — full document content preserved
Flexibility	Need to define which fields to extract	All content preserved, model decides what's relevant
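
To make the comparison concrete, here is a minimal sketch of how the same invoice might look in each mode and how each would typically be consumed. The data shapes, field names, confidence values, and helper functions are all assumptions for illustration; they are not the service's actual response format.

```python
import json

# Hypothetical structured result: named fields with values and confidence.
structured = {
    "invoice_number": {"value": "12345", "confidence": 0.99},
    "total": {"value": 99.50, "confidence": 0.97},
}

# Hypothetical Markdown rendering of the same invoice.
markdown = """# Invoice 12345

| Item | Qty | Price |
|------|-----|-------|
| Widget | 2 | 49.75 |

**Total:** 99.50
"""

def to_db_row(fields):
    """Flatten named fields into a row a database insert could use."""
    return {name: field["value"] for name, field in fields.items()}

def to_prompt(document_md, question):
    """Wrap the full Markdown document as context for an LLM question."""
    return (
        "Answer using only this document:\n\n"
        f"{document_md}\nQuestion: {question}"
    )

row = to_db_row(structured)
prompt = to_prompt(markdown, "What was purchased?")
print(json.dumps(row))
print(prompt)
```

Note that the line-item table survives only in the Markdown version here, because the structured result contains just the two fields that were requested.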

Analyzers for different document types

Analyzer	Document Types	Extracted Fields
Invoice	Invoices, bills	Invoice number, date, vendor, line items, total, tax
Receipt	Receipts	Merchant, date, items, subtotal, tax, total, tip
ID document	Passports, driver’s licenses	Name, DOB, document number, expiry, nationality
Business card	Business cards	Name, title, company, phone, email, address
General document	Any document	Tables, key-value pairs, paragraphs, headings
Custom	Your specific formats	Fields you define with training examples
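
One practical use of the table above is checking an extraction result for completeness. The sketch below transcribes the field lists into a lookup and reports which expected fields are missing. The snake_case keys and the `missing_fields` helper are this example's own naming, not official analyzer or field identifiers.

```python
# Expected fields per analyzer, transcribed from the table above.
# Keys are hypothetical snake_case names chosen for this sketch.
EXPECTED_FIELDS = {
    "invoice": {"invoice_number", "date", "vendor", "line_items", "total", "tax"},
    "receipt": {"merchant", "date", "items", "subtotal", "tax", "total", "tip"},
    "id_document": {"name", "dob", "document_number", "expiry", "nationality"},
    "business_card": {"name", "title", "company", "phone", "email", "address"},
}

def missing_fields(analyzer, extracted):
    """Return the expected fields absent from a result, sorted by name."""
    expected = EXPECTED_FIELDS.get(analyzer)
    if expected is None:
        raise ValueError(f"no field list for analyzer {analyzer!r}")
    return sorted(expected - set(extracted))

# A made-up receipt result with only three fields present.
receipt_result = {"merchant": "Cafe Uno", "date": "2024-05-01", "total": 12.40}
gaps = missing_fields("receipt", receipt_result)
print(gaps)
```

A missing field is not always an error (many receipts have no tip), so a check like this is better suited to flagging documents for a second look than to rejecting them.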

Producing grounded representations for agents

Content Understanding’s Markdown output is particularly powerful for AI applications:

Use Case	How It Works
RAG grounding	Convert documents to clean Markdown → chunk → index → retrieve at query time
Agent knowledge	Extract Markdown → feed to agent as context for reasoning
Structured + reasoning	Extract specific fields as JSON + full Markdown for context
Downstream reasoning	Clean Markdown preserves tables, headings, and relationships for LLM understanding
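
The "chunk" step in the RAG row is where Markdown structure pays off: headings give natural split points. Below is a minimal sketch of heading-based chunking in plain Python. The `chunk_markdown` function and the sample contract are hypothetical, and a production chunker would usually also cap chunk size and carry parent headings along.

```python
import re

def chunk_markdown(md):
    """Split Markdown into chunks, one per heading section.

    Each chunk keeps its own heading, so a retrieved chunk still says
    which part of the document it came from.
    """
    chunks = []
    current = []
    for line in md.splitlines():
        # A new heading closes the chunk collected so far.
        if re.match(r"#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]

doc = """# Shipping contract

## Delivery
Goods ship within 5 days.

## Penalties
| Delay | Fee |
|-------|-----|
| 1-3 days | 2% |
"""

chunks = chunk_markdown(doc)
for chunk in chunks:
    print(repr(chunk))
```

Because the split happens only at headings, the penalties table stays in one piece inside its section instead of being cut mid-row.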

Real-world example: NeuralMed's medical record extraction

NeuralMed processes thousands of medical documents daily:

Lab reports (structured JSON):

Extract: test name, value, normal range, flag (high/low/normal)
Output: JSON directly into patient records database
Confidence threshold: 0.95 — below that, flag for human review
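
The confidence rule for lab reports can be sketched as a small routing function. The 0.95 threshold comes from the example above; the field names, sample values, and the `route_fields` helper are assumptions made up for illustration.

```python
REVIEW_THRESHOLD = 0.95  # threshold from the lab-report example

def route_fields(fields, threshold=REVIEW_THRESHOLD):
    """Split fields into auto-accepted values and values needing review.

    A field is accepted when its confidence is at or above the threshold;
    anything below goes to the human-review bucket.
    """
    accepted, review = {}, {}
    for name, field in fields.items():
        target = accepted if field["confidence"] >= threshold else review
        target[name] = field["value"]
    return accepted, review

# Hypothetical extraction result for one lab report.
lab_result = {
    "test_name": {"value": "Potassium", "confidence": 0.99},
    "result_value": {"value": "5.8 mmol/L", "confidence": 0.91},
    "flag": {"value": "high", "confidence": 0.97},
}

accepted, review = route_fields(lab_result)
print("accepted:", accepted)
print("needs review:", review)
```

Routing per field rather than per document means one smudged number sends only that value to a reviewer, while the rest of the report flows straight into the database.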

Clinical notes (Markdown):

Convert handwritten doctor notes to clean Markdown
Preserve structure: chief complaint, history, examination, assessment, plan
Markdown fed to diagnostic assistant agent for reasoning

Insurance forms (hybrid):

Structured fields: member ID, group number, dates (JSON for database)
Full form content: Markdown for compliance agent to verify coverage terms

Three document types, three extraction strategies — all through Content Understanding.

Exam tip: JSON vs Markdown output

The exam tests when to use each:

Need specific fields in a database? → Structured JSON (field extraction)
Need full content for an LLM to reason about? → Markdown output
Need both? → Extract JSON fields AND produce Markdown — they’re not mutually exclusive

Key rule: JSON for databases and APIs, Markdown for LLMs.
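
The decision rule above fits in a few lines of code. This is only a mnemonic in code form; the `choose_output` function and its return labels are hypothetical, and the three calls mirror the NeuralMed document types.

```python
def choose_output(needs_fields, needs_llm_context):
    """Pick an output mode from the two questions in the exam tip."""
    if needs_fields and needs_llm_context:
        return "json+markdown"  # the two modes are not mutually exclusive
    if needs_fields:
        return "json"
    if needs_llm_context:
        return "markdown"
    raise ValueError("no extraction requirement given")

# NeuralMed's three document types from the example earlier.
print(choose_output(needs_fields=True, needs_llm_context=False))   # lab reports
print(choose_output(needs_fields=False, needs_llm_context=True))   # clinical notes
print(choose_output(needs_fields=True, needs_llm_context=True))    # insurance forms
```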

Key terms

Question

What is layout analysis in Content Understanding?

Answer

The process of understanding a document's structure — identifying tables, headings, sections, key-value pairs, and reading order. Enables accurate extraction even from complex multi-column layouts.

Question

What is Markdown output in Content Understanding?

Answer

A clean, structured Markdown representation of a document that preserves tables, headings, and content relationships. Ideal for feeding to LLMs in RAG and agent workflows because models reason well about Markdown.

Question

What is a custom analyzer in Content Understanding?

Answer

An analyzer configured for your specific document types and fields. You define the fields to extract, optionally supplying example documents with labelled fields, and Content Understanding extracts those fields from new documents of the same type.

Knowledge check

Atlas Financial receives 5,000 loan applications monthly as scanned PDFs. They need to extract the applicant name, loan amount, and employment status into their loan processing database. Which Content Understanding output should they use?

Kai's logistics agent needs to reason about shipping contracts to answer questions like 'What are the penalty clauses for late delivery?' The contracts are complex multi-page PDFs. Which Content Understanding output should feed the agent?