Extracting Content with Content Understanding
Content Understanding is Foundry's extraction powerhouse. Learn how to extract structured data from documents using multimodal pipelines that combine OCR, layout analysis, and field extraction.
From chaos to structure
Content Understanding takes messy real-world documents (scanned invoices, handwritten forms, photographed receipts) and converts them into clean, structured data that your AI can work with.
It's like a super-efficient data entry clerk who can read any document, understand its layout, and extract the specific fields you need, all in seconds rather than hours.
Content Understanding pipeline
| Stage | What It Does | Output |
|---|---|---|
| OCR | Reads all text from the document | Raw text with positions |
| Layout analysis | Identifies structure: tables, headings, sections | Structured layout map |
| Field extraction | Pulls specific values based on the document type | Named fields with values and confidence |
| Output formatting | Converts to structured JSON or clean Markdown | Ready for storage, APIs, or LLM consumption |
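
To make the data flow concrete, here is a toy Python sketch of the four stages. It is not the Content Understanding service or its SDK: the `TextLine` class, the function names, the regex patterns, and the fixed confidence value are all assumptions made for illustration, and the OCR stage is represented by hand-written sample lines.

```python
import json
import re
from dataclasses import dataclass


@dataclass
class TextLine:
    """Stand-in for one OCR result: text plus a position on the page."""
    text: str
    page: int
    y: float  # 0.0 = top of page, 1.0 = bottom (hypothetical convention)


def analyze_layout(lines):
    """Stage 2 stand-in: put lines in reading order (page, then top to bottom)."""
    return sorted(lines, key=lambda line: (line.page, line.y))


def extract_fields(lines):
    """Stage 3 stand-in: pull named values out of the ordered text."""
    patterns = {
        "invoice_number": r"Invoice\s*#\s*(\d+)",
        "total": r"Total:\s*\$?([\d.]+)",
    }
    fields = {}
    for name, pattern in patterns.items():
        for line in lines:
            match = re.search(pattern, line.text)
            if match:
                # The confidence here is a made-up constant, not a real score.
                fields[name] = {"value": match.group(1), "confidence": 0.99}
                break
    return fields


def format_output(fields):
    """Stage 4 stand-in: serialize the named fields as JSON."""
    return json.dumps(fields, indent=2, sort_keys=True)


# Stage 1 stand-in: pretend OCR already produced these lines, out of order.
ocr_lines = [
    TextLine("Total: $118.50", page=1, y=0.9),
    TextLine("Invoice # 12345", page=1, y=0.1),
]

ordered = analyze_layout(ocr_lines)
fields = extract_fields(ordered)
print(format_output(fields))
```

Each stage consumes the previous stage's output, which is the point of the table: field extraction only works once the text has been read and put into a sensible order.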
Two output modes
| Feature | Structured JSON Output | Markdown Output |
|---|---|---|
| Format | JSON with named fields and values | Clean Markdown preserving document structure |
| Best for | Database storage, API consumption, form processing | LLM reasoning in RAG, agent knowledge, downstream AI |
| Example use | Invoice processing: extract 'invoice_number: 12345' | Convert contract to Markdown for compliance agent to reason about |
| Precision | High: specific fields with confidence scores | Comprehensive: full document content preserved |
| Flexibility | Need to define which fields to extract | All content preserved, model decides what's relevant |
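
The difference between the two modes is easiest to see in how each output gets consumed. The sketch below assumes a simplified result shape (field name mapped to `value` and `confidence`) and sample invoice content invented for illustration. The real service's response schema may differ, and `to_db_row` and `to_prompt` are hypothetical helpers.

```python
import json

# Assumed shape of a structured JSON result (illustrative only).
structured_json = (
    '{"invoice_number": {"value": "12345", "confidence": 0.98},'
    ' "total": {"value": "118.50", "confidence": 0.97}}'
)

# Assumed Markdown rendering of the same document (illustrative only).
markdown_output = """# Invoice 12345

| Item | Amount |
|---|---|
| Widget | 118.50 |
"""


def to_db_row(raw_json):
    """Structured JSON path: flatten named fields into a row for storage."""
    result = json.loads(raw_json)
    return {name: field["value"] for name, field in result.items()}


def to_prompt(markdown, question):
    """Markdown path: hand the whole document to an LLM as context."""
    return f"Use only the document below to answer.\n\n{markdown}\nQuestion: {question}"


row = to_db_row(structured_json)
prompt = to_prompt(markdown_output, "What is the invoice total?")
print(row)
print(prompt)
```

The JSON path discards everything except the fields you asked for, while the Markdown path keeps the table and headings so the model can decide what matters.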
Analyzers for different document types
| Analyzer | Document Types | Extracted Fields |
|---|---|---|
| Invoice | Invoices, bills | Invoice number, date, vendor, line items, total, tax |
| Receipt | Receipts | Merchant, date, items, subtotal, tax, total, tip |
| ID document | Passports, driver's licenses | Name, DOB, document number, expiry, nationality |
| Business card | Business cards | Name, title, company, phone, email, address |
| General document | Any document | Tables, key-value pairs, paragraphs, headings |
| Custom | Your specific formats | Fields you define with training examples |
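
One way to picture analyzer selection is as a lookup with a general-purpose fallback. The sketch below is an assumption-laden illustration: the analyzer labels (`invoice`, `id_document`, and so on) are hypothetical names taken from the table above, not the service's actual analyzer identifiers, and `pick_analyzer` is not part of any SDK.

```python
# Hypothetical labels mirroring the table above, not real analyzer IDs.
ANALYZER_BY_DOC_TYPE = {
    "invoice": "invoice",
    "bill": "invoice",
    "receipt": "receipt",
    "passport": "id_document",
    "drivers_license": "id_document",
    "business_card": "business_card",
}


def pick_analyzer(doc_type, custom_analyzers=frozenset()):
    """Return a custom analyzer if one is defined, else a prebuilt one,
    else fall back to the general document analyzer."""
    key = doc_type.strip().lower().replace(" ", "_")
    if key in custom_analyzers:
        return key
    return ANALYZER_BY_DOC_TYPE.get(key, "general_document")


print(pick_analyzer("Bill"))
print(pick_analyzer("memo"))
print(pick_analyzer("lab report", custom_analyzers={"lab_report"}))
```

Checking custom analyzers first reflects the idea that a format you trained on your own examples should win over a generic match.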
Producing grounded representations for agents
Content Understanding's Markdown output is particularly powerful for AI applications:
| Use Case | How It Works |
|---|---|
| RAG grounding | Convert documents to clean Markdown → chunk → index → retrieve for RAG |
| Agent knowledge | Extract Markdown → feed to agent as context for reasoning |
| Structured + reasoning | Extract specific fields as JSON + full Markdown for context |
| Downstream reasoning | Clean Markdown preserves tables, headings, and relationships for LLM understanding |
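
The RAG grounding flow (Markdown, then chunk, index, retrieve) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the contract text is invented, `chunk_markdown` and `retrieve` are hypothetical helpers, and keyword overlap stands in for the vector index a real RAG system would typically use.

```python
import re


def chunk_markdown(markdown, max_chars=500):
    """Split Markdown at headings, then cap each chunk at max_chars."""
    # Zero-width split just before any line that starts with 1-6 '#' and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks


def retrieve(chunks, query, top_k=1):
    """Rank chunks by how many query words they contain (stand-in for a vector index)."""
    terms = set(re.findall(r"\w+", query.lower()))

    def overlap(chunk):
        return len(terms & set(re.findall(r"\w+", chunk.lower())))

    return sorted(chunks, key=overlap, reverse=True)[:top_k]


# Invented sample of Markdown output for a shipping contract.
contract_md = """# Contract
Intro text.

## Penalties
Late delivery: 2% per day.

## Term
12 months.
"""

chunks = chunk_markdown(contract_md)
best = retrieve(chunks, "late delivery penalties")
print(best[0])
```

Splitting at headings keeps each section's title attached to its body, which is one reason structure-preserving Markdown tends to chunk more cleanly than raw OCR text.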
Real-world example: NeuralMed's medical record extraction
NeuralMed processes thousands of medical documents daily:
Lab reports (structured JSON):
- Extract: test name, value, normal range, flag (high/low/normal)
- Output: JSON directly into patient records database
- Confidence threshold: 0.95; anything below that is flagged for human review
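
A review threshold like the one in the lab report bullets above could be applied as follows. The field names come from those bullets, but the result shape, the sample values, and the `route_fields` helper are assumptions for illustration rather than NeuralMed's or the service's actual implementation.

```python
REVIEW_THRESHOLD = 0.95  # from the lab report example above


def route_fields(fields, threshold=REVIEW_THRESHOLD):
    """Split extracted fields into auto-accepted and flagged-for-review."""
    accepted, needs_review = {}, {}
    for name, field in fields.items():
        # Scores at or above the threshold pass; anything below is flagged.
        target = accepted if field["confidence"] >= threshold else needs_review
        target[name] = field["value"]
    return accepted, needs_review


# Invented sample result for one lab report row.
lab_result = {
    "test_name": {"value": "Hemoglobin", "confidence": 0.99},
    "value": {"value": "13.8", "confidence": 0.91},
    "normal_range": {"value": "12.0-15.5", "confidence": 0.97},
    "flag": {"value": "normal", "confidence": 0.95},
}

accepted, needs_review = route_fields(lab_result)
print("accepted:", accepted)
print("needs review:", needs_review)
```

Routing per field rather than per document means a single smudged value does not force a human to recheck the whole report.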
Clinical notes (Markdown):
- Convert handwritten doctor notes to clean Markdown
- Preserve structure: chief complaint, history, examination, assessment, plan
- Markdown fed to diagnostic assistant agent for reasoning
Insurance forms (hybrid):
- Structured fields: member ID, group number, dates (JSON for database)
- Full form content: Markdown for compliance agent to verify coverage terms
Three document types, three extraction strategies, all through Content Understanding.
Exam tip: JSON vs Markdown output
The exam tests when to use each:
- Need specific fields in a database? → Structured JSON (field extraction)
- Need full content for an LLM to reason about? → Markdown output
- Need both? → Extract JSON fields AND produce Markdown; they're not mutually exclusive
Key rule: JSON for machines, Markdown for AI models.
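
The decision rule can be written down as a tiny function. This is only a memory aid: `choose_outputs` and the output labels are hypothetical names, and the scenarios reuse the NeuralMed examples from earlier.

```python
def choose_outputs(needs_named_fields, needs_llm_reasoning):
    """Apply the rule: JSON for named fields, Markdown for LLM reasoning, or both."""
    outputs = []
    if needs_named_fields:
        outputs.append("structured_json")
    if needs_llm_reasoning:
        outputs.append("markdown")
    return outputs


# (needs_named_fields, needs_llm_reasoning) for each NeuralMed document type.
SCENARIOS = {
    "lab report into patient database": (True, False),
    "clinical notes for diagnostic agent": (False, True),
    "insurance form for database and compliance agent": (True, True),
}

for name, needs in SCENARIOS.items():
    print(f"{name}: {choose_outputs(*needs)}")
```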
Knowledge check
Atlas Financial receives 5,000 loan applications monthly as scanned PDFs. They need to extract the applicant name, loan amount, and employment status into their loan processing database. Which Content Understanding output should they use?
Kai's logistics agent needs to reason about shipping contracts to answer questions like 'What are the penalty clauses for late delivery?' The contracts are complex multi-page PDFs. Which Content Understanding output should feed the agent?