Building an Extraction App

Building a complete extraction pipeline

Simple explanation

You’ve learned what Content Understanding can extract. Now let’s build a real app that processes documents automatically.

GreenLeaf receives hundreds of documents daily — invoices, delivery notes, quality reports. Instead of manual data entry, their app automatically: reads each document, extracts the important fields, checks confidence scores, routes low-confidence items for human review, and saves everything to their database.

Architecture of an extraction app

Documents arrive → Classify type → Select analyzer → Extract fields → Check confidence → Route
                                                                       ↓
                                                    High confidence: Save to database
                                                    Low confidence: Queue for human review
                                                    Error: Log and alert
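
The flow above can be sketched as one function with three outcomes. This is a minimal illustration, not GreenLeaf's actual app: `run_pipeline`, `fake_classify`, `fake_extract`, and the route names are hypothetical, and the two stand-in functions replace the real classification and extraction calls so the sketch runs without an Azure resource.

```python
import logging
from typing import Callable

logger = logging.getLogger("extraction_pipeline")


def run_pipeline(
    path: str,
    classify: Callable[[str], str],
    extract: Callable[[str, str], dict],
    threshold: float = 0.85,
) -> dict:
    """Classify -> extract -> check confidence -> route, for one document."""
    try:
        doc_type = classify(path)
        fields = extract(path, doc_type)
    except Exception as exc:
        # Error branch: log and report instead of crashing the whole run.
        logger.error("Extraction failed for %s: %s", path, exc)
        return {"path": path, "route": "error", "detail": str(exc)}

    # A missing confidence score is treated as low confidence (an assumption).
    low = [
        name for name, data in fields.items()
        if data["confidence"] is None or data["confidence"] < threshold
    ]
    route = "human_review" if low else "save_to_database"
    return {"path": path, "route": route, "doc_type": doc_type,
            "low_confidence": low, "fields": fields}


# Hypothetical stand-ins for the real classifier and extraction call.
def fake_classify(path: str) -> str:
    return "invoice" if "inv" in path else "unknown"


def fake_extract(path: str, doc_type: str) -> dict:
    if path.endswith(".bad"):
        raise ValueError("unsupported format")
    return {"InvoiceTotal": {"value": "120.00", "confidence": 0.97}}


print(run_pipeline("inv_001.pdf", fake_classify, fake_extract)["route"])
print(run_pipeline("inv_002.bad", fake_classify, fake_extract)["route"])
```

Passing the classifier and extractor in as arguments keeps the routing logic testable on its own; the steps below fill in what those two functions might do.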

Building the app: step by step

Step 1: Process a document

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential

client = DocumentIntelligenceClient(
    endpoint="https://your-resource.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("your-key")
)

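def process_invoice(file_path):
    # Submit the file to the prebuilt invoice model and wait for the result.
    # DocumentIntelligenceClient takes the model ID as its first argument.
    with open(file_path, "rb") as f:
        result = client.begin_analyze_document(
            "prebuilt-invoice",
            body=f.read()
        ).result()

    # Collect each field's raw text and confidence score (either may be None).
    extracted = {}
    for doc in result.documents or []:
        for field_name, field in (doc.fields or {}).items():
            extracted[field_name] = {
                "value": field.content,
                "confidence": field.confidence
            }

    return extracted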

Step 2: Route by confidence

CONFIDENCE_THRESHOLD = 0.85

def route_document(extracted_data):
    low_confidence_fields = []

    for field_name, field_data in extracted_data.items():
        confidence = field_data["confidence"]
        # A missing score (None) is treated as low confidence.
        if confidence is None or confidence < CONFIDENCE_THRESHOLD:
            low_confidence_fields.append(field_name)

    if low_confidence_fields:
        return "human_review", low_confidence_fields
    return "auto_accept", []
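
As a worked example, here is the routing rule applied to an invoice with one confident field and one uncertain field. The block restates the Step 2 function so it runs on its own; the sample values are made up for illustration, and treating a missing score as low confidence is an assumption.

```python
CONFIDENCE_THRESHOLD = 0.85  # same value as Step 2


def route_document(extracted_data, threshold=CONFIDENCE_THRESHOLD):
    # Flag every field whose score is missing or below the threshold.
    low = [
        name for name, data in extracted_data.items()
        if data["confidence"] is None or data["confidence"] < threshold
    ]
    return ("human_review", low) if low else ("auto_accept", [])


# Illustrative values only: the vendor name is confident, the total is not.
sample = {
    "VendorName": {"value": "Contoso Seeds", "confidence": 0.95},
    "InvoiceTotal": {"value": "1,240.00", "confidence": 0.72},
}

decision, flagged = route_document(sample)
print(decision, flagged)  # human_review ['InvoiceTotal']
```

One low-confidence field is enough to send the whole document to review, and the returned list tells the reviewer which field to check.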

Step 3: Handle multiple document types

def process_document(file_path, doc_type):
    model_map = {
        "invoice": "prebuilt-invoice",
        "receipt": "prebuilt-receipt",
        "id_card": "prebuilt-idDocument",
        "crop_report": "custom-crop-report"  # Custom model
    }

    # Unknown document types fall back to the general layout model.
    model_id = model_map.get(doc_type, "prebuilt-layout")

    with open(file_path, "rb") as f:
        result = client.begin_analyze_document(
            model_id,
            body=f.read()
        ).result()

    return result
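
The model lookup in Step 3 can also be pulled out into its own function so it can be checked without calling the service. This is a small sketch: `select_model`, `MODEL_MAP`, and `FALLBACK_MODEL` are hypothetical names, and normalising the input with `strip().lower()` is an added assumption, not part of the original step.

```python
MODEL_MAP = {
    "invoice": "prebuilt-invoice",
    "receipt": "prebuilt-receipt",
    "id_card": "prebuilt-idDocument",
    "crop_report": "custom-crop-report",  # custom model ID from Step 3
}
FALLBACK_MODEL = "prebuilt-layout"


def select_model(doc_type: str) -> str:
    """Return the model ID for a document type, or the layout fallback."""
    return MODEL_MAP.get(doc_type.strip().lower(), FALLBACK_MODEL)


print(select_model("invoice"))      # prebuilt-invoice
print(select_model("crop_report"))  # custom-crop-report
print(select_model("memo"))         # prebuilt-layout
```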

Production best practices

Practice	Why
Set confidence thresholds	Route uncertain extractions for human review
Handle errors gracefully	Corrupted files, unsupported formats, API timeouts
Log everything	Track extraction accuracy, common failures, throughput
Batch processing	Process documents in parallel for high volume
Validate extracted data	Check formats (dates, numbers, emails) before saving
Version your custom analyzers	Track model performance over time, roll back if needed
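
Two of the practices above, validating extracted data and batch processing with graceful error handling, can be sketched with the standard library alone. Everything here is illustrative: `validate_fields` and `process_batch` are hypothetical helpers, the expected date format is an assumption, and `VendorEmail` is a made-up field used only to show an email check.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime


def validate_fields(values: dict) -> list:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    try:
        # Assumes dates were normalised to YYYY-MM-DD before validation.
        datetime.strptime(values.get("InvoiceDate") or "", "%Y-%m-%d")
    except ValueError:
        problems.append("InvoiceDate is not YYYY-MM-DD")
    try:
        float((values.get("InvoiceTotal") or "").replace(",", ""))
    except ValueError:
        problems.append("InvoiceTotal is not a number")
    email = values.get("VendorEmail")  # hypothetical optional field
    if email is not None and not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email):
        problems.append("VendorEmail is not a valid address")
    return problems


def process_batch(paths, process_one, max_workers=4):
    """Run process_one over paths in parallel, capturing failures per file."""
    def safe(path):
        try:
            return {"path": path, "ok": True, "result": process_one(path)}
        except Exception as exc:
            # One bad file should not stop the rest of the batch.
            return {"path": path, "ok": False, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe, paths))  # results keep the input order


print(validate_fields({"InvoiceDate": "2024-03-01", "InvoiceTotal": "1,240.00"}))
print(validate_fields({"InvoiceDate": "03/01/2024", "InvoiceTotal": "n/a"}))
```

Threads suit this workload because each document spends most of its time waiting on the network; the worker count of 4 is arbitrary and would need tuning against the service's rate limits.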

Human-in-the-loop pattern

The human-in-the-loop pattern is critical for production extraction apps:

1. AI extracts data automatically (fast, cheap)
2. High-confidence results are accepted automatically
3. Low-confidence results are queued for human review
4. Humans correct errors and confirm uncertain extractions
5. Corrected data can be used to improve the model (custom training)

This pattern balances the speed of automation with the accuracy of human review, and it maps directly to the responsible AI principle of reliability and safety.
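
The review and correction steps of this pattern might look like the sketch below. It is one possible shape, not a prescribed design: `ReviewItem` and `apply_review` are hypothetical names, and the record of corrections stands in for whatever store would later feed custom training.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewItem:
    document_id: str
    fields: dict      # field name -> {"value": ..., "confidence": ...}
    flagged: list     # names of the low-confidence fields
    corrections: dict = field(default_factory=dict)


def apply_review(item: ReviewItem, reviewer_values: dict) -> dict:
    """Merge a reviewer's values for flagged fields into the final record.

    Changed values are recorded in item.corrections so they can be
    collected later as labelled examples.
    """
    final = {name: data["value"] for name, data in item.fields.items()}
    for name in item.flagged:
        if name not in reviewer_values:
            continue  # reviewer left this field untouched
        if reviewer_values[name] != final.get(name):
            item.corrections[name] = {
                "extracted": final.get(name),
                "corrected": reviewer_values[name],
            }
        final[name] = reviewer_values[name]
    return final


# Illustrative values only.
item = ReviewItem(
    document_id="inv-001",
    fields={
        "VendorName": {"value": "Contoso Seeds", "confidence": 0.95},
        "InvoiceTotal": {"value": "1,240.00", "confidence": 0.72},
    },
    flagged=["InvoiceTotal"],
)
final = apply_review(item, {"InvoiceTotal": "1,248.00"})
print(final)
print(item.corrections)
```

Only flagged fields can be overwritten here, so a reviewer cannot accidentally change a value the app already accepted.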

Exam relevance: Expect questions about when human review is needed and how confidence thresholds work.

Flashcards

Question

What is the human-in-the-loop pattern in extraction apps?

Answer

AI extracts data automatically. High-confidence results are accepted, and low-confidence results go to humans for review. This balances automation speed with human accuracy and maps to the responsible AI principle of reliability and safety.

Question

What should an extraction app do when a field has low confidence?

Answer

Queue the document for human review rather than accepting potentially incorrect data. Set a confidence threshold (0.85 in this lesson; tune it to how costly an error would be) and flag any field that falls below it.

Question

How do you process different document types in one extraction app?

Answer

Map document types to model IDs: invoices → prebuilt-invoice, receipts → prebuilt-receipt, custom documents → your custom model. Detect or specify the document type, then use the appropriate model.

Knowledge Check

GreenLeaf's extraction app processes an invoice. The vendor name has confidence 0.95 but the total amount has confidence 0.72. What should the app do?

MediSpark receives three document types: patient intake forms (custom format), standard invoices, and photo IDs. How should they set up their extraction app?

Next up: Exam Prep — reviewing everything you’ve learned and getting ready for the AI-901 exam.