Building an Extraction App
Build a complete information extraction application using Azure Content Understanding: process documents, route by confidence, and handle multiple document types.
Building a complete extraction pipeline
You've learned what Content Understanding can extract. Now let's build a real app that processes documents automatically.
GreenLeaf receives hundreds of documents daily: invoices, delivery notes, and quality reports. Instead of relying on manual data entry, their app automatically reads each document, extracts the important fields, checks confidence scores, routes low-confidence items for human review, and saves everything to their database.
Architecture of an extraction app
Documents arrive → Classify type → Select analyzer → Extract fields → Check confidence → Route
↓
High confidence: Save to database
Low confidence: Queue for human review
Error: Log and alert
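The three routing outcomes above can be sketched as one small function. This is a minimal sketch under stated assumptions, not GreenLeaf's actual implementation: `handle_document`, the `Extractor` type, the route names, and the 0.85 threshold are hypothetical, and the extraction step is passed in as a function so the control flow can be read without calling Azure.

```python
import logging
from typing import Callable, Dict

logger = logging.getLogger("extraction_app")

# Hypothetical threshold; tune it against your own review data.
CONFIDENCE_THRESHOLD = 0.85

# An extractor maps a file path to {field_name: {"value": ..., "confidence": ...}}.
Extractor = Callable[[str], Dict[str, dict]]


def handle_document(file_path: str, extract: Extractor) -> str:
    """Return the route for one document: 'database', 'human_review', or 'error'."""
    try:
        fields = extract(file_path)
    except Exception:
        # Error branch: log with the traceback so an alerting system can pick it up.
        logger.exception("Extraction failed for %s", file_path)
        return "error"

    # A missing confidence score is treated as low confidence.
    low = [
        name
        for name, data in fields.items()
        if data.get("confidence") is None or data["confidence"] < CONFIDENCE_THRESHOLD
    ]
    # An empty extraction is also sent to a human rather than saved silently.
    if not fields or low:
        return "human_review"
    return "database"
```

Because the extractor is injected, the same function works with a real analyzer call or with a stub in a unit test.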
Building the app: step by step
Step 1: Process a document
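from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential

# Replace the endpoint and key with your own resource's values.
client = DocumentIntelligenceClient(
    endpoint="https://your-resource.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("your-key"),
)

def process_invoice(file_path):
    """Analyze one invoice and return {field_name: {"value", "confidence"}}."""
    with open(file_path, "rb") as f:
        # In the Document Intelligence SDK, the model ID is the first
        # positional argument (model_id) and the open file is the body.
        result = client.begin_analyze_document(
            "prebuilt-invoice",
            body=f,
        ).result()

    extracted = {}
    # documents and fields can be empty when nothing was recognized.
    for doc in result.documents or []:
        for field_name, field in (doc.fields or {}).items():
            extracted[field_name] = {
                "value": field.content,
                "confidence": field.confidence,
            }
    return extracted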
Step 2: Route by confidence
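CONFIDENCE_THRESHOLD = 0.85

def route_document(extracted_data):
    """Return (route, low_confidence_fields) for one document's fields."""
    low_confidence_fields = []
    for field_name, field_data in extracted_data.items():
        confidence = field_data["confidence"]
        # A missing score is treated as low confidence instead of raising.
        if confidence is None or confidence < CONFIDENCE_THRESHOLD:
            low_confidence_fields.append(field_name)

    if low_confidence_fields:
        return "human_review", low_confidence_fields
    return "auto_accept", []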
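Choosing the threshold is a trade-off: a higher value sends more documents to human review, a lower value accepts more automatically. The sketch below illustrates this with made-up confidence scores (the sample data and the helper names `needs_review` and `review_rate` are assumptions for illustration, not measurements from a real analyzer).

```python
from typing import Dict, List

# Illustrative, made-up per-field confidence scores for three documents.
SAMPLE_DOCS: List[Dict[str, float]] = [
    {"VendorName": 0.95, "InvoiceTotal": 0.72},
    {"VendorName": 0.99, "InvoiceTotal": 0.91},
    {"VendorName": 0.88, "InvoiceTotal": 0.86},
]


def needs_review(confidences: Dict[str, float], threshold: float) -> bool:
    """A document needs review if any field falls below the threshold."""
    return any(score < threshold for score in confidences.values())


def review_rate(docs: List[Dict[str, float]], threshold: float) -> float:
    """Fraction of documents that would be routed to human review."""
    if not docs:
        return 0.0
    flagged = sum(needs_review(doc, threshold) for doc in docs)
    return flagged / len(docs)


for threshold in (0.70, 0.85, 0.90):
    rate = review_rate(SAMPLE_DOCS, threshold)
    print(f"threshold={threshold:.2f} -> review rate={rate:.2f}")
# threshold=0.70 -> review rate=0.00
# threshold=0.85 -> review rate=0.33
# threshold=0.90 -> review rate=0.67
```

Running the same calculation on a sample of your own reviewed documents is one way to pick a threshold that keeps the review queue manageable.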
Step 3: Handle multiple document types
def process_document(file_path, doc_type):
    """Analyze a document with the model that matches its type."""
    model_map = {
        "invoice": "prebuilt-invoice",
        "receipt": "prebuilt-receipt",
        "id_card": "prebuilt-idDocument",
        "crop_report": "custom-crop-report",  # Custom model
    }
    # Unknown document types fall back to general layout analysis.
    model_id = model_map.get(doc_type, "prebuilt-layout")

    with open(file_path, "rb") as f:
        # The model ID is passed as the first positional argument.
        result = client.begin_analyze_document(
            model_id,
            body=f,
        ).result()
    return result
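The lookup-with-fallback in this step can be pulled into its own function so it can be checked without an Azure call. This is a sketch: `select_model` and the input normalization (trimming whitespace, lowercasing) are assumptions added for illustration, while the model IDs are the ones used in the step above.

```python
# Model IDs taken from the mapping above; "custom-crop-report" is a custom model.
MODEL_MAP = {
    "invoice": "prebuilt-invoice",
    "receipt": "prebuilt-receipt",
    "id_card": "prebuilt-idDocument",
    "crop_report": "custom-crop-report",
}
FALLBACK_MODEL = "prebuilt-layout"


def select_model(doc_type: str) -> str:
    """Return the model ID for a document type, falling back to layout analysis."""
    # Normalizing the key is an added assumption so "Receipt " still matches.
    return MODEL_MAP.get(doc_type.strip().lower(), FALLBACK_MODEL)


print(select_model("invoice"))        # prebuilt-invoice
print(select_model("Receipt "))       # prebuilt-receipt
print(select_model("delivery_note"))  # prebuilt-layout
```

An unrecognized type such as `delivery_note` still gets processed, but only with general layout extraction, so it is a good candidate for the human review queue.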
Production best practices
| Practice | Why |
|---|---|
| Set confidence thresholds | Route uncertain extractions for human review |
| Handle errors gracefully | Corrupted files, unsupported formats, API timeouts |
| Log everything | Track extraction accuracy, common failures, throughput |
| Batch processing | Process documents in parallel for high volume |
| Validate extracted data | Check formats (dates, numbers, emails) before saving |
| Version your custom analyzers | Track model performance over time, roll back if needed |
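As an illustration of the "Validate extracted data" row, the sketch below checks dates, amounts, and email addresses before a record is saved. The field names (`invoice_date`, `total`, `contact_email`), the expected ISO date format, and the deliberately simple email pattern are all assumptions; adapt them to the fields your analyzer actually returns.

```python
import re
from datetime import date
from decimal import Decimal, InvalidOperation
from typing import Callable, Dict, List

# Deliberately simple pattern; it only checks for a "name@domain.tld" shape.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_iso_date(text: str) -> bool:
    """True if the text parses as an ISO date such as 2024-03-15."""
    try:
        date.fromisoformat(text)
        return True
    except ValueError:
        return False


def is_amount(text: str) -> bool:
    """True if the text is a non-negative decimal, allowing thousands commas."""
    try:
        return Decimal(text.replace(",", "")) >= 0
    except InvalidOperation:
        return False


def is_email(text: str) -> bool:
    return EMAIL_RE.match(text) is not None


# Hypothetical field names, assumed to arrive as plain strings.
VALIDATORS: Dict[str, Callable[[str], bool]] = {
    "invoice_date": is_iso_date,
    "total": is_amount,
    "contact_email": is_email,
}


def invalid_fields(record: Dict[str, str]) -> List[str]:
    """Return the names of fields that are missing or fail their format check."""
    return [
        name
        for name, check in VALIDATORS.items()
        if name not in record or not check(record[name])
    ]
```

A record with a non-empty `invalid_fields` result can be routed to human review in the same way as a low-confidence extraction.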
Human-in-the-loop pattern
The human-in-the-loop pattern is critical for production extraction apps:
- AI extracts data automatically (fast, cheap)
- High-confidence results are accepted automatically
- Low-confidence results are queued for human review
- Humans correct errors and confirm uncertain extractions
- Corrected data can be used to improve the model (custom training)
This pattern balances automation speed with human accuracy, and it connects directly to the reliability and safety principle of responsible AI.
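The review and correction steps can be sketched with an in-memory queue. Everything here is hypothetical (`ReviewItem`, `ReviewQueue`, the document ID, and the sample values); a production app would typically use a durable queue and a review UI, but the shape of the loop is the same: flagged items wait for a person, and each correction is kept as a before/after pair that could later feed custom model training.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List


@dataclass
class ReviewItem:
    doc_id: str
    fields: Dict[str, str]  # extracted values, keyed by field name
    flagged: List[str]      # names of the low-confidence fields


@dataclass
class ReviewQueue:
    pending: Deque[ReviewItem] = field(default_factory=deque)
    corrections: List[dict] = field(default_factory=list)

    def enqueue(self, item: ReviewItem) -> None:
        self.pending.append(item)

    def resolve(self, corrected: Dict[str, str]) -> ReviewItem:
        """Apply a reviewer's values to the oldest pending item."""
        item = self.pending.popleft()
        for name, new_value in corrected.items():
            old_value = item.fields.get(name)
            if new_value != old_value:
                # Keep the before/after pair as a candidate training example.
                self.corrections.append(
                    {
                        "doc_id": item.doc_id,
                        "field": name,
                        "extracted": old_value,
                        "corrected": new_value,
                    }
                )
            item.fields[name] = new_value
        item.flagged = []  # the reviewer has confirmed or fixed every flagged field
        return item


queue = ReviewQueue()
queue.enqueue(ReviewItem("inv-001", {"InvoiceTotal": "1,280.00"}, ["InvoiceTotal"]))
done = queue.resolve({"InvoiceTotal": "1,230.00"})
print(done.fields)             # {'InvoiceTotal': '1,230.00'}
print(len(queue.corrections))  # 1
```

If the reviewer confirms a value without changing it, nothing is added to `corrections`, so the list only contains cases where the model was actually wrong.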
Exam relevance: Expect questions about when human review is needed and how confidence thresholds work.
Building an Extraction App (AI-901 Module 25)
Knowledge Check
GreenLeaf's extraction app processes an invoice. The vendor name has confidence 0.95 but the total amount has confidence 0.72. What should the app do?
MediSpark receives three document types: patient intake forms (custom format), standard invoices, and photo IDs. How should they set up their extraction app?
Next up: Exam Prep, reviewing everything you've learned and getting ready for the AI-901 exam.