Ingestion, Indexing & Grounding Pipelines

The data pipeline behind RAG

Simple explanation

If RAG is an open-book exam, the ingestion pipeline is the process of preparing, organising, and indexing all the books before the exam starts.

You take raw documents (PDFs, Word files, images, videos), break them into searchable chunks, add metadata, create embeddings for vector search, and load everything into a search index. When a user asks a question, the search index retrieves the most relevant chunks, typically within milliseconds.

This module covers the pipeline engineering — the plumbing that makes RAG work.
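
To make the flow concrete, here is a minimal in-memory sketch of chunk, embed, index, and search. It is a toy under stated assumptions: `embed` is a bag-of-words stand-in for a real embedding model, the index is a plain Python list rather than Azure AI Search, and all function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lower-cased term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(documents: dict[str, str]) -> list[dict]:
    """Chunk each document by paragraph, attach metadata, and embed each chunk."""
    index = []
    for source, text in documents.items():
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        for position, chunk in enumerate(paragraphs):
            index.append({"source": source, "position": position,
                          "text": chunk, "vector": embed(chunk)})
    return index


def search(index: list[dict], question: str, top: int = 1) -> list[dict]:
    """Return the `top` chunks most similar to the question."""
    query_vector = embed(question)
    ranked = sorted(index, key=lambda e: cosine(query_vector, e["vector"]), reverse=True)
    return ranked[:top]


# Hypothetical sample documents.
docs = {
    "returns.txt": "Refunds are issued within 14 days.\n\nItems must be unused to qualify for a refund.",
    "shipping.txt": "Standard shipping takes 5 working days.\n\nExpress shipping takes 1 working day.",
}
index = build_index(docs)
best = search(index, "How long does express shipping take?")[0]
print(best["source"], best["text"])
```

A real pipeline swaps each toy piece for a managed service, but the shape stays the same: every chunk carries its text, its metadata, and its vector.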

The ingestion pipeline

Stage	What Happens	Service
Source	Raw documents in storage (Blob, Data Lake, SharePoint)	Azure Storage
Crack	Extract text from files (PDF parsing, OCR for images, audio transcription)	Azure AI Search indexer + Content Understanding
Chunk	Split into search-friendly segments	Chunking strategy (fixed, semantic, paragraph)
Enrich	Add metadata, extract entities, classify content	Built-in skills or custom skills
Embed	Generate vector representations	Embedding model (text-embedding-3-small)
Index	Store in searchable index with field mappings	Azure AI Search
Serve	Connect to agents, workflows, and applications	Foundry SDK, agent tools
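
The Chunk stage is the one most often implemented by hand. The sketch below contrasts two of the strategies named in the table, fixed-size windows with overlap and paragraph-level chunks. The function names and the character-based size limits are assumptions for illustration; production pipelines often count tokens rather than characters.

```python
import re


def fixed_size_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of `size` characters; each window repeats the
    last `overlap` characters of the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must be in [0, size)")
    if not text:
        return []
    step = size - overlap
    return [text[start:start + size]
            for start in range(0, max(len(text) - overlap, 1), step)]


def paragraph_chunks(text: str, max_chars: int) -> list[str]:
    """Split on blank lines, then pack consecutive paragraphs into chunks of
    at most `max_chars` characters. A single paragraph longer than
    `max_chars` is kept whole rather than cut mid-sentence."""
    chunks, current = [], ""
    for paragraph in (p.strip() for p in re.split(r"\n\s*\n", text)):
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(fixed_size_chunks("abcdefghij", size=4, overlap=1))
print(paragraph_chunks("A one.\n\nB two.\n\nC three is longer.", max_chars=14))
```

Overlap keeps a sentence that straddles a boundary retrievable from either side, at the cost of indexing some text twice. Paragraph chunks avoid mid-sentence cuts but vary in size.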

Content types and ingestion methods

Content Type	Cracking Method	Key Considerations
PDF documents	Built-in PDF parser + OCR for scanned pages	OCR quality depends on scan quality
Office documents	Built-in parsers (Word, Excel, PowerPoint)	Tables and charts need special handling
Images	OCR for text, Content Understanding for structure	Built-in OCR supports handwritten text; very poor handwriting may need review
Audio files	Speech-to-text transcription	Language and accent affect accuracy
Video files	Frame extraction + audio transcription	High storage and compute requirements
Web pages	HTML parsing, content extraction	Exclude navigation, ads, boilerplate
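
When you build the cracking step yourself, the table above amounts to a dispatch on file type. A minimal sketch, assuming files are identified by extension; the extension list and the method labels are hypothetical placeholders, not names of Azure components.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the cracking method in the table.
CRACKING_METHODS = {
    ".pdf": "pdf_parser_with_ocr",
    ".docx": "office_parser",
    ".xlsx": "office_parser",
    ".pptx": "office_parser",
    ".png": "ocr",
    ".jpg": "ocr",
    ".jpeg": "ocr",
    ".mp3": "speech_to_text",
    ".wav": "speech_to_text",
    ".mp4": "frame_extraction_and_transcription",
    ".html": "html_parser",
    ".htm": "html_parser",
}


def cracking_method(filename: str) -> str:
    """Return the cracking method for a file, based on its extension."""
    suffix = Path(filename).suffix.lower()
    try:
        return CRACKING_METHODS[suffix]
    except KeyError:
        label = suffix or "files without an extension"
        raise ValueError(f"No cracking method configured for {label}") from None


for name in ["Report.PDF", "minutes.docx", "interview.wav"]:
    print(name, "->", cracking_method(name))
```

Failing loudly on unknown types is a deliberate choice here: a file that is silently skipped becomes a gap in the knowledge base that is hard to notice later.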

Enrichment skills

Skill Type	What It Does	Example
Built-in: Entity extraction	Identifies people, places, organisations	Tag documents with mentioned companies
Built-in: Language detection	Identifies document language	Route to correct language model
Built-in: Key phrase extraction	Extracts important phrases	Generate topic tags for filtering
Built-in: OCR	Reads text from images within documents	Extract text from embedded charts
Custom skill	Your own enrichment logic (API)	Industry-specific classification, PII detection
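
Built-in skills are configured declaratively in a skillset. The sketch below builds, as a Python dictionary, a skillset definition that chains language detection into key phrase extraction. The `@odata.type` values and the input and output names reflect the skillset definition format as I understand it, so check them against the current API reference before use. The skillset name, the `language` target name, and the `/document/content` source path are assumptions.

```python
import json


def build_skillset(name: str) -> dict:
    """Return a skillset definition with two built-in skills.
    The second skill consumes the language code produced by the first."""
    return {
        "name": name,
        "description": "Detect language, then extract key phrases",
        "skills": [
            {
                "@odata.type": "#Microsoft.Skills.Text.LanguageDetectionSkill",
                "context": "/document",
                "inputs": [{"name": "text", "source": "/document/content"}],
                # Written to /document/language in the enrichment tree.
                "outputs": [{"name": "languageCode", "targetName": "language"}],
            },
            {
                "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
                "context": "/document",
                "inputs": [
                    {"name": "text", "source": "/document/content"},
                    {"name": "languageCode", "source": "/document/language"},
                ],
                "outputs": [{"name": "keyPhrases", "targetName": "keyPhrases"}],
            },
        ],
    }


skillset = build_skillset("example-skillset")  # hypothetical name
print(json.dumps(skillset, indent=2))
```

For the enriched values to be usable as filters, the index also needs matching fields marked filterable, and the indexer needs output field mappings from the enrichment tree to those fields.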

Built-in vs custom enrichment skills

Feature	Built-in Skills	Custom Skills
Setup	Configure in the indexer skillset	Write code, deploy as API, reference in skillset
Maintenance	Managed by Microsoft	You manage the code and infrastructure
Capabilities	General-purpose NLP enrichment	Any custom logic you need
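Cost	Utility skills (such as text splitting) are free; AI-powered skills are billed through an attached Azure AI services resource beyond a small free daily allowance	Your compute costs + Search pricing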
Best for	Standard metadata enrichment	Domain-specific classification, PII, business logic
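
A custom skill is a web endpoint that receives a batch of records and returns one result per record, matched by `recordId`. The sketch below shows only the payload handling, with no web framework. The `values` / `recordId` / `data` / `errors` / `warnings` envelope follows the custom Web API skill contract as I understand it; the keyword classifier, its labels, and the `text` and `speciality` field names are hypothetical.

```python
# Hypothetical keyword lists; a real classifier would be a trained model.
KEYWORDS = {
    "cardiology": ("heart", "cardiac", "arrhythmia"),
    "neurology": ("brain", "seizure", "migraine"),
}


def classify_speciality(text: str) -> str:
    """Pick the label whose keywords occur most often, or 'general' if none do."""
    lowered = text.lower()
    scores = {label: sum(lowered.count(term) for term in terms)
              for label, terms in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "general"


def run_custom_skill(request_body: dict) -> dict:
    """Process a skill request and return one result per input record."""
    results = []
    for record in request_body.get("values", []):
        entry = {"recordId": record["recordId"], "data": {}, "errors": [], "warnings": []}
        text = record.get("data", {}).get("text")
        if isinstance(text, str):
            entry["data"]["speciality"] = classify_speciality(text)
        else:
            # Report the problem on this record only; the rest of the batch continues.
            entry["errors"].append({"message": "Input 'text' is missing or not a string."})
        results.append(entry)
    return {"values": results}


request = {"values": [
    {"recordId": "1", "data": {"text": "Cardiac arrhythmia after heart surgery"}},
    {"recordId": "2", "data": {}},
]}
response = run_custom_skill(request)
print(response)
```

Returning an error on the failing record, rather than failing the whole request, lets the indexer keep the results for the other documents in the batch.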

Connecting pipelines to agents

Connection Method	How It Works	Best For
Direct index query	Agent tool calls Azure AI Search directly	Full control over search parameters
Foundry IQ	Upload to Foundry IQ, auto-indexed	Quick agent setup, managed pipeline
Custom retrieval function	Agent calls your function, which queries the index	Complex retrieval logic, multi-index queries
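
The custom retrieval function option can be sketched as a function that fans a question out to several indexes and merges the hits. Everything here is an assumption for illustration: each index is represented by a plain callable taking `(question, top)` and returning dictionaries with `id`, `score`, and `text` keys, and the two fake searchers stand in for real index clients.

```python
from typing import Callable

SearchFn = Callable[[str, int], list[dict]]


def retrieve(question: str, searchers: dict[str, SearchFn], top: int = 3) -> list[dict]:
    """Query every index, deduplicate by document id (keeping the best-scoring
    copy), and return the `top` hits overall."""
    merged: dict[str, dict] = {}
    for index_name, search in searchers.items():
        for hit in search(question, top):
            tagged = {**hit, "index": index_name}
            existing = merged.get(hit["id"])
            if existing is None or tagged["score"] > existing["score"]:
                merged[hit["id"]] = tagged
    return sorted(merged.values(), key=lambda h: h["score"], reverse=True)[:top]


# Hypothetical stand-ins for two index clients.
def search_policies(question: str, top: int) -> list[dict]:
    return [{"id": "p1", "score": 0.9, "text": "Refund policy"},
            {"id": "shared", "score": 0.4, "text": "Contact details"}][:top]


def search_manuals(question: str, top: int) -> list[dict]:
    return [{"id": "m1", "score": 0.7, "text": "Device manual"},
            {"id": "shared", "score": 0.6, "text": "Contact details"}][:top]


hits = retrieve("How do I get a refund?",
                {"policies": search_policies, "manuals": search_manuals}, top=3)
for hit in hits:
    print(hit["index"], hit["id"], hit["score"])
```

Sorting by raw score assumes the indexes produce comparable scores. When they do not, merging by rank position is usually a safer choice.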

Real-world example: NeuralMed's medical article pipeline

NeuralMed ingests 10,000+ medical articles for their patient chatbot:

Source: PubMed articles in Blob Storage (PDF format)
Crack: PDF parser extracts text + OCR for embedded figures
Chunk: Paragraph-level chunking (medical context needs larger chunks)
Enrich:
- Built-in: entity extraction (drug names, conditions, treatments)
- Custom skill: medical speciality classifier (cardiology, neurology, etc.)
- Custom skill: PII detector (redacts patient info from case studies)
Embed: text-embedding-3-small for vector search
Index: Azure AI Search with hybrid search (keyword for drug names + vector for symptoms)
Serve: Connected to patient chatbot agent as a knowledge tool
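
Hybrid search in the Index step has to combine a keyword ranking with a vector ranking. Azure AI Search does this with Reciprocal Rank Fusion (RRF). The sketch below is a from-scratch illustration of RRF, not the service's implementation; the constant `k = 60` is the value commonly used in the literature, and the document ids are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids.
    Each document scores 1 / (k + rank) per list it appears in; ties are
    broken alphabetically so the output is deterministic."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_ranking = ["a", "b", "c"]  # e.g. exact matches on a drug name
vector_ranking = ["b", "d", "a"]   # e.g. semantically similar symptom descriptions
fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)
```

Because RRF uses only rank positions, it does not matter that keyword scores and vector similarity scores are on different scales. Documents that rank well in both lists rise to the top.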

The pipeline runs weekly. Incremental indexing processes only new or changed documents, so existing articles are not re-indexed on every run.
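
The idea behind incremental indexing can be shown with content fingerprints. This is an illustration of the concept only: Azure AI Search indexers do their own change detection on the data source, and the sketch below is not that mechanism. The function names, file names, and the use of SHA-256 hashes are assumptions.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(content).hexdigest()


def plan_incremental_run(current: dict[str, bytes], previous: dict[str, str]):
    """Compare current documents with the fingerprints saved by the last run.
    Returns (documents to process, documents to delete, state to save)."""
    new_state = {name: fingerprint(data) for name, data in current.items()}
    to_process = sorted(name for name, digest in new_state.items()
                        if previous.get(name) != digest)
    to_delete = sorted(set(previous) - set(new_state))
    return to_process, to_delete, new_state


# First run: nothing indexed yet, so everything is processed.
week1 = {"a.pdf": b"alpha", "b.pdf": b"beta"}
process1, delete1, state = plan_incremental_run(week1, {})

# Second run: a.pdf unchanged, b.pdf edited, c.pdf added.
week2 = {"a.pdf": b"alpha", "b.pdf": b"beta v2", "c.pdf": b"gamma"}
process2, delete2, state = plan_incremental_run(week2, state)
print(process2, delete2)
```

Tracking deletions matters as much as tracking changes: a document removed from the source but left in the index can still be retrieved and cited by the agent.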

Exam tip: OCR in the RAG pipeline

The exam specifically mentions OCR in RAG ingestion flows. Key points:

- OCR is needed for scanned PDFs, image-based documents, and photos of forms
- OCR quality directly affects RAG quality: garbage in, garbage out
- Azure AI Search’s built-in OCR skill handles common cases
- For high-accuracy OCR (medical, legal), use Content Understanding’s OCR capability

If a question mentions “scanned documents” in a RAG context, look for the answer that adds OCR to the pipeline.
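
In a hand-built pipeline, one common way to decide which pages need OCR is to check how much text the parser managed to extract. A rough sketch, assuming the parser returns one string per page; the 50-character threshold and the function names are arbitrary assumptions that would need tuning on real documents.

```python
def needs_ocr(extracted_text: str, min_chars: int = 50) -> bool:
    """Treat a page as image-only when the parser found almost no text.
    Whitespace is ignored so that blank layout lines do not count."""
    visible = "".join(extracted_text.split())
    return len(visible) < min_chars


def pages_to_ocr(page_texts: list[str], min_chars: int = 50) -> list[int]:
    """Return the 1-based page numbers that should be sent to OCR."""
    return [number for number, text in enumerate(page_texts, start=1)
            if needs_ocr(text, min_chars)]


pages = [
    "This page has a normal text layer with plenty of extractable characters in it.",
    "   \n  \n",          # scanned page: the parser found only whitespace
    "Fig. 3",            # mostly an image with a short caption
]
print(pages_to_ocr(pages))
```

A check like this only finds pages that need OCR. It says nothing about how good the OCR output is, which still needs sampling and review.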

Key terms

Question

What is document cracking?


Answer

The process of extracting raw text and metadata from source files (PDFs, Office docs, images). The first stage of an ingestion pipeline. Uses parsers for structured formats and OCR for image-based content.


Question

What are enrichment skills in Azure AI Search?


Answer

Processing steps that add metadata to indexed content during ingestion. Built-in skills include entity extraction, key phrases, and OCR. Custom skills are your own code (deployed as APIs) for domain-specific enrichment.


Question

What is incremental indexing?


Answer

An indexing strategy that only processes new or changed documents instead of re-indexing everything. Reduces cost and time for regularly updated content collections.


Knowledge check


Atlas Financial needs to index 50,000 scanned regulatory documents (image PDFs) for their compliance agent's knowledge base. Many documents contain handwritten annotations. What pipeline configuration is critical?


Kai's logistics platform indexes shipping documents from 15 countries in different languages. The search results need to include document language and key shipping terms as filterable metadata. Which enrichment approach should he use?