Ingestion, Indexing & Grounding Pipelines
RAG applications need data pipelines behind them. Learn how to ingest documents, configure semantic/hybrid/vector search, apply enrichment skills, and connect retrieval pipelines to agent workflows.
The data pipeline behind RAG
If RAG is an open-book exam, the ingestion pipeline is the process of preparing, organising, and indexing all the books before the exam starts.
You take raw documents (PDFs, Word files, images, videos), break them into searchable chunks, add metadata, create embeddings for vector search, and load everything into a search index. When a user asks a question, the search index finds the right chunks in milliseconds.
This module covers the pipeline engineering — the plumbing that makes RAG work.
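To make the "finds the right chunks" step concrete, here is a toy sketch of vector search using cosine similarity. The two-dimensional vectors and chunk names are made up for illustration; real embeddings come from an embedding model and have hundreds or thousands of dimensions, and a search service uses an approximate nearest-neighbour index rather than a full scan.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_chunks(query_vec: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the IDs of the k chunks whose vectors are most similar to the query."""
    ranked = sorted(index, key=lambda cid: cosine(query_vec, index[cid]), reverse=True)
    return ranked[:k]


# Hypothetical chunk IDs with hand-written vectors standing in for embeddings.
toy_index = {
    "refunds": [1.0, 0.0],
    "shipping": [0.0, 1.0],
    "returns": [0.9, 0.1],
}
print(top_chunks([1.0, 0.0], toy_index))
```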
The ingestion pipeline
| Stage | What Happens | Service |
|---|---|---|
| Source | Raw documents in storage (Blob, Data Lake, SharePoint) | Azure Storage |
| Crack | Extract text from files (PDF parsing, OCR for images, audio transcription) | Azure AI Search indexer + Content Understanding |
| Chunk | Split into search-friendly segments | Chunking strategy (fixed, semantic, paragraph) |
| Enrich | Add metadata, extract entities, classify content | Built-in skills or custom skills |
| Embed | Generate vector representations | Embedding model (text-embedding-3-small) |
| Index | Store in searchable index with field mappings | Azure AI Search |
| Serve | Connect to agents, workflows, and applications | Foundry SDK, agent tools |
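The Chunk stage is the easiest one to picture in code. The sketch below shows two of the strategies named in the table, fixed-size and paragraph-based, in plain Python. The function names, default sizes, and the character-based measure are assumptions for illustration; production pipelines usually count tokens and often use a built-in splitting skill instead.

```python
def chunk_fixed(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Fixed-size character windows; consecutive chunks share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def chunk_paragraphs(text: str, max_chars: int = 1000) -> list[str]:
    """Split on blank lines, then pack whole paragraphs into chunks of up to max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Overlap in the fixed-size variant reduces the chance that a sentence split across a boundary is lost to both chunks; the paragraph variant trades uniform size for chunks that respect the author's own breaks.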
Content types and ingestion methods
| Content Type | Cracking Method | Key Considerations |
|---|---|---|
| PDF documents | Built-in PDF parser + OCR for scanned pages | OCR quality depends on scan quality |
| Office documents | Built-in parsers (Word, Excel, PowerPoint) | Tables and charts need special handling |
| Images | OCR for text, Content Understanding for structure | Built-in OCR supports handwritten text; very poor handwriting may need review |
| Audio files | Speech-to-text transcription | Language and accent affect accuracy |
| Video files | Frame extraction + audio transcription | High storage and compute requirements |
| Web pages | HTML parsing, content extraction | Exclude navigation, ads, boilerplate |
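One way to picture the table above is as a routing step that picks a cracking method per file. The sketch below is a hypothetical dispatcher: the method labels and the extension list are placeholders invented for illustration, not names of real service features.

```python
from pathlib import Path

# Hypothetical labels for cracking methods, keyed by file extension.
CRACKERS = {
    ".pdf": "pdf_parser_with_ocr",
    ".docx": "office_parser",
    ".xlsx": "office_parser",
    ".pptx": "office_parser",
    ".png": "image_ocr",
    ".jpg": "image_ocr",
    ".mp3": "speech_to_text",
    ".wav": "speech_to_text",
    ".mp4": "frames_plus_transcription",
    ".html": "html_content_extraction",
}


def pick_cracker(filename: str) -> str:
    """Return the cracking method label for a file, based on its extension."""
    suffix = Path(filename).suffix.lower()
    if suffix not in CRACKERS:
        raise ValueError(f"no cracking method configured for {suffix or 'files without an extension'}")
    return CRACKERS[suffix]


print(pick_cracker("Quarterly-Report.PDF"))
```

Failing loudly on unknown types is a deliberate choice here: silently skipping a file means the agent later answers without it and nobody notices.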
Enrichment skills
| Skill Type | What It Does | Example |
|---|---|---|
| Built-in: Entity extraction | Identifies people, places, organisations | Tag documents with mentioned companies |
| Built-in: Language detection | Identifies document language | Route to correct language model |
| Built-in: Key phrase extraction | Extracts important phrases | Generate topic tags for filtering |
| Built-in: OCR | Reads text from images within documents | Extract text from embedded charts |
| Custom skill | Your own enrichment logic (API) | Industry-specific classification, PII detection |
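Built-in skills are declared in a skillset definition rather than coded. The sketch below builds such a definition as a Python dictionary that could be sent as the JSON body of a skillset request. The skillset name, description, and field paths are assumptions, and the skill type names and input/output names reflect the Azure AI Search skillset schema as I understand it, so check them against the current reference before use. The OCR skill also assumes the indexer is configured to generate normalised images.

```python
import json

skillset = {
    "name": "docs-enrichment",  # hypothetical name
    "description": "OCR, language detection and key phrases",
    "skills": [
        {
            # Reads text from images extracted from each document.
            "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
            "context": "/document/normalized_images/*",
            "inputs": [{"name": "image", "source": "/document/normalized_images/*"}],
            "outputs": [{"name": "text", "targetName": "text"}],
        },
        {
            # Detects the document language so later skills can use it.
            "@odata.type": "#Microsoft.Skills.Text.LanguageDetectionSkill",
            "context": "/document",
            "inputs": [{"name": "text", "source": "/document/content"}],
            "outputs": [{"name": "languageCode", "targetName": "languageCode"}],
        },
        {
            # Extracts key phrases, using the detected language.
            "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
            "context": "/document",
            "inputs": [
                {"name": "text", "source": "/document/content"},
                {"name": "languageCode", "source": "/document/languageCode"},
            ],
            "outputs": [{"name": "keyPhrases", "targetName": "keyPhrases"}],
        },
    ],
}

print(json.dumps(skillset, indent=2))
```

To filter on the outputs, the index would need matching fields (for example a filterable language field and a filterable collection of key phrases) and the indexer would need output field mappings that point at them.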

Built-in and custom skills compare as follows:

| Feature | Built-in Skills | Custom Skills |
|---|---|---|
| Setup | Configure in the indexer skillset | Write code, deploy as API, reference in skillset |
| Maintenance | Managed by Microsoft | You manage the code and infrastructure |
| Capabilities | General-purpose NLP enrichment | Any custom logic you need |
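| Cost | Small free daily allowance; beyond it, AI-backed skills (OCR, entity extraction, key phrases) are billed through an attached Azure AI services resource | Your compute costs + Search pricing |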
| Best for | Standard metadata enrichment | Domain-specific classification, PII, business logic |
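A custom skill is a web API that receives a batch of records and returns one result per record, matched by `recordId`. The sketch below shows only the request-handling logic, using the `values` / `recordId` / `data` / `errors` / `warnings` envelope that custom Web API skills exchange. The input name `text`, the output names `redactedText` and `piiCount`, and the email-only regex are assumptions for illustration; a real PII detector needs far more than one pattern, and the function would be hosted behind an HTTP endpoint.

```python
import re

# Deliberately simple pattern: emails only, as a stand-in for real PII detection.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def handle_skill_request(payload: dict) -> dict:
    """Process a custom-skill batch and return one result per input record."""
    results = []
    for record in payload.get("values", []):
        record_id = record.get("recordId")
        text = record.get("data", {}).get("text")
        if not isinstance(text, str):
            # Report the problem on this record only; other records still succeed.
            results.append({
                "recordId": record_id,
                "data": {},
                "errors": [{"message": "missing 'text' input"}],
                "warnings": [],
            })
            continue
        redacted, count = EMAIL.subn("[REDACTED_EMAIL]", text)
        results.append({
            "recordId": record_id,
            "data": {"redactedText": redacted, "piiCount": count},
            "errors": [],
            "warnings": [],
        })
    return {"values": results}


request = {"values": [{"recordId": "1", "data": {"text": "Contact jane@example.com now"}}]}
print(handle_skill_request(request))
```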
Connecting pipelines to agents
| Connection Method | How It Works | Best For |
|---|---|---|
| Direct index query | Agent tool calls Azure AI Search directly | Full control over search parameters |
| Foundry IQ | Upload to Foundry IQ, auto-indexed | Quick agent setup, managed pipeline |
| Custom retrieval function | Agent calls your function, which queries the index | Complex retrieval logic, multi-index queries |
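The custom retrieval function row is the one most likely to need code. The sketch below shows one possible shape: the agent calls `retrieve`, which queries several sources (for example two indexes, or a keyword and a vector query) and merges the ranked results with Reciprocal Rank Fusion. RRF is swapped in here as a common merging technique; the source does not prescribe one. The searcher callables are placeholders for whatever client actually queries your indexes.

```python
from collections import defaultdict
from typing import Callable

# A searcher takes a query string and returns document IDs, best match first.
SearchFn = Callable[[str], list[str]]


def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each document scores 1 / (k + rank) per list it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken alphabetically so results are deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def retrieve(query: str, searchers: list[SearchFn], top: int = 3) -> list[str]:
    """Query every searcher and return the top fused document IDs."""
    rankings = [search(query) for search in searchers]
    return rrf_merge(rankings)[:top]


# Stand-in searchers with canned results, for demonstration only.
def keyword_search(query: str) -> list[str]:
    return ["a", "b", "c"]


def vector_search(query: str) -> list[str]:
    return ["b", "d", "a"]


print(retrieve("drug interactions", [keyword_search, vector_search]))
```

Because the fusion works on ranks rather than raw scores, it can combine sources whose scores are not comparable, which is the usual obstacle in multi-index queries.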
Real-world example: NeuralMed's medical article pipeline
NeuralMed ingests 10,000+ medical articles for its patient chatbot:
- Source: PubMed articles in Blob Storage (PDF format)
- Crack: PDF parser extracts text + OCR for embedded figures
- Chunk: Paragraph-level chunking (medical context needs larger chunks)
- Enrich:
  - Built-in: entity extraction (drug names, conditions, treatments)
  - Custom skill: medical speciality classifier (cardiology, neurology, etc.)
  - Custom skill: PII detector (redacts patient info from case studies)
- Embed: text-embedding-3-small for vector search
- Index: Azure AI Search with hybrid search (keyword for drug names + vector for symptoms)
- Serve: Connected to patient chatbot agent as a knowledge tool
The pipeline runs weekly to pick up new articles. Incremental indexing processes only new or changed documents, so unchanged articles are not cracked, enriched, or embedded again.
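The idea behind incremental indexing can be sketched with content hashes: compare each document's fingerprint with the one recorded on the previous run and process only the differences. This is an illustration of the concept, not how a managed indexer detects changes internally; the function names and the in-memory `seen` store are assumptions.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Stable fingerprint of a document's bytes."""
    return hashlib.sha256(content).hexdigest()


def plan_incremental_run(
    documents: dict[str, bytes], seen: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Return (names to process, names to delete from the index).

    `documents` maps file name to current content; `seen` maps file name to the
    fingerprint stored after the previous run.
    """
    to_process = sorted(
        name for name, content in documents.items()
        if seen.get(name) != fingerprint(content)
    )
    to_delete = sorted(name for name in seen if name not in documents)
    return to_process, to_delete


previous = {
    "a.pdf": fingerprint(b"v1"),
    "b.pdf": fingerprint(b"old"),
    "gone.pdf": fingerprint(b"x"),
}
current = {"a.pdf": b"v1", "b.pdf": b"new", "c.pdf": b"fresh"}
print(plan_incremental_run(current, previous))
```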
Exam tip: OCR in the RAG pipeline
The exam specifically mentions OCR in RAG ingestion flows. Key points:
- OCR is needed for scanned PDFs, image-based documents, and photos of forms
- OCR quality directly affects RAG quality — garbage in, garbage out
- Azure AI Search’s built-in OCR skill handles common cases
- For high-accuracy OCR (medical, legal), use Content Understanding’s OCR capability
If a question mentions “scanned documents” in a RAG context, the answer almost always involves adding OCR to the ingestion pipeline.
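A simple way to see why scanned documents need OCR: a scanned page has little or no extractable text layer. The heuristic below flags such pages so they can be sent through OCR. The threshold and function name are assumptions, and it presumes that some PDF parser has already produced one text string per page.

```python
def pages_needing_ocr(page_texts: list[str], min_chars: int = 20) -> list[int]:
    """Return zero-based indexes of pages whose extracted text is nearly empty."""
    return [i for i, text in enumerate(page_texts) if len(text.strip()) < min_chars]


pages = [
    "A full page of extracted text goes here.",
    "",        # scanned page: no text layer at all
    "  \n",    # whitespace only
    "p. 3",    # only a page number survived extraction
]
print(pages_needing_ocr(pages))
```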
Knowledge check
Atlas Financial needs to index 50,000 scanned regulatory documents (image PDFs) for their compliance agent's knowledge base. Many documents contain handwritten annotations. What pipeline configuration is critical?
Kai's logistics platform indexes shipping documents from 15 countries in different languages. The search results need to include document language and key shipping terms as filterable metadata. Which enrichment approach should he use?