
AI-103 Study Guide

Domain 1: Plan and Manage an Azure AI Solution

  • Choosing the Right AI Model Free
  • Foundry Services: Your AI Toolkit Free
  • Retrieval, Indexing & Agent Memory
  • Designing AI Infrastructure
  • Deploying Models & CI/CD
  • Quotas, Scaling & Cost
  • Monitoring & Security
  • Responsible AI: Filters, Auditing & Governance

Domain 2: Implement Generative AI and Agentic Solutions

  • Connecting Your App to Foundry Free
  • Building RAG Applications
  • Workflows & Reasoning Pipelines
  • Evaluating AI Models & Apps
  • Agent Fundamentals: Roles, Goals & Tools Free
  • Building Agents with Retrieval & Memory
  • Agent Tools & Knowledge Integration
  • Multi-Agent Orchestration & Safeguards
  • Agent Monitoring & Error Analysis
  • Prompt Engineering & Model Tuning
  • Observability & Production Operations

Domain 3: Implement Computer Vision Solutions

  • Image & Video Generation
  • Multimodal Visual Understanding
  • Responsible AI for Visual Content

Domain 4: Implement Text Analysis Solutions

  • Text Analysis with Language Models
  • Speech, Translation & Voice Agents

Domain 5: Implement Information Extraction Solutions

  • Ingestion, Indexing & Grounding Pipelines
  • Extracting Content with Content Understanding
  • Exam Prep: Putting It All Together

Domain 5: Implement Information Extraction Solutions (~14 min read)

Ingestion, Indexing & Grounding Pipelines

RAG applications need data pipelines behind them. Learn how to ingest documents, configure semantic/hybrid/vector search, apply enrichment skills, and connect retrieval pipelines to agent workflows.

The data pipeline behind RAG

☕ Simple explanation

If RAG is an open-book exam, the ingestion pipeline is the process of preparing, organising, and indexing all the books before the exam starts.

You take raw documents (PDFs, Word files, images, videos), break them into searchable chunks, add metadata, create embeddings for vector search, and load everything into a search index. When a user asks a question, the search index finds the right chunks in milliseconds.

This module covers the pipeline engineering side — the plumbing that makes RAG work.

An ingestion and indexing pipeline for RAG applications involves several stages:

  • Document cracking — extracting text from PDFs, images (OCR), Office documents, and multimedia
  • Chunking — splitting content into search-optimised segments
  • Enrichment — adding metadata, extracting entities, classifying content
  • Embedding — converting chunks into vector representations for semantic search
  • Indexing — loading everything into Azure AI Search with the right field mappings
  • Connection — linking the index to Foundry agents and workflows

This module focuses on pipeline engineering, as distinct from Module 10 (Building RAG Applications), which covered the app-layer RAG pattern.
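The chunking stage listed above can be sketched in a few lines of Python. This is a minimal fixed-size chunker with overlap, not the exact splitter Azure AI Search uses; the chunk size and overlap values are illustrative defaults.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks with overlap, so content that
    straddles a boundary appears in two adjacent chunks."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

Semantic or paragraph-level chunking (as in the NeuralMed example later) replaces the fixed-size slicing with splits at sentence or paragraph boundaries, but the pipeline shape is the same.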

The ingestion pipeline

Stage | What Happens | Service
Source | Raw documents in storage (Blob, Data Lake, SharePoint) | Azure Storage
Crack | Extract text from files (PDF parsing, OCR for images, audio transcription) | Azure AI Search indexer + Content Understanding
Chunk | Split into search-friendly segments | Chunking strategy (fixed, semantic, paragraph)
Enrich | Add metadata, extract entities, classify content | Built-in skills or custom skills
Embed | Generate vector representations | Embedding model (text-embedding-3-small)
Index | Store in searchable index with field mappings | Azure AI Search
Serve | Connect to agents, workflows, and applications | Foundry SDK, agent tools

Content types and ingestion methods

Content Type | Cracking Method | Key Considerations
PDF documents | Built-in PDF parser + OCR for scanned pages | OCR quality depends on scan quality
Office documents | Built-in parsers (Word, Excel, PowerPoint) | Tables and charts need special handling
Images | OCR for text, Content Understanding for structure | Built-in OCR supports handwritten text; very poor handwriting may need review
Audio files | Speech-to-text transcription | Language and accent affect accuracy
Video files | Frame extraction + audio transcription | High storage and compute requirements
Web pages | HTML parsing, content extraction | Exclude navigation, ads, boilerplate

Enrichment skills

Skill Type | What It Does | Example
Built-in: Entity extraction | Identifies people, places, organisations | Tag documents with mentioned companies
Built-in: Language detection | Identifies document language | Route to correct language model
Built-in: Key phrase extraction | Extracts important phrases | Generate topic tags for filtering
Built-in: OCR | Reads text from images within documents | Extract text from embedded charts
Custom skill | Your own enrichment logic (API) | Industry-specific classification, PII detection

Built-in vs custom enrichment skills

Feature | Built-in Skills | Custom Skills
Setup | Configure in the indexer skillset | Write code, deploy as API, reference in skillset
Maintenance | Managed by Microsoft | You manage the code and infrastructure
Capabilities | General-purpose NLP enrichment | Any custom logic you need
Cost | Included in Search pricing | Your compute costs + Search pricing
Best for | Standard metadata enrichment | Domain-specific classification, PII, business logic
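A custom skill is just a web API that accepts and returns Azure AI Search's skill payload shape: a `values` array in, one result record per input record out, matched by `recordId`. The handler below sketches that contract; the classification logic and the `text`/`speciality` field names are placeholder assumptions, and a real deployment would sit behind an Azure Function or container endpoint referenced from the skillset.

```python
def handle_skill_request(payload: dict) -> dict:
    """Process a custom-skill request: return one result record per
    input record, keyed by the same recordId."""
    results = []
    for record in payload.get("values", []):
        text = record["data"].get("text", "")
        # Placeholder enrichment logic -- a real skill would call a
        # classifier, PII detector, or other domain-specific service.
        label = "cardiology" if "heart" in text.lower() else "general"
        results.append({
            "recordId": record["recordId"],
            "data": {"speciality": label},
            "errors": None,
            "warnings": None,
        })
    return {"values": results}
```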

Connecting pipelines to agents

Connection Method | How It Works | Best For
Direct index query | Agent tool calls Azure AI Search directly | Full control over search parameters
Foundry IQ | Upload to Foundry IQ, auto-indexed | Quick agent setup, managed pipeline
Custom retrieval function | Agent calls your function, which queries the index | Complex retrieval logic, multi-index queries
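The custom-retrieval-function option can be sketched as a thin tool wrapper around a search callable. The search function is injected so the same wrapper works against a real Azure AI Search client or a stub; `make_retrieval_tool` and its parameter names are illustrative, not a Foundry SDK signature.

```python
from typing import Callable

def make_retrieval_tool(search: Callable[[str, int], list[dict]]):
    """Wrap a search callable as an agent tool that returns
    formatted grounding snippets for the model's context."""
    def retrieve(query: str, top: int = 3) -> str:
        hits = search(query, top)
        if not hits:
            return "No matching documents found."
        return "\n\n".join(
            f"[{h.get('title', 'untitled')}] {h.get('content', '')}"
            for h in hits
        )
    return retrieve
```

Injecting the search dependency also makes the multi-index case straightforward: the callable can fan out to several indexes and merge results before the tool formats them.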
ℹ️ Real-world example: NeuralMed's medical article pipeline

NeuralMed ingests 10,000+ medical articles for their patient chatbot:

  1. Source: PubMed articles in Blob Storage (PDF format)
  2. Crack: PDF parser extracts text + OCR for embedded figures
  3. Chunk: Paragraph-level chunking (medical context needs larger chunks)
  4. Enrich:
    • Built-in: entity extraction (drug names, conditions, treatments)
    • Custom skill: medical speciality classifier (cardiology, neurology, etc.)
    • Custom skill: PII detector (redacts patient info from case studies)
  5. Embed: text-embedding-3-small for vector search
  6. Index: Azure AI Search with hybrid search (keyword for drug names + vector for symptoms)
  7. Serve: Connected to patient chatbot agent as a knowledge tool

Pipeline runs weekly on new articles. Incremental indexing only processes changed documents.
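The incremental-indexing step can be approximated with content hashing: re-process a document only when its hash differs from the one recorded on the previous run. Azure AI Search indexers track change state natively (for example via blob last-modified metadata); this standalone sketch just illustrates the idea.

```python
import hashlib

def select_changed(docs: dict[str, bytes], seen: dict[str, str]) -> list[str]:
    """Return ids of documents whose content hash differs from the
    hash recorded in `seen`; update `seen` in place for the next run."""
    changed = []
    for doc_id, content in docs.items():
        digest = hashlib.sha256(content).hexdigest()
        if seen.get(doc_id) != digest:
            changed.append(doc_id)
            seen[doc_id] = digest
    return changed
```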

💡 Exam tip: OCR in the RAG pipeline

The exam specifically mentions OCR in RAG ingestion flows. Key points:

  • OCR is needed for scanned PDFs, image-based documents, and photos of forms
  • OCR quality directly affects RAG quality — garbage in, garbage out
  • Azure AI Search’s built-in OCR skill handles common cases
  • For high-accuracy OCR (medical, legal), use Content Understanding’s OCR capability

If a question mentions “scanned documents” in a RAG context, OCR is the answer.

Key terms

Question

What is document cracking?

Answer

The process of extracting raw text and metadata from source files (PDFs, Office docs, images). The first stage of an ingestion pipeline. Uses parsers for structured formats and OCR for image-based content.

Question

What are enrichment skills in Azure AI Search?

Answer

Processing steps that add metadata to indexed content during ingestion. Built-in skills include entity extraction, key phrases, and OCR. Custom skills are your own code (deployed as APIs) for domain-specific enrichment.

Question

What is incremental indexing?

Answer

An indexing strategy that only processes new or changed documents instead of re-indexing everything. Reduces cost and time for regularly updated content collections.

Knowledge check

  1. Atlas Financial needs to index 50,000 scanned regulatory documents (image PDFs) for their compliance agent's knowledge base. Many documents contain handwritten annotations. What pipeline configuration is critical?

  2. Kai's logistics platform indexes shipping documents from 15 countries in different languages. The search results need to include document language and key shipping terms as filterable metadata. Which enrichment approach should he use?

© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.