
AI-300 Study Guide

Domain 1: Design and Implement an MLOps Infrastructure

  • ML Workspace: Your AI Control Room Free
  • Data, Environments & Components
  • Compute Targets: Choosing the Right Engine
  • Infrastructure as Code: Provisioning at Scale
  • Git & CI/CD for ML Projects

Domain 2: Implement Machine Learning Model Lifecycle and Operations

  • MLflow: Track Every Experiment Free
  • AutoML & Hyperparameter Tuning
  • Training Pipelines: Automate Everything
  • Distributed Training: Scale to Big Data
  • Model Registration & Versioning
  • Model Approval & Responsible AI Gates
  • Deploying Models: Endpoints in Production
  • Drift, Monitoring & Retraining

Domain 3: Design and Implement a GenAIOps Infrastructure

  • Foundry: Hubs, Projects & Platform Setup Free
  • Network Security & IaC for Foundry
  • Deploying Foundation Models
  • Model Versioning & Production Strategies
  • PromptOps: Design, Compare, Version & Ship

Domain 4: Implement Generative AI Quality Assurance and Observability

  • Evaluation: Datasets, Metrics & Quality Gates Free
  • Safety Evaluations & Custom Metrics
  • Monitoring GenAI in Production
  • Cost Tracking, Logging & Debugging

Domain 5: Optimize Generative AI Systems and Model Performance

  • RAG Optimization: Better Retrieval, Better Answers Free
  • Embeddings & Hybrid Search
  • Fine-Tuning: Methods, Data & Production

Domain 5: Optimize Generative AI Systems and Model Performance ⏱ ~15 min read

RAG Optimization: Better Retrieval, Better Answers

RAG is only as good as its retrieval. Master chunking strategies, similarity thresholds, retrieval tuning, and A/B testing frameworks to make your GenAI actually answer questions correctly.

AI-300 is a BETA exam. Content may change before general availability (~June-July 2026). This guide is based on the official study guide published by Microsoft. We’ll update as the exam evolves.

Why RAG optimization matters

☕ Simple explanation

RAG is like looking up answers in a textbook before answering a question.

Imagine you’re in an exam and you’re allowed to use a textbook. The quality of your answer depends entirely on whether you find the RIGHT page. If you flip to a random chapter, your answer will be wrong — even if you’re brilliant at writing.

RAG optimization is about making sure you find the right page every time. That means: How big are the sections you search? (Chunking.) How picky are you about matches? (Similarity threshold.) How do you pick the best results? (Retrieval strategy.)

Get retrieval wrong and even GPT-4o gives bad answers. Get it right and even a smaller model shines.

Retrieval-Augmented Generation (RAG) quality is bounded by retrieval quality. The model can only ground its answer in the documents it receives. Key optimization dimensions:

  • Chunking — how source documents are split into searchable segments
  • Similarity threshold — minimum relevance score for retrieved chunks
  • Retrieval strategy — how results are ranked and selected (top-k, MMR, re-ranking)
  • A/B testing — systematically comparing configurations to find the best setup

Each dimension interacts with the others. Small chunks with strict thresholds favour precision; large chunks with lenient thresholds favour recall. The right balance depends on your domain and use case.

Chunking strategies

Chunking is how you split source documents into segments for indexing. It’s the single most impactful tuning parameter for RAG quality.

Chunking strategies comparison
| Strategy | Chunk Size | Overlap | Best For | Risk |
|---|---|---|---|---|
| Fixed-size | 256-512 tokens | 10-20% | General purpose, simple implementation | Splits mid-sentence, breaks context |
| Sentence-based | 1-5 sentences | 1 sentence | FAQ documents, short-form content | May be too small for complex topics |
| Semantic | Variable (by topic) | Topic boundary | Long documents with clear topic shifts | Complex to implement, slower indexing |
| Document structure | Headers/sections | Section overlap | Structured docs (legal, technical) | Requires well-formatted source docs |

Why overlap matters

Chunks overlap to prevent losing information at boundaries. Without overlap, a sentence that spans two chunks gets split — and neither chunk has the full meaning.

Example without overlap:

| Chunk 1 | Chunk 2 |
|---|---|
| “The refund policy allows returns within 30 days.” | “After the 30-day period, only store credit is available for items in original packaging.” |

A question about “store credit conditions” might only match Chunk 2, missing the 30-day context from Chunk 1.

Example with 20% overlap:

| Chunk 1 | Chunk 2 |
|---|---|
| “The refund policy allows returns within 30 days. After the 30-day period, only store credit…” | “After the 30-day period, only store credit is available for items in original packaging.” |

Now both chunks contain the boundary information. The overlap ensures retrieval finds the full context.

💡 Exam tip: Chunk size is the number-one tuning parameter

If a question asks “what should you tune first to improve RAG quality,” the answer is almost always chunk size.

Rules of thumb:

  • Too small (under 100 tokens): loses context, lots of noise in results
  • Too large (over 1000 tokens): dilutes relevance, wastes context window
  • Sweet spot: 256-512 tokens for most use cases
  • Always use overlap: 10-20% prevents information loss at boundaries

The exam may present a scenario where retrieval returns irrelevant results. First check: are the chunks the right size?
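The fixed-size rule of thumb above can be sketched in a few lines. This is an illustration, not a library API: `chunk_tokens` is a made-up helper, and whitespace splitting stands in for a real tokenizer.

```python
def chunk_tokens(tokens, chunk_size=256, overlap_pct=0.2):
    """Split a token list into fixed-size chunks with fractional overlap."""
    overlap = int(chunk_size * overlap_pct)
    step = chunk_size - overlap  # how far the window advances each chunk
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last window reached the end of the document
    return chunks

# Naive whitespace "tokens" stand in for a real tokenizer
tokens = ("The refund policy allows returns within 30 days . "
          "After the 30-day period , only store credit is available .").split()
chunks = chunk_tokens(tokens, chunk_size=10, overlap_pct=0.2)
# Adjacent chunks share 2 tokens (20% of 10), so the boundary
# sentence survives intact in at least one chunk.
```

With `overlap_pct=0.0` the windows tile the document edge to edge, which is exactly the boundary-loss situation the overlap example above warns about.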

Similarity thresholds

After retrieval, each result has a similarity score (0 to 1). The similarity threshold controls the minimum score a chunk must have to be included in the context.

| Threshold | Effect | Risk |
|---|---|---|
| Low (0.5-0.6) | Returns many results, high recall | Includes irrelevant chunks (noise) |
| Medium (0.7-0.8) | Balanced precision and recall | May miss edge-case matches |
| High (0.85+) | Only very relevant results | Misses valid results with different wording |

Finding the right threshold

There’s no universal “correct” threshold. It depends on:

  • Vocabulary consistency: Technical docs with consistent terminology can use higher thresholds
  • Query diversity: If users ask the same question in many ways, lower thresholds catch more variants
  • Consequence of missing: Medical/legal — lower threshold (better to retrieve too much). Casual FAQ — higher threshold (reduce noise)
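Mechanically, a threshold is just a cut applied after scoring. A sketch, assuming each retrieved hit arrives as a (chunk, score) pair (the names here are illustrative, not an SDK API):

```python
def filter_by_threshold(hits, threshold=0.75):
    """Keep only hits whose similarity score clears the threshold,
    highest-scoring first."""
    kept = [(chunk, score) for chunk, score in hits if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

hits = [("refund policy", 0.91), ("shipping times", 0.58), ("store credit", 0.77)]
print(filter_by_threshold(hits, threshold=0.75))
# Lowering the threshold to 0.5 would also admit "shipping times":
# higher recall, but more noise for the model to sift through.
```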

Retrieval strategies

Top-K retrieval

The simplest strategy: return the K most similar chunks.

  • Advantage: fast, predictable
  • Disadvantage: results may be redundant (top 5 chunks all say the same thing)

Maximum Marginal Relevance (MMR)

MMR balances relevance with diversity. After finding the top match, it penalises subsequent results that are too similar to already-selected results.

  • Advantage: diverse context — covers more aspects of the question
  • Disadvantage: may include a slightly less relevant chunk for diversity
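A minimal, pure-Python sketch of the MMR idea (illustrative only; production systems run this over a vector index, and `lambda_param` is the conventional knob weighting relevance against diversity):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, candidates, k=3, lambda_param=0.7):
    """Greedily pick k candidate indices, trading relevance to the
    query against similarity to chunks already selected."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, candidates[i])
            redundancy = max((cosine(candidates[i], candidates[j])
                              for j in selected), default=0.0)
            return lambda_param * relevance - (1 - lambda_param) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lambda_param=1.0` this degenerates to plain top-k; lower values push harder for diversity, which is exactly the trade-off in the bullet above.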

Re-ranking

A two-stage approach: first retrieve a larger set (top 20-50), then use a separate re-ranking model to score and re-order them, returning the best K.

  • Advantage: more accurate ranking than vector similarity alone
  • Disadvantage: adds latency (extra model call)
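The two-stage shape looks like this in outline. `vector_search` and `rerank_score` are hypothetical stand-ins for a real vector index and a real cross-encoder model, not actual APIs:

```python
def two_stage_retrieve(query, corpus, vector_search, rerank_score,
                       candidate_pool=20, final_k=5):
    """Stage 1: cheap vector search over the whole corpus.
    Stage 2: an expensive re-ranker scores only the candidate pool."""
    candidates = vector_search(query, corpus, top_k=candidate_pool)
    rescored = [(chunk, rerank_score(query, chunk)) for chunk in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in rescored[:final_k]]
```

The pool size (the 20-50 mentioned above) is itself a tunable: too small and the re-ranker never sees the right chunk; too large and latency climbs.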

Retrieval strategies comparison

| Strategy | Speed | Result Quality | Diversity | Best For |
|---|---|---|---|---|
| Top-K | Fastest | Good | Low (may be redundant) | Simple use cases, low-latency requirements |
| MMR | Fast | Good | High (balances relevance + diversity) | Complex questions requiring broad context |
| Re-ranking | Slower (extra model call) | Best | Depends on re-ranker | High-accuracy requirements (legal, medical) |

A/B testing retrieval configurations

Don’t guess which configuration is best — test it systematically.

from azure.ai.evaluation import evaluate, RelevanceEvaluator

# Model used to judge relevance (assumes an Azure OpenAI deployment;
# fill in your own endpoint, deployment, and key)
model_config = {
    "azure_endpoint": "<your-endpoint>",
    "azure_deployment": "<your-deployment>",
    "api_key": "<your-api-key>",
}

# Configuration A: small chunks, strict threshold
config_a = {
    "chunk_size": 256,
    "overlap": 50,
    "threshold": 0.8,
    "top_k": 5,
}

# Configuration B: larger chunks, relaxed threshold
config_b = {
    "chunk_size": 512,
    "overlap": 100,
    "threshold": 0.7,
    "top_k": 3,
}

# Each dataset holds the responses produced by an index built with the
# corresponding configuration above
results_a = evaluate(
    data="eval_dataset_config_a.jsonl",
    evaluators={"relevance": RelevanceEvaluator(model_config=model_config)},
)

results_b = evaluate(
    data="eval_dataset_config_b.jsonl",
    evaluators={"relevance": RelevanceEvaluator(model_config=model_config)},
)

print(f"Config A relevance: {results_a['metrics']['relevance']}")
print(f"Config B relevance: {results_b['metrics']['relevance']}")

What’s happening:

  • Two retrieval configurations are defined for comparison; each drives how its index is chunked and queried
  • The same evaluation questions run through both configurations, producing one results dataset per configuration
  • The relevance scores are compared — whichever configuration produces higher relevance is better for your use case

What to A/B test

| Parameter | Test Range | Impact |
|---|---|---|
| Chunk size | 128, 256, 512, 1024 tokens | Biggest impact on quality |
| Overlap | 0%, 10%, 20%, 30% | Prevents boundary information loss |
| Similarity threshold | 0.6, 0.7, 0.8, 0.85 | Precision vs recall trade-off |
| Top-K | 3, 5, 10, 20 | More context vs noise |
| Retrieval strategy | Top-K vs MMR vs re-ranking | Quality vs latency |
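Sweeping that grid can be automated with a plain loop; here `evaluate_config` is a hypothetical callback that builds an index with the given settings and returns a quality score:

```python
from itertools import product

def sweep(evaluate_config):
    """Try every combination of the tuning grid and return the best."""
    grid = {
        "chunk_size": [128, 256, 512, 1024],
        "overlap_pct": [0.0, 0.1, 0.2, 0.3],
        "threshold": [0.6, 0.7, 0.8, 0.85],
        "top_k": [3, 5, 10, 20],
    }
    best_score, best_config = float("-inf"), None
    for values in product(*grid.values()):
        config = dict(zip(grid.keys(), values))
        score = evaluate_config(config)  # e.g. mean relevance on the eval set
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score
```

A full grid here is 256 runs, each requiring re-indexing; in practice you would sweep chunk size first (biggest impact, per the table above) and hold the other parameters fixed.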

Azure AI Search integration for RAG

Azure AI Search is the primary retrieval backend for RAG in Azure. Here’s how to configure an index with optimised chunking:

from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    HnswParameters,
    VectorSearchProfile,
    SemanticConfiguration,
    SemanticField,
    SemanticPrioritizedFields,
    SemanticSearch,
)

# Configure vector search with the HNSW algorithm
vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(
            name="hnsw-config",
            parameters=HnswParameters(
                m=4,                  # bi-directional link count
                ef_construction=400,  # index build quality
                ef_search=500,        # search-time quality
                metric="cosine",
            ),
        ),
    ],
    profiles=[
        VectorSearchProfile(
            name="vector-profile",
            algorithm_configuration_name="hnsw-config",
        ),
    ],
)

# Configure semantic search (for hybrid search)
semantic_search = SemanticSearch(
    configurations=[
        SemanticConfiguration(
            name="semantic-config",
            prioritized_fields=SemanticPrioritizedFields(
                content_fields=[SemanticField(field_name="content")],
            ),
        ),
    ],
)

What’s happening:

  • The HNSW (Hierarchical Navigable Small World) algorithm configuration controls how vectors are indexed and searched
  • The HNSW parameters tune the quality-speed trade-off: a higher search-time ef value gives better results but slower queries
  • The semantic search configuration enables hybrid search (vector + keyword combined)

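Hybrid search has to merge the keyword and vector result lists into one ranking; Azure AI Search does this with Reciprocal Rank Fusion (RRF). A self-contained sketch of the RRF idea (k=60 is the commonly cited constant; the document IDs are made up):

```python
def rrf_fuse(ranked_lists, k=60):
    """Fuse several ranked result lists: each document earns
    1 / (k + rank) from every list it appears in."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc3", "doc1", "doc7"]  # BM25 keyword ranking
vector_hits = ["doc1", "doc5", "doc3"]   # vector similarity ranking
print(rrf_fuse([keyword_hits, vector_hits]))
# doc1 and doc3 appear in both lists, so they outrank single-list hits.
```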
Scenario: Zara optimises Atlas's legal document RAG

Atlas Consulting’s legal document chatbot is returning irrelevant results. Zara investigates:

Problem: Chunks are 1024 tokens (too large). A single chunk contains three different legal clauses. When a user asks about “termination clause,” the retrieval returns a chunk that mentions termination but is mostly about payment terms.

Zara’s A/B test:

| Configuration | Chunk Size | Overlap | Threshold | Top-K | Relevance Score |
|---|---|---|---|---|---|
| Current | 1024 tokens | 0% | 0.6 | 5 | 3.2 |
| Option A | 256 tokens | 20% | 0.75 | 5 | 4.1 |
| Option B | 512 tokens | 15% | 0.7 | 3 | 4.4 |

Winner: Option B. The 512-token chunks keep enough legal context per clause without mixing unrelated clauses. The 15% overlap prevents splitting cross-referenced provisions. Fewer results (top 3) with a moderate threshold reduce noise while keeping coverage.

Marcus Webb approves the configuration change. Relevance jumps from 3.2 to 4.4 — a 38% improvement.

Key terms flashcards

Question

What is chunk overlap and why does it matter?


Answer

Overlap is the percentage of content shared between adjacent chunks (typically 10-20%). It prevents information loss at chunk boundaries — sentences that span two chunks are captured in both, ensuring retrieval finds the full context.


Question

What is Maximum Marginal Relevance (MMR)?


Answer

A retrieval strategy that balances relevance with diversity. After selecting the most relevant chunk, it penalises subsequent results that are too similar to already-selected ones. Prevents redundant context where top-5 results all say the same thing.


Question

What is re-ranking in RAG?


Answer

A two-stage retrieval approach: first retrieve a large set (top 20-50) using fast vector search, then use a separate re-ranking model to score and reorder them. Returns better results than vector similarity alone, but adds latency.


Question

What should you A/B test first in RAG optimization?


Answer

Chunk size — it has the biggest impact on RAG quality. Test 128, 256, 512, and 1024 tokens. Then test overlap (0-30%), similarity threshold (0.6-0.85), top-K (3-20), and retrieval strategy.


Knowledge check

1. Atlas Consulting's chatbot retrieves 5 chunks for every query, but users complain that answers repeat the same information from different angles instead of covering all aspects of their question. Which retrieval strategy should Zara switch to?

2. Dr. Luca's genomics RAG system frequently splits gene descriptions across two chunks, causing incomplete retrieval. The chunks are 256 tokens with 0% overlap. What should he try first?

🎬 Video coming soon


Next up: Embeddings & Hybrid Search — because vector search alone isn’t enough.


© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.