RAG Optimization: Better Retrieval, Better Answers
RAG is only as good as its retrieval. Master chunking strategies, similarity thresholds, retrieval tuning, and A/B testing frameworks to make your GenAI actually answer questions correctly.
AI-300 is a BETA exam. Content may change before general availability (~June-July 2026). This guide is based on the official study guide published by Microsoft. We’ll update as the exam evolves.
Why RAG optimization matters
RAG is like looking up answers in a textbook before answering a question.
Imagine you’re in an exam and you’re allowed to use a textbook. The quality of your answer depends entirely on whether you find the RIGHT page. If you flip to a random chapter, your answer will be wrong — even if you’re brilliant at writing.
RAG optimization is about making sure you find the right page every time. That means: How big are the sections you search? (Chunking.) How picky are you about matches? (Similarity threshold.) How do you pick the best results? (Retrieval strategy.)
Get retrieval wrong and even GPT-4o gives bad answers. Get it right and even a smaller model shines.
Chunking strategies
Chunking is how you split source documents into segments for indexing. It’s the single most impactful tuning parameter for RAG quality.
| Feature | Chunk Size | Overlap | Best For | Risk |
|---|---|---|---|---|
| Fixed-size | 256-512 tokens | 10-20% overlap | General purpose, simple implementation | Splits mid-sentence, breaks context |
| Sentence-based | 1-5 sentences | 1 sentence overlap | FAQ documents, short-form content | May be too small for complex topics |
| Semantic | Variable (by topic) | Topic boundary overlap | Long documents with clear topic shifts | Complex to implement, slower indexing |
| Document structure | Headers/sections | Section overlap | Structured docs (legal, technical) | Requires well-formatted source docs |
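Fixed-size chunking with overlap can be sketched in a few lines. This is a minimal illustration: sizes here are whitespace-split words as a stand-in for tokens, and a production pipeline would count real tokens with the model's tokenizer.

```python
def chunk_fixed(text: str, chunk_size: int = 256, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks, each overlapping the previous one.

    Sizes are in words as a stand-in for tokens; swap in a real
    tokenizer (e.g. tiktoken) before using this in production.
    """
    words = text.split()
    step = chunk_size - overlap  # how far the window advances per chunk
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already covers the tail of the document
    return chunks

docs = chunk_fixed("word " * 600, chunk_size=256, overlap=50)
print(len(docs))  # → 3 (windows of 256 words, advancing 206 words each step)
```

Note how `step = chunk_size - overlap` is what produces the boundary overlap: each chunk repeats the last `overlap` words of the previous one.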
Why overlap matters
Chunks overlap to prevent losing information at boundaries. Without overlap, a sentence that spans two chunks gets split — and neither chunk has the full meaning.
Example without overlap:
| Chunk 1 | Chunk 2 |
|---|---|
| "The refund policy allows returns within 30 days." | "After the 30-day period, only store credit is available for items in original packaging." |
A question about “store credit conditions” might only match Chunk 2, missing the 30-day context from Chunk 1.
Example with 20% overlap:
| Chunk 1 | Chunk 2 |
|---|---|
| "The refund policy allows returns within 30 days. After the 30-day period, only store credit…" | "After the 30-day period, only store credit is available for items in original packaging." |
Now both chunks contain the boundary information. The overlap ensures retrieval finds the full context.
Exam tip: Chunk size is the number-one tuning parameter
If a question asks “what should you tune first to improve RAG quality,” the answer is almost always chunk size.
Rules of thumb:
- Too small (under 100 tokens): loses context, lots of noise in results
- Too large (over 1000 tokens): dilutes relevance, wastes context window
- Sweet spot: 256-512 tokens for most use cases
- Always use overlap: 10-20% prevents information loss at boundaries
The exam may present a scenario where retrieval returns irrelevant results. First check: are the chunks the right size?
Similarity thresholds
After retrieval, each result has a similarity score (0 to 1). The similarity threshold controls the minimum score a chunk must have to be included in the context.
| Threshold | Effect | Risk |
|---|---|---|
| Low (0.5-0.6) | Returns many results, high recall | Includes irrelevant chunks (noise) |
| Medium (0.7-0.8) | Balanced precision and recall | May miss edge-case matches |
| High (0.85+) | Only very relevant results | Misses valid results with different wording |
Finding the right threshold
There’s no universal “correct” threshold. It depends on:
- Vocabulary consistency: Technical docs with consistent terminology can use higher thresholds
- Query diversity: If users ask the same question in many ways, lower thresholds catch more variants
- Consequence of missing: Medical/legal — lower threshold (better to retrieve too much). Casual FAQ — higher threshold (reduce noise)
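The threshold itself is just a post-retrieval filter over scored results. A minimal sketch (the function name and the `(chunk, score)` result shape are illustrative, not a specific library's API):

```python
def filter_by_threshold(results: list[tuple[str, float]],
                        threshold: float = 0.7) -> list[str]:
    """Keep only chunks whose similarity score meets the threshold."""
    return [chunk for chunk, score in results if score >= threshold]

scored = [
    ("The refund policy allows returns within 30 days.", 0.91),
    ("After 30 days, only store credit is available.", 0.72),
    ("Our stores are open 9am to 6pm.", 0.55),
]
# With threshold 0.7, the 0.55 match is dropped as noise
print(filter_by_threshold(scored, threshold=0.7))
```

Lowering the threshold to 0.5 would let the store-hours chunk through: more recall, more noise.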
Retrieval strategies
Top-K retrieval
The simplest strategy: return the K most similar chunks.
- Advantage: fast, predictable
- Disadvantage: results may be redundant (top 5 chunks all say the same thing)
Maximum Marginal Relevance (MMR)
MMR balances relevance with diversity. After finding the top match, it penalises subsequent results that are too similar to already-selected results.
- Advantage: diverse context — covers more aspects of the question
- Disadvantage: may include a slightly less relevant chunk for diversity
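A minimal MMR sketch over raw embedding vectors. The toy vectors and the `lam` weight are illustrative; here `lam=0.5` weights relevance and diversity equally.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def mmr(query_vec: list[float], doc_vecs: list[list[float]],
        k: int = 2, lam: float = 0.5) -> list[int]:
    """Greedily select k documents, trading off query relevance (weight lam)
    against similarity to documents already selected (weight 1 - lam)."""
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best, best_score = candidates[0], -float("inf")
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                             default=0.0)
            score = lam * relevance - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected

# Docs 0 and 1 are near-duplicates; doc 2 is distinct but still relevant.
# MMR picks doc 0 first, then skips the redundant doc 1 in favour of doc 2.
docs = [[1.0, 0.9], [1.0, 0.85], [0.2, 1.0]]
print(mmr([1.0, 1.0], docs, k=2, lam=0.5))  # → [0, 2]
```

With `lam=1.0` the redundancy penalty vanishes and this degenerates to plain Top-K by relevance, which would pick the near-duplicate pair `[0, 1]`.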
Re-ranking
A two-stage approach: first retrieve a larger set (top 20-50), then use a separate re-ranking model to score and re-order them, returning the best K.
- Advantage: more accurate ranking than vector similarity alone
- Disadvantage: adds latency (extra model call)
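The two-stage flow can be sketched as below, with the re-ranking model stubbed out as a simple word-overlap scorer (purely illustrative; a real system would call a cross-encoder or a hosted re-ranker here):

```python
def rerank(query: str, chunks: list[str], first_stage_scores: list[float],
           rerank_score, retrieve_n: int = 20, final_k: int = 5) -> list[str]:
    """Stage 1: keep the retrieve_n best chunks by vector similarity.
    Stage 2: re-score that candidate set with a more expensive model
    and return the final_k best."""
    ranked = sorted(zip(chunks, first_stage_scores), key=lambda p: p[1], reverse=True)
    candidates = [chunk for chunk, _ in ranked[:retrieve_n]]
    rescored = sorted(candidates, key=lambda c: rerank_score(query, c), reverse=True)
    return rescored[:final_k]

# Stub scorer: counts words shared between query and chunk.
# Stands in for a cross-encoder relevance model.
def stub_score(query: str, chunk: str) -> float:
    return len(set(query.lower().split()) & set(chunk.lower().split()))

chunks = ["refund policy details", "store credit rules", "refund and store credit policy"]
scores = [0.80, 0.79, 0.78]  # vector similarity ranked the best chunk last
print(rerank("refund store credit", chunks, scores, stub_score,
             retrieve_n=3, final_k=1))  # → ['refund and store credit policy']
```

The point of the pattern: the cheap first stage narrows the field, so the expensive scorer only runs on a few dozen candidates instead of the whole index.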
| Feature | Speed | Result Quality | Diversity | Best For |
|---|---|---|---|---|
| Top-K | Fastest | Good | Low (may be redundant) | Simple use cases, low-latency requirements |
| MMR | Fast | Good | High (balances relevance + diversity) | Complex questions requiring broad context |
| Re-ranking | Slower (extra model call) | Best | Depends on re-ranker | High-accuracy requirements (legal, medical) |
A/B testing retrieval configurations
Don’t guess which configuration is best — test it systematically.
```python
from azure.ai.evaluation import evaluate, RelevanceEvaluator

# Judge-model configuration for the evaluator (placeholder values)
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "azure_deployment": "gpt-4o",
    "api_key": "<your-api-key>",
}

# Configuration A: small chunks, strict threshold
config_a = {
    "chunk_size": 256,
    "overlap": 50,
    "threshold": 0.8,
    "top_k": 5,
}

# Configuration B: larger chunks, relaxed threshold
config_b = {
    "chunk_size": 512,
    "overlap": 100,
    "threshold": 0.7,
    "top_k": 3,
}

# Run evaluation on both configurations. Each JSONL file contains the
# responses generated for the same evaluation questions with one
# configuration applied.
results_a = evaluate(
    data="eval_dataset_config_a.jsonl",
    evaluators={"relevance": RelevanceEvaluator(model_config=model_config)},
)
results_b = evaluate(
    data="eval_dataset_config_b.jsonl",
    evaluators={"relevance": RelevanceEvaluator(model_config=model_config)},
)

print(f"Config A relevance: {results_a['metrics']['relevance']}")
print(f"Config B relevance: {results_b['metrics']['relevance']}")
```
What’s happening:
- Two retrieval configurations are defined so they can be compared side by side
- The same evaluation questions are run through both configurations; each JSONL file holds the responses generated under one configuration
- The relevance scores are compared: whichever configuration produces higher relevance is better for your use case
What to A/B test
| Parameter | Test Range | Impact |
|---|---|---|
| Chunk size | 128, 256, 512, 1024 tokens | Biggest impact on quality |
| Overlap | 0%, 10%, 20%, 30% | Prevents boundary information loss |
| Similarity threshold | 0.6, 0.7, 0.8, 0.85 | Precision vs recall trade-off |
| Top-K | 3, 5, 10, 20 | More context vs noise |
| Retrieval strategy | Top-K vs MMR vs re-ranking | Quality vs latency |
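Sweeping this grid can be sketched with `itertools.product`. Here `evaluate_config` is a hypothetical stand-in for "build an index with this configuration and score it against the evaluation dataset" (for example, via an azure-ai-evaluation run):

```python
from itertools import product

def evaluate_config(config: dict) -> float:
    """Hypothetical: index with this config, run the eval dataset,
    return the relevance score."""
    ...

grid = {
    "chunk_size": [128, 256, 512, 1024],
    "overlap_pct": [0, 10, 20, 30],
    "threshold": [0.6, 0.7, 0.8, 0.85],
    "top_k": [3, 5, 10, 20],
}

# One dict per combination of parameter values
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # → 256
```

A full sweep here means 256 index builds and evaluation runs, so in practice you would fix most parameters and vary one or two at a time, starting with chunk size.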
Azure AI Search integration for RAG
Azure AI Search is the primary retrieval backend for RAG in Azure. Here’s how to configure an index with optimised chunking:
```python
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    HnswParameters,
    VectorSearchProfile,
    SemanticConfiguration,
    SemanticField,
    SemanticPrioritizedFields,
    SemanticSearch,
)

# Configure vector search with the HNSW algorithm
vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(
            name="hnsw-config",
            parameters=HnswParameters(
                m=4,                  # bi-directional link count
                ef_construction=400,  # index build quality
                ef_search=500,        # search quality
                metric="cosine",
            ),
        ),
    ],
    profiles=[
        VectorSearchProfile(
            name="vector-profile",
            algorithm_configuration_name="hnsw-config",
        ),
    ],
)

# Configure semantic search (for hybrid search)
semantic_search = SemanticSearch(
    configurations=[
        SemanticConfiguration(
            name="semantic-config",
            prioritized_fields=SemanticPrioritizedFields(
                content_fields=[SemanticField(field_name="content")],
            ),
        ),
    ],
)
```
What’s happening:
- The HNSW (Hierarchical Navigable Small World) algorithm configuration controls how vectors are indexed and searched
- The HNSW parameters tune the quality-speed trade-off: a higher efSearch (`ef_search` in the Python SDK) gives better results but slower queries
- The semantic search configuration enables hybrid search (vector + keyword combined)
Scenario: Zara optimises Atlas's legal document RAG
Atlas Consulting’s legal document chatbot is returning irrelevant results. Zara investigates:
Problem: Chunks are 1024 tokens (too large). A single chunk contains three different legal clauses. When a user asks about “termination clause,” the retrieval returns a chunk that mentions termination but is mostly about payment terms.
Zara’s A/B test:
| Configuration | Chunk Size | Overlap | Threshold | Top-K | Relevance Score |
|---|---|---|---|---|---|
| Current | 1024 tokens | 0% | 0.6 | 5 | 3.2 |
| Option A | 256 tokens | 20% | 0.75 | 5 | 4.1 |
| Option B | 512 tokens | 15% | 0.7 | 3 | 4.4 |
Winner: Option B. The 512-token chunks keep enough legal context per clause without mixing unrelated clauses. The 15% overlap prevents splitting cross-referenced provisions. Fewer results (top 3) with a moderate threshold reduce noise while keeping coverage.
Marcus Webb approves the configuration change. Relevance jumps from 3.2 to 4.4 — a 38% improvement.
Knowledge check
Atlas Consulting's chatbot retrieves 5 chunks for every query, but users complain that answers repeat the same information from different angles instead of covering all aspects of their question. Which retrieval strategy should Zara switch to?
Dr. Luca's genomics RAG system frequently splits gene descriptions across two chunks, causing incomplete retrieval. The chunks are 256 tokens with 0% overlap. What should he try first?
Next up: Embeddings & Hybrid Search — because vector search alone isn’t enough.