RAG Optimization: Better Retrieval, Better Answers
RAG is only as good as its retrieval. Master chunking strategies, similarity thresholds, retrieval tuning, and A/B testing frameworks to make your GenAI actually answer questions correctly.
AI-300 is a BETA exam. Content may change before general availability (~June-July 2026). This guide is based on the official study guide published by Microsoft. We’ll update as the exam evolves.
Why RAG optimization matters
RAG is like looking up answers in a textbook before answering a question.
Imagine you’re in an exam and you’re allowed to use a textbook. The quality of your answer depends entirely on whether you find the RIGHT page. If you flip to a random chapter, your answer will be wrong — even if you’re brilliant at writing.
RAG optimization is about making sure you find the right page every time. That means: How big are the sections you search? (Chunking.) How picky are you about matches? (Similarity threshold.) How do you pick the best results? (Retrieval strategy.)
Get retrieval wrong and even GPT-4o gives bad answers. Get it right and even a smaller model shines.
Chunking strategies
Chunking is how you split source documents into segments for indexing. It’s the single most impactful tuning parameter for RAG quality.
| Feature | Chunk Size | Overlap | Best For | Risk |
|---|---|---|---|---|
| Fixed-size | 256-512 tokens | 10-20% overlap | General purpose, simple implementation | Splits mid-sentence, breaks context |
| Sentence-based | 1-5 sentences | 1 sentence overlap | FAQ documents, short-form content | May be too small for complex topics |
| Semantic | Variable (by topic) | Topic boundary overlap | Long documents with clear topic shifts | Complex to implement, slower indexing |
| Document structure | Headers/sections | Section overlap | Structured docs (legal, technical) | Requires well-formatted source docs |
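Fixed-size chunking with overlap can be sketched in a few lines. This is a minimal illustration: sizes here are whitespace-split words as a stand-in for tokens, and a production pipeline would count real tokens with the model's tokenizer.

```python
def chunk_fixed(text: str, chunk_size: int = 256, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks, each overlapping the previous one.

    Sizes are in words as a stand-in for tokens; swap in a real
    tokenizer (e.g. tiktoken) before using this in production.
    """
    words = text.split()
    step = chunk_size - overlap  # how far the window advances per chunk
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already covers the tail of the document
    return chunks

docs = chunk_fixed("word " * 600, chunk_size=256, overlap=50)
print(len(docs))  # → 3 (windows of 256 words, advancing 206 words each step)
```

Note how `step = chunk_size - overlap` is what produces the boundary overlap: each chunk repeats the last `overlap` words of the previous one.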
Why overlap matters
Chunks overlap to prevent losing information at boundaries. Without overlap, a sentence that spans two chunks gets split — and neither chunk has the full meaning.
Example without overlap:
| Chunk 1 | Chunk 2 |
|---|---|
| "The refund policy allows returns within 30 days." | "After the 30-day period, only store credit is available for items in original packaging." |
A question about “store credit conditions” might only match Chunk 2, missing the 30-day context from Chunk 1.
Example with 20% overlap:
| Chunk 1 | Chunk 2 |
|---|---|
| "The refund policy allows returns within 30 days. After the 30-day period, only store credit…" | "After the 30-day period, only store credit is available for items in original packaging." |
Now both chunks contain the boundary information. The overlap ensures retrieval finds the full context.
Exam tip: Chunk size is the number-one tuning parameter
If a question asks “what should you tune first to improve RAG quality,” the answer is almost always chunk size.
Rules of thumb:
- Too small (under 100 tokens): loses context, lots of noise in results
- Too large (over 1000 tokens): dilutes relevance, wastes context window
- Sweet spot: 256-512 tokens for most use cases
- Always use overlap: 10-20% prevents information loss at boundaries
The exam may present a scenario where retrieval returns irrelevant results. First check: are the chunks the right size?
Similarity thresholds
After retrieval, each result has a similarity score (0 to 1). The similarity threshold controls the minimum score a chunk must have to be included in the context.
| Threshold | Effect | Risk |
|---|---|---|
| Low (0.5-0.6) | Returns many results, high recall | Includes irrelevant chunks (noise) |
| Medium (0.7-0.8) | Balanced precision and recall | May miss edge-case matches |
| High (0.85+) | Only very relevant results | Misses valid results with different wording |
Finding the right threshold
There’s no universal “correct” threshold. It depends on:
- Vocabulary consistency: Technical docs with consistent terminology can use higher thresholds
- Query diversity: If users ask the same question in many ways, lower thresholds catch more variants
- Consequence of missing: Medical/legal — lower threshold (better to retrieve too much). Casual FAQ — higher threshold (reduce noise)
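The threshold itself is just a post-retrieval filter over scored results. A minimal sketch (the function name and the `(chunk, score)` result shape are illustrative, not a specific library's API):

```python
def filter_by_threshold(results: list[tuple[str, float]],
                        threshold: float = 0.7) -> list[str]:
    """Keep only chunks whose similarity score meets the threshold."""
    return [chunk for chunk, score in results if score >= threshold]

scored = [
    ("The refund policy allows returns within 30 days.", 0.91),
    ("After 30 days, only store credit is available.", 0.72),
    ("Our stores are open 9am to 6pm.", 0.55),
]
# With threshold 0.7, the 0.55 match is dropped as noise
print(filter_by_threshold(scored, threshold=0.7))
```

Lowering the threshold to 0.5 would let the store-hours chunk through: more recall, more noise.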
Retrieval strategies
Top-K retrieval
The simplest strategy: return the K most similar chunks.
- Advantage: fast, predictable
- Disadvantage: results may be redundant (top 5 chunks all say the same thing)
Maximum Marginal Relevance (MMR)
MMR balances relevance with diversity. After finding the top match, it penalises subsequent results that are too similar to already-selected results.
- Advantage: diverse context — covers more aspects of the question
- Disadvantage: may include a slightly less relevant chunk for diversity
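A minimal MMR sketch over raw embedding vectors. The toy vectors and the `lam` weight are illustrative; here `lam=0.5` weights relevance and diversity equally.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def mmr(query_vec: list[float], doc_vecs: list[list[float]],
        k: int = 2, lam: float = 0.5) -> list[int]:
    """Greedily select k documents, trading off query relevance (weight lam)
    against similarity to documents already selected (weight 1 - lam)."""
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best, best_score = candidates[0], -float("inf")
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                             default=0.0)
            score = lam * relevance - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected

# Docs 0 and 1 are near-duplicates; doc 2 is distinct but still relevant.
# MMR picks doc 0 first, then skips the redundant doc 1 in favour of doc 2.
docs = [[1.0, 0.9], [1.0, 0.85], [0.2, 1.0]]
print(mmr([1.0, 1.0], docs, k=2, lam=0.5))  # → [0, 2]
```

With `lam=1.0` the redundancy penalty vanishes and this degenerates to plain Top-K by relevance, which would pick the near-duplicate pair `[0, 1]`.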
Re-ranking
A two-stage approach: first retrieve a larger set (top 20-50), then use a separate re-ranking model to score and re-order them, returning the best K.
- Advantage: more accurate ranking than vector similarity alone
- Disadvantage: adds latency (extra model call)
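The two-stage flow can be sketched as below, with the re-ranking model stubbed out as a simple word-overlap scorer (purely illustrative; a real system would call a cross-encoder or a hosted re-ranker here):

```python
def rerank(query: str, chunks: list[str], first_stage_scores: list[float],
           rerank_score, retrieve_n: int = 20, final_k: int = 5) -> list[str]:
    """Stage 1: keep the retrieve_n best chunks by vector similarity.
    Stage 2: re-score that candidate set with a more expensive model
    and return the final_k best."""
    ranked = sorted(zip(chunks, first_stage_scores), key=lambda p: p[1], reverse=True)
    candidates = [chunk for chunk, _ in ranked[:retrieve_n]]
    rescored = sorted(candidates, key=lambda c: rerank_score(query, c), reverse=True)
    return rescored[:final_k]

# Stub scorer: counts words shared between query and chunk.
# Stands in for a cross-encoder relevance model.
def stub_score(query: str, chunk: str) -> float:
    return len(set(query.lower().split()) & set(chunk.lower().split()))

chunks = ["refund policy details", "store credit rules", "refund and store credit policy"]
scores = [0.80, 0.79, 0.78]  # vector similarity ranked the best chunk last
print(rerank("refund store credit", chunks, scores, stub_score,
             retrieve_n=3, final_k=1))  # → ['refund and store credit policy']
```

The point of the pattern: the cheap first stage narrows the field, so the expensive scorer only runs on a few dozen candidates instead of the whole index.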
| Feature | Speed | Result Quality | Diversity | Best For |
|---|---|---|---|---|
| Top-K | Fastest | Good | Low (may be redundant) | Simple use cases, low-latency requirements |
| MMR | Fast | Good | High (balances relevance + diversity) | Complex questions requiring broad context |
| Re-ranking | Slower (extra model call) | Best | Depends on re-ranker | High-accuracy requirements (legal, medical) |
A/B testing retrieval configurations
Don’t guess which configuration is best — test it systematically.
```python
from azure.ai.evaluation import evaluate, RelevanceEvaluator

# Judge-model configuration for the evaluator (placeholder values)
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "azure_deployment": "gpt-4o",
    "api_key": "<your-api-key>",
}

# Configuration A: small chunks, strict threshold
config_a = {
    "chunk_size": 256,
    "overlap": 50,
    "threshold": 0.8,
    "top_k": 5,
}

# Configuration B: larger chunks, relaxed threshold
config_b = {
    "chunk_size": 512,
    "overlap": 100,
    "threshold": 0.7,
    "top_k": 3,
}

# Run evaluation on both configurations. Each JSONL file contains the
# responses generated for the same evaluation questions with one
# configuration applied.
results_a = evaluate(
    data="eval_dataset_config_a.jsonl",
    evaluators={"relevance": RelevanceEvaluator(model_config=model_config)},
)
results_b = evaluate(
    data="eval_dataset_config_b.jsonl",
    evaluators={"relevance": RelevanceEvaluator(model_config=model_config)},
)

print(f"Config A relevance: {results_a['metrics']['relevance']}")
print(f"Config B relevance: {results_b['metrics']['relevance']}")
```
What’s happening:
- Two retrieval configurations are defined so they can be compared side by side
- The same evaluation questions are run through both configurations; each JSONL file holds the responses generated under one configuration
- The relevance scores are compared: whichever configuration produces higher relevance is better for your use case
What to A/B test
| Parameter | Test Range | Impact |
|---|---|---|
| Chunk size | 128, 256, 512, 1024 tokens | Biggest impact on quality |
| Overlap | 0%, 10%, 20%, 30% | Prevents boundary information loss |
| Similarity threshold | 0.6, 0.7, 0.8, 0.85 | Precision vs recall trade-off |
| Top-K | 3, 5, 10, 20 | More context vs noise |
| Retrieval strategy | Top-K vs MMR vs re-ranking | Quality vs latency |
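Sweeping this grid can be sketched with `itertools.product`. Here `evaluate_config` is a hypothetical stand-in for "build an index with this configuration and score it against the evaluation dataset" (for example, via an azure-ai-evaluation run):

```python
from itertools import product

def evaluate_config(config: dict) -> float:
    """Hypothetical: index with this config, run the eval dataset,
    return the relevance score."""
    ...

grid = {
    "chunk_size": [128, 256, 512, 1024],
    "overlap_pct": [0, 10, 20, 30],
    "threshold": [0.6, 0.7, 0.8, 0.85],
    "top_k": [3, 5, 10, 20],
}

# One dict per combination of parameter values
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # → 256
```

A full sweep here means 256 index builds and evaluation runs, so in practice you would fix most parameters and vary one or two at a time, starting with chunk size.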
Azure AI Search integration for RAG
Azure AI Search is the primary retrieval backend for RAG in Azure. Here’s how to configure an index with optimised chunking:
```python
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    HnswParameters,
    VectorSearchProfile,
    SemanticConfiguration,
    SemanticField,
    SemanticPrioritizedFields,
    SemanticSearch,
)

# Configure vector search with the HNSW algorithm
vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(
            name="hnsw-config",
            parameters=HnswParameters(
                m=4,                  # bi-directional link count
                ef_construction=400,  # index build quality
                ef_search=500,        # search quality
                metric="cosine",
            ),
        ),
    ],
    profiles=[
        VectorSearchProfile(
            name="vector-profile",
            algorithm_configuration_name="hnsw-config",
        ),
    ],
)

# Configure semantic search (for hybrid search)
semantic_search = SemanticSearch(
    configurations=[
        SemanticConfiguration(
            name="semantic-config",
            prioritized_fields=SemanticPrioritizedFields(
                content_fields=[SemanticField(field_name="content")],
            ),
        ),
    ],
)
```
What’s happening:
- The HNSW (Hierarchical Navigable Small World) algorithm configuration controls how vectors are indexed and searched
- The HNSW parameters tune the quality-speed trade-off: a higher efSearch (`ef_search` in the Python SDK) gives better results but slower queries
- The semantic search configuration enables hybrid search (vector + keyword combined)
Scenario: Zara optimises Atlas's legal document RAG
Atlas Consulting’s legal document chatbot is returning irrelevant results. Zara investigates:
Problem: Chunks are 1024 tokens (too large). A single chunk contains three different legal clauses. When a user asks about “termination clause,” the retrieval returns a chunk that mentions termination but is mostly about payment terms.
Zara’s A/B test:
| Configuration | Chunk Size | Overlap | Threshold | Top-K | Relevance Score |
|---|---|---|---|---|---|
| Current | 1024 tokens | 0% | 0.6 | 5 | 3.2 |
| Option A | 256 tokens | 20% | 0.75 | 5 | 4.1 |
| Option B | 512 tokens | 15% | 0.7 | 3 | 4.4 |
Winner: Option B. The 512-token chunks keep enough legal context per clause without mixing unrelated clauses. The 15% overlap prevents splitting cross-referenced provisions. Fewer results (top 3) with a moderate threshold reduce noise while keeping coverage.
Marcus Webb approves the configuration change. Relevance jumps from 3.2 to 4.4 — a 38% improvement.
Knowledge check
Atlas Consulting's chatbot retrieves 5 chunks for every query, but users complain that answers repeat the same information from different angles instead of covering all aspects of their question. Which retrieval strategy should Zara switch to?
Dr. Luca's genomics RAG system frequently splits gene descriptions across two chunks, causing incomplete retrieval. The chunks are 256 tokens with 0% overlap. What should he try first?
Next up: Embeddings & Hybrid Search — because vector search alone isn’t enough.