Evaluation: Datasets, Metrics & Quality Gates
How do you know if your GenAI solution is actually good? Learn to create test datasets, implement quality metrics — groundedness, relevance, coherence, fluency — and build automated evaluation gates.
AI-300 is a BETA exam. Content may change before general availability (~June-July 2026). This guide is based on the official study guide published by Microsoft. We’ll update as the exam evolves.
Why evaluate GenAI?
Evaluation is taste-testing before you open the restaurant.
Imagine you built a new restaurant. Before opening night, you have food critics come in and score every dish on four things:
- Freshness — is the food made from real ingredients, not made up? (Groundedness)
- What you ordered — did you get the steak you asked for, not a salad? (Relevance)
- Presentation — is the dish well-structured and logical on the plate? (Coherence)
- Taste — does it read well, no weird aftertaste? (Fluency)
You wouldn’t open a restaurant without tasting the food. Don’t deploy a GenAI app without evaluating it.
The four quality metrics
Every GenAI evaluation in Azure AI Foundry centres on four core metrics. Each is scored from 1 (worst) to 5 (best):
| What It Measures | Score 1 (Bad) | Score 5 (Good) | Most Critical For |
|---|---|---|---|
| Groundedness | Answer is fabricated / hallucinated | Every claim is supported by the provided context | RAG systems |
| Relevance | Answer ignores the question entirely | Answer directly and completely addresses the query | All GenAI apps |
| Coherence | Disjointed, contradictory, hard to follow | Logically structured, flows naturally, consistent | Long-form generation |
| Fluency | Broken grammar, unnatural phrasing | Natural, grammatically correct, reads like a human wrote it | Customer-facing apps |
Groundedness — the hallucination detector
Groundedness checks whether the model’s answer is supported by the context you provided. This is the most critical metric for RAG systems because the whole point of RAG is grounding answers in real documents.
Example:
| Context Provided | Question | Answer | Groundedness |
|---|---|---|---|
| "Azure OpenAI is available in East US, West Europe, and Japan East." | "Where is Azure OpenAI available?" | "Azure OpenAI is available in East US, West Europe, and Japan East." | 5 — fully grounded |
| "Azure OpenAI is available in East US, West Europe, and Japan East." | "Where is Azure OpenAI available?" | "Azure OpenAI is available in all Azure regions worldwide." | 1 — hallucinated |
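The real scoring is done by an LLM judge (the SDK's `GroundednessEvaluator`, shown later in this section), but the intuition can be sketched with a toy word-overlap heuristic. Everything below is illustrative only; the real evaluator does not work by word overlap:

```python
def toy_groundedness(context: str, response: str) -> int:
    """Toy heuristic: score 1-5 by the fraction of response words
    that also appear in the context. Illustrative only -- the real
    evaluator uses an LLM judge, not word matching."""
    context_words = set(context.lower().split())
    response_words = response.lower().split()
    if not response_words:
        return 1
    supported = sum(1 for w in response_words if w in context_words)
    ratio = supported / len(response_words)
    return 1 + round(ratio * 4)  # map 0.0-1.0 onto the 1-5 scale

context = "Azure OpenAI is available in East US, West Europe, and Japan East."
grounded = "Azure OpenAI is available in East US, West Europe, and Japan East."
hallucinated = "Azure OpenAI is available in all Azure regions worldwide."

print(toy_groundedness(context, grounded))      # fully supported: high score
print(toy_groundedness(context, hallucinated))  # unsupported claims: lower score
```

Even this crude check separates the two answers from the table above; the SDK's judge is far more nuanced because it evaluates claims, not words.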
Relevance — did you answer the question?
Relevance measures whether the response actually addresses what the user asked. A perfectly grounded, fluent, coherent answer can still score low on relevance if it answers the wrong question.
Coherence — does it make sense?
Coherence evaluates logical flow. Does the answer contradict itself? Is information presented in a sensible order? Does paragraph two follow from paragraph one?
Fluency — does it read well?
Fluency checks grammar, naturalness, and readability. A factually correct answer written in broken English still fails fluency.
Exam tip: Groundedness is king for RAG
In exam questions about RAG evaluation, groundedness is almost always the most important metric. The primary risk of RAG systems is hallucination — the model generating plausible-sounding answers that aren’t supported by the retrieved documents. Relevance comes second (did retrieval find the right docs?), followed by coherence and fluency.
If a question asks “which metric best detects hallucination,” the answer is groundedness.
Creating evaluation datasets
An evaluation dataset is a collection of test cases with known inputs and expected outputs. Think of it as the answer key for your exam.
Each test case typically includes:
| Field | Purpose | Example |
|---|---|---|
| query | The user's question | "What is our refund policy?" |
| context | The retrieved documents (for RAG) | "Refund policy: full refund within 30 days…" |
| response | The model's actual output | "You can get a full refund within 30 days." |
| ground_truth | The ideal/expected answer | "Full refund within 30 days of purchase." |
Creating a dataset in Python
```python
import pandas as pd

# Build evaluation dataset
eval_data = pd.DataFrame([
    {
        "query": "What is the refund policy?",
        "context": "Our refund policy allows full refunds within 30 days of purchase. After 30 days, store credit only.",
        "response": "You can get a full refund within 30 days of purchase. After that, store credit is available.",
        "ground_truth": "Full refund within 30 days. Store credit after 30 days."
    },
    {
        "query": "How do I contact support?",
        "context": "Support is available via email at help@example.com or phone 0800-HELP between 9am-5pm NZST.",
        "response": "You can reach support by emailing help@example.com or calling 0800-HELP between 9am and 5pm NZST.",
        "ground_truth": "Email help@example.com or call 0800-HELP (9am-5pm NZST)."
    }
])

# Save as JSONL for the evaluation SDK
eval_data.to_json("eval_dataset.jsonl", orient="records", lines=True)
```
What's happening:
- Each row of the DataFrame is one test case with query, context, response (the model's actual output), and ground_truth
- `to_json(..., orient="records", lines=True)` saves as JSONL (JSON Lines): one JSON object per line, the format the evaluation SDK expects
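To see what the JSONL format actually looks like on disk, here is a minimal round-trip sketch using only the standard library (the file path is a throwaway temp directory, not a real dataset location):

```python
import json
import os
import tempfile

cases = [
    {"query": "What is the refund policy?", "response": "Full refund within 30 days."},
    {"query": "How do I contact support?", "response": "Email help@example.com."},
]

# Write JSONL by hand: one json.dumps() per line, newline-separated
path = os.path.join(tempfile.mkdtemp(), "eval_dataset.jsonl")
with open(path, "w") as f:
    for case in cases:
        f.write(json.dumps(case) + "\n")

# Read it back: each line parses independently as its own JSON object
with open(path) as f:
    loaded = [json.loads(line) for line in f]

print(loaded == cases)  # the round trip preserves every test case
```

This is exactly the shape `eval_data.to_json(..., lines=True)` produces, which is why a JSONL dataset can be streamed one test case at a time.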
Data mapping
Data mapping tells the evaluator which columns in your dataset map to which evaluation inputs. This is critical when your column names don’t match the defaults.
```python
# Column mapping: evaluator inputs → your dataset columns
column_mapping = {
    "query": "query",               # User question
    "context": "context",           # Retrieved documents
    "response": "response",         # Model output
    "ground_truth": "ground_truth"  # Expected answer
}
```
What’s happening:
- The left side is the evaluator’s expected input name
- The right side is your dataset’s column name
- If your dataset uses `user_question` instead of `query`, you'd map `"query": "user_question"`
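That renaming step can be sketched in plain Python. The column names here (`user_question`, `retrieved_docs`, `bot_answer`) are hypothetical, standing in for a dataset that doesn't use the defaults:

```python
# Hypothetical dataset row whose columns don't match the evaluator defaults
row = {
    "user_question": "What is the refund policy?",
    "retrieved_docs": "Full refunds within 30 days of purchase.",
    "bot_answer": "You can get a full refund within 30 days.",
}

# Evaluator input name -> your dataset's column name
column_mapping = {
    "query": "user_question",
    "context": "retrieved_docs",
    "response": "bot_answer",
}

# Apply the mapping: build the inputs the evaluator expects
evaluator_inputs = {key: row[col] for key, col in column_mapping.items()}
print(evaluator_inputs["query"])  # the user question, under its standard name
```

The SDK does this translation for you when you pass `column_mapping` in `evaluator_config`; the sketch just makes the direction of the mapping concrete.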
Running evaluations with the SDK
```python
from azure.ai.evaluation import (
    evaluate,
    GroundednessEvaluator,
    RelevanceEvaluator,
    CoherenceEvaluator,
    FluencyEvaluator,
)

# Run evaluation across your dataset
results = evaluate(
    data="eval_dataset.jsonl",
    evaluators={
        "groundedness": GroundednessEvaluator(model_config=model_config),
        "relevance": RelevanceEvaluator(model_config=model_config),
        "coherence": CoherenceEvaluator(model_config=model_config),
        "fluency": FluencyEvaluator(model_config=model_config),
    },
    evaluator_config={
        "groundedness": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${data.context}",
                "response": "${data.response}",
            }
        },
        "default": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}",
            }
        },
    },
)

# Check aggregate scores
print(results["metrics"])
# Output: groundedness: 4.2, relevance: 4.5, coherence: 4.8, fluency: 4.9
```
What's happening:
- The four core evaluators are imported from the azure-ai-evaluation SDK
- A single `evaluate()` call runs every evaluator against every row of the dataset; evaluator instances are created inline
- `evaluator_config` with `column_mapping` tells each evaluator which dataset fields to read. Groundedness needs query, context, and response; the default mapping (used by relevance, coherence, and fluency) needs query and response
- `results["metrics"]` holds the aggregate scores across all test cases: one number per metric
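Those aggregate numbers are just per-metric averages over every test case. A sketch with made-up per-row scores shows how the roll-up works:

```python
# Hypothetical per-test-case scores, in the 1-5 range the evaluators emit
row_scores = [
    {"groundedness": 5, "relevance": 5, "coherence": 5, "fluency": 5},
    {"groundedness": 3, "relevance": 4, "coherence": 5, "fluency": 5},
    {"groundedness": 5, "relevance": 5, "coherence": 4, "fluency": 5},
]

# Aggregate = mean of each metric across all test cases
metrics = {
    name: sum(r[name] for r in row_scores) / len(row_scores)
    for name in row_scores[0]
}
print(metrics)  # one averaged number per metric
```

Per-row results matter too: a 4.2 average can hide a handful of score-1 hallucinations, which is why the failing cases are worth reading individually.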
Scenario: Zara evaluates Atlas's client-facing chatbot
Zara Okonkwo, GenAI engineer at Atlas Consulting, is preparing to launch their client-facing chatbot. Marcus Webb (her lead) sets the quality bar:
- Groundedness must be at least 4.0 (no hallucinated legal advice)
- Relevance must be at least 4.5 (clients expect precise answers)
- Coherence must be at least 4.0
- Fluency must be at least 4.5 (professional communication)
Zara creates a dataset of 200 test cases covering the top client questions, runs evaluation, and gets:
| Metric | Score | Threshold | Pass? |
|---|---|---|---|
| Groundedness | 3.8 | 4.0 | Fail |
| Relevance | 4.6 | 4.5 | Pass |
| Coherence | 4.7 | 4.0 | Pass |
| Fluency | 4.8 | 4.5 | Pass |
Groundedness failed — the chatbot occasionally cites information not in the retrieved documents. Zara investigates the failing test cases and finds the retrieval step is returning irrelevant chunks. She tunes the chunk size (next domain!) and re-runs evaluation until groundedness hits 4.2.
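Marcus's quality bar is easy to turn into an automated gate. A minimal sketch, using the scores from Zara's first run:

```python
# Marcus's quality bar: minimum acceptable score per metric
thresholds = {"groundedness": 4.0, "relevance": 4.5, "coherence": 4.0, "fluency": 4.5}

# Zara's first evaluation run
scores = {"groundedness": 3.8, "relevance": 4.6, "coherence": 4.7, "fluency": 4.8}

# Any metric below its threshold fails the gate
failures = {metric: s for metric, s in scores.items() if s < thresholds[metric]}

if failures:
    print(f"Quality gate FAILED: {failures}")  # block the deployment
else:
    print("Quality gate passed -- safe to deploy")
```

Wired into a CI pipeline, a non-empty `failures` dict would fail the build, which is exactly what "automated evaluation gate" means in practice.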
Exam tip: What each evaluator needs
Different evaluators have different data requirements:
- Groundedness needs context to check the response against. Without context, it cannot detect hallucinations.
- Relevance, coherence, and fluency need the query and response but do NOT require ground truth.
- Ground truth (the ideal answer) is useful for similarity-based comparisons but is not required by the built-in AI-judged evaluators.
The exam tests whether you understand these data dependencies — particularly that groundedness requires context.
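Those data dependencies can double as a pre-flight check before you run an expensive evaluation. A sketch with a hypothetical `missing_fields` helper (the required-field sets mirror the list above):

```python
# Required dataset fields per evaluator; ground_truth is not required by any
REQUIRED_FIELDS = {
    "groundedness": {"query", "context", "response"},
    "relevance": {"query", "response"},
    "coherence": {"query", "response"},
    "fluency": {"query", "response"},
}

def missing_fields(evaluator: str, case: dict) -> set:
    """Return the fields a test case lacks for a given evaluator."""
    return REQUIRED_FIELDS[evaluator] - case.keys()

case = {"query": "Where is Azure OpenAI available?", "response": "East US."}

print(missing_fields("relevance", case))     # empty: ready to evaluate
print(missing_fields("groundedness", case))  # no context: cannot detect hallucination
```

Running this over every row of a dataset catches the classic mistake of sending a context-free dataset to the groundedness evaluator.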
Knowledge check
Zara's chatbot evaluation shows a groundedness score of 2.1 but relevance of 4.8. What does this tell her about the system?
Dr. Fatima is setting up evaluation for Meridian's financial advice chatbot. She has a dataset with columns: 'customer_question', 'retrieved_docs', 'bot_answer', and 'approved_response'. How should she map these to the groundedness evaluator?
Next up: Safety Evaluations & Custom Metrics — because quality isn’t just about accuracy, it’s about safety.