Evaluation: Datasets, Metrics & Quality Gates
How do you know if your GenAI solution is actually good? Learn to create test datasets, implement quality metrics — groundedness, relevance, coherence, fluency — and build automated evaluation gates.
AI-300 is a BETA exam. Content may change before general availability (~June-July 2026). This guide is based on the official study guide published by Microsoft. We’ll update as the exam evolves.
Why evaluate GenAI?
Evaluation is taste-testing before you open the restaurant.
Imagine you built a new restaurant. Before opening night, you have food critics come in and score every dish on four things:
- Freshness — is the food made from real ingredients, not made up? (Groundedness)
- What you ordered — did you get the steak you asked for, not a salad? (Relevance)
- Presentation — is the dish well-structured and logical on the plate? (Coherence)
- Taste — does it read well, no weird aftertaste? (Fluency)
You wouldn’t open a restaurant without tasting the food. Don’t deploy a GenAI app without evaluating it.
The four quality metrics
Every GenAI evaluation in Azure AI Foundry centres on four core metrics. Each is scored from 1 (worst) to 5 (best):
| What It Measures | Score 1 (Bad) | Score 5 (Good) | Most Critical For |
|---|---|---|---|
| Groundedness | Answer is fabricated / hallucinated | Every claim is supported by the provided context | RAG systems |
| Relevance | Answer ignores the question entirely | Answer directly and completely addresses the query | All GenAI apps |
| Coherence | Disjointed, contradictory, hard to follow | Logically structured, flows naturally, consistent | Long-form generation |
| Fluency | Broken grammar, unnatural phrasing | Natural, grammatically correct, reads like a human wrote it | Customer-facing apps |
Groundedness — the hallucination detector
Groundedness checks whether the model’s answer is supported by the context you provided. This is the most critical metric for RAG systems because the whole point of RAG is grounding answers in real documents.
Example:
| Context Provided | Question | Answer | Groundedness |
|---|---|---|---|
| "Azure OpenAI is available in East US, West Europe, and Japan East." | "Where is Azure OpenAI available?" | "Azure OpenAI is available in East US, West Europe, and Japan East." | 5 — fully grounded |
| "Azure OpenAI is available in East US, West Europe, and Japan East." | "Where is Azure OpenAI available?" | "Azure OpenAI is available in all Azure regions worldwide." | 1 — hallucinated |
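The real scoring is done by an LLM judge (the SDK's `GroundednessEvaluator`, shown later in this section), but the intuition can be sketched with a toy word-overlap heuristic. Everything below is illustrative only; the real evaluator does not work by word overlap:

```python
def toy_groundedness(context: str, response: str) -> int:
    """Toy heuristic: score 1-5 by the fraction of response words
    that also appear in the context. Illustrative only -- the real
    evaluator uses an LLM judge, not word matching."""
    context_words = set(context.lower().split())
    response_words = response.lower().split()
    if not response_words:
        return 1
    supported = sum(1 for w in response_words if w in context_words)
    ratio = supported / len(response_words)
    return 1 + round(ratio * 4)  # map 0.0-1.0 onto the 1-5 scale

context = "Azure OpenAI is available in East US, West Europe, and Japan East."
grounded = "Azure OpenAI is available in East US, West Europe, and Japan East."
hallucinated = "Azure OpenAI is available in all Azure regions worldwide."

print(toy_groundedness(context, grounded))      # fully supported: high score
print(toy_groundedness(context, hallucinated))  # unsupported claims: lower score
```

Even this crude check separates the two answers from the table above; the SDK's judge is far more nuanced because it evaluates claims, not words.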
Relevance — did you answer the question?
Relevance measures whether the response actually addresses what the user asked. A perfectly grounded, fluent, coherent answer can still score low on relevance if it answers the wrong question.
Coherence — does it make sense?
Coherence evaluates logical flow. Does the answer contradict itself? Is information presented in a sensible order? Does paragraph two follow from paragraph one?
Fluency — does it read well?
Fluency checks grammar, naturalness, and readability. A factually correct answer written in broken English still fails fluency.
Exam tip: Groundedness is king for RAG
In exam questions about RAG evaluation, groundedness is almost always the most important metric. The primary risk of RAG systems is hallucination — the model generating plausible-sounding answers that aren’t supported by the retrieved documents. Relevance comes second (did retrieval find the right docs?), followed by coherence and fluency.
If a question asks “which metric best detects hallucination,” the answer is groundedness.
Creating evaluation datasets
An evaluation dataset is a collection of test cases with known inputs and expected outputs. Think of it as the answer key for your exam.
Each test case typically includes:
| Field | Purpose | Example |
|---|---|---|
| query | The user's question | "What is our refund policy?" |
| context | The retrieved documents (for RAG) | "Refund policy: full refund within 30 days…" |
| response | The model's actual output | "You can get a full refund within 30 days." |
| ground_truth | The ideal/expected answer | "Full refund within 30 days of purchase." |
Creating a dataset in Python
```python
import pandas as pd

# Build evaluation dataset
eval_data = pd.DataFrame([
    {
        "query": "What is the refund policy?",
        "context": "Our refund policy allows full refunds within 30 days of purchase. After 30 days, store credit only.",
        "response": "You can get a full refund within 30 days of purchase. After that, store credit is available.",
        "ground_truth": "Full refund within 30 days. Store credit after 30 days."
    },
    {
        "query": "How do I contact support?",
        "context": "Support is available via email at help@example.com or phone 0800-HELP between 9am-5pm NZST.",
        "response": "You can reach support by emailing help@example.com or calling 0800-HELP between 9am and 5pm NZST.",
        "ground_truth": "Email help@example.com or call 0800-HELP (9am-5pm NZST)."
    }
])

# Save as JSONL for the evaluation SDK
eval_data.to_json("eval_dataset.jsonl", orient="records", lines=True)
```
What's happening:
- Each row of the DataFrame is one test case with query, context, response (the model's actual output), and ground_truth
- `to_json(..., orient="records", lines=True)` saves as JSONL (JSON Lines): one JSON object per line, the format the evaluation SDK expects
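To see what the JSONL format actually looks like on disk, here is a minimal round-trip sketch using only the standard library (the file path is a throwaway temp directory, not a real dataset location):

```python
import json
import os
import tempfile

cases = [
    {"query": "What is the refund policy?", "response": "Full refund within 30 days."},
    {"query": "How do I contact support?", "response": "Email help@example.com."},
]

# Write JSONL by hand: one json.dumps() per line, newline-separated
path = os.path.join(tempfile.mkdtemp(), "eval_dataset.jsonl")
with open(path, "w") as f:
    for case in cases:
        f.write(json.dumps(case) + "\n")

# Read it back: each line parses independently as its own JSON object
with open(path) as f:
    loaded = [json.loads(line) for line in f]

print(loaded == cases)  # the round trip preserves every test case
```

This is exactly the shape `eval_data.to_json(..., lines=True)` produces, which is why a JSONL dataset can be streamed one test case at a time.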
Data mapping
Data mapping tells the evaluator which columns in your dataset map to which evaluation inputs. This is critical when your column names don’t match the defaults.
```python
# Column mapping: evaluator inputs → your dataset columns
column_mapping = {
    "query": "query",               # User question
    "context": "context",           # Retrieved documents
    "response": "response",         # Model output
    "ground_truth": "ground_truth"  # Expected answer
}
```
What’s happening:
- The left side is the evaluator’s expected input name
- The right side is your dataset’s column name
- If your dataset uses `user_question` instead of `query`, you'd map `"query": "user_question"`
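That renaming step can be sketched in plain Python. The column names here (`user_question`, `retrieved_docs`, `bot_answer`) are hypothetical, standing in for a dataset that doesn't use the defaults:

```python
# Hypothetical dataset row whose columns don't match the evaluator defaults
row = {
    "user_question": "What is the refund policy?",
    "retrieved_docs": "Full refunds within 30 days of purchase.",
    "bot_answer": "You can get a full refund within 30 days.",
}

# Evaluator input name -> your dataset's column name
column_mapping = {
    "query": "user_question",
    "context": "retrieved_docs",
    "response": "bot_answer",
}

# Apply the mapping: build the inputs the evaluator expects
evaluator_inputs = {key: row[col] for key, col in column_mapping.items()}
print(evaluator_inputs["query"])  # the user question, under its standard name
```

The SDK does this translation for you when you pass `column_mapping` in `evaluator_config`; the sketch just makes the direction of the mapping concrete.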
Running evaluations with the SDK
```python
from azure.ai.evaluation import (
    evaluate,
    GroundednessEvaluator,
    RelevanceEvaluator,
    CoherenceEvaluator,
    FluencyEvaluator,
)

# Run evaluation across your dataset
results = evaluate(
    data="eval_dataset.jsonl",
    evaluators={
        "groundedness": GroundednessEvaluator(model_config=model_config),
        "relevance": RelevanceEvaluator(model_config=model_config),
        "coherence": CoherenceEvaluator(model_config=model_config),
        "fluency": FluencyEvaluator(model_config=model_config),
    },
    evaluator_config={
        "groundedness": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${data.context}",
                "response": "${data.response}",
            }
        },
        "default": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}",
            }
        },
    },
)

# Check aggregate scores
print(results["metrics"])
# Output: groundedness: 4.2, relevance: 4.5, coherence: 4.8, fluency: 4.9
```
What's happening:
- The four core evaluators are imported from the azure-ai-evaluation SDK
- A single `evaluate()` call runs every evaluator against every row of the dataset; evaluator instances are created inline
- `evaluator_config` with `column_mapping` tells each evaluator which dataset fields to read. Groundedness needs query, context, and response; the default mapping (used by relevance, coherence, and fluency) needs query and response
- `results["metrics"]` holds the aggregate scores across all test cases: one number per metric
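Those aggregate numbers are just per-metric averages over every test case. A sketch with made-up per-row scores shows how the roll-up works:

```python
# Hypothetical per-test-case scores, in the 1-5 range the evaluators emit
row_scores = [
    {"groundedness": 5, "relevance": 5, "coherence": 5, "fluency": 5},
    {"groundedness": 3, "relevance": 4, "coherence": 5, "fluency": 5},
    {"groundedness": 5, "relevance": 5, "coherence": 4, "fluency": 5},
]

# Aggregate = mean of each metric across all test cases
metrics = {
    name: sum(r[name] for r in row_scores) / len(row_scores)
    for name in row_scores[0]
}
print(metrics)  # one averaged number per metric
```

Per-row results matter too: a 4.2 average can hide a handful of score-1 hallucinations, which is why the failing cases are worth reading individually.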
Scenario: Zara evaluates Atlas's client-facing chatbot
Zara Okonkwo, GenAI engineer at Atlas Consulting, is preparing to launch their client-facing chatbot. Marcus Webb (her lead) sets the quality bar:
- Groundedness must be at least 4.0 (no hallucinated legal advice)
- Relevance must be at least 4.5 (clients expect precise answers)
- Coherence must be at least 4.0
- Fluency must be at least 4.5 (professional communication)
Zara creates a dataset of 200 test cases covering the top client questions, runs evaluation, and gets:
| Metric | Score | Threshold | Pass? |
|---|---|---|---|
| Groundedness | 3.8 | 4.0 | Fail |
| Relevance | 4.6 | 4.5 | Pass |
| Coherence | 4.7 | 4.0 | Pass |
| Fluency | 4.8 | 4.5 | Pass |
Groundedness failed — the chatbot occasionally cites information not in the retrieved documents. Zara investigates the failing test cases and finds the retrieval step is returning irrelevant chunks. She tunes the chunk size (next domain!) and re-runs evaluation until groundedness hits 4.2.
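Marcus's quality bar is easy to turn into an automated gate. A minimal sketch, using the scores from Zara's first run:

```python
# Marcus's quality bar: minimum acceptable score per metric
thresholds = {"groundedness": 4.0, "relevance": 4.5, "coherence": 4.0, "fluency": 4.5}

# Zara's first evaluation run
scores = {"groundedness": 3.8, "relevance": 4.6, "coherence": 4.7, "fluency": 4.8}

# Any metric below its threshold fails the gate
failures = {metric: s for metric, s in scores.items() if s < thresholds[metric]}

if failures:
    print(f"Quality gate FAILED: {failures}")  # block the deployment
else:
    print("Quality gate passed -- safe to deploy")
```

Wired into a CI pipeline, a non-empty `failures` dict would fail the build, which is exactly what "automated evaluation gate" means in practice.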
Exam tip: What each evaluator needs
Different evaluators have different data requirements:
- Groundedness needs context to check the response against. Without context, it cannot detect hallucinations.
- Relevance, coherence, and fluency need the query and response but do NOT require ground truth.
- Ground truth (the ideal answer) is useful for similarity-based comparisons but is not required by the built-in AI-judged evaluators.
The exam tests whether you understand these data dependencies — particularly that groundedness requires context.
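Those data dependencies can double as a pre-flight check before you run an expensive evaluation. A sketch with a hypothetical `missing_fields` helper (the required-field sets mirror the list above):

```python
# Required dataset fields per evaluator; ground_truth is not required by any
REQUIRED_FIELDS = {
    "groundedness": {"query", "context", "response"},
    "relevance": {"query", "response"},
    "coherence": {"query", "response"},
    "fluency": {"query", "response"},
}

def missing_fields(evaluator: str, case: dict) -> set:
    """Return the fields a test case lacks for a given evaluator."""
    return REQUIRED_FIELDS[evaluator] - case.keys()

case = {"query": "Where is Azure OpenAI available?", "response": "East US."}

print(missing_fields("relevance", case))     # empty: ready to evaluate
print(missing_fields("groundedness", case))  # no context: cannot detect hallucination
```

Running this over every row of a dataset catches the classic mistake of sending a context-free dataset to the groundedness evaluator.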
Knowledge check
Zara's chatbot evaluation shows a groundedness score of 2.1 but relevance of 4.8. What does this tell her about the system?
Dr. Fatima is setting up evaluation for Meridian's financial advice chatbot. She has a dataset with columns: 'customer_question', 'retrieved_docs', 'bot_answer', and 'approved_response'. How should she map these to the groundedness evaluator?
Next up: Safety Evaluations & Custom Metrics — because quality isn’t just about accuracy, it’s about safety.