
AI-300 Study Guide

Domain 1: Design and Implement an MLOps Infrastructure

  • ML Workspace: Your AI Control Room Free
  • Data, Environments & Components
  • Compute Targets: Choosing the Right Engine
  • Infrastructure as Code: Provisioning at Scale
  • Git & CI/CD for ML Projects

Domain 2: Implement Machine Learning Model Lifecycle and Operations

  • MLflow: Track Every Experiment Free
  • AutoML & Hyperparameter Tuning
  • Training Pipelines: Automate Everything
  • Distributed Training: Scale to Big Data
  • Model Registration & Versioning
  • Model Approval & Responsible AI Gates
  • Deploying Models: Endpoints in Production
  • Drift, Monitoring & Retraining

Domain 3: Design and Implement a GenAIOps Infrastructure

  • Foundry: Hubs, Projects & Platform Setup Free
  • Network Security & IaC for Foundry
  • Deploying Foundation Models
  • Model Versioning & Production Strategies
  • PromptOps: Design, Compare, Version & Ship

Domain 4: Implement Generative AI Quality Assurance and Observability

  • Evaluation: Datasets, Metrics & Quality Gates Free
  • Safety Evaluations & Custom Metrics
  • Monitoring GenAI in Production
  • Cost Tracking, Logging & Debugging

Domain 5: Optimize Generative AI Systems and Model Performance

  • RAG Optimization: Better Retrieval, Better Answers Free
  • Embeddings & Hybrid Search
  • Fine-Tuning: Methods, Data & Production

Domain 4: Implement Generative AI Quality Assurance and Observability (~14 min read)

Evaluation: Datasets, Metrics & Quality Gates

How do you know if your GenAI solution is actually good? Learn to create test datasets, implement quality metrics — groundedness, relevance, coherence, fluency — and build automated evaluation gates.

AI-300 is a BETA exam. Content may change before general availability (~June-July 2026). This guide is based on the official study guide published by Microsoft. We’ll update as the exam evolves.

Why evaluate GenAI?

☕ Simple explanation

Evaluation is taste-testing before you open the restaurant.

Imagine you built a new restaurant. Before opening night, you have food critics come in and score every dish on four things:

  • Freshness — is the food made from real ingredients, not made up? (Groundedness)
  • What you ordered — did you get the steak you asked for, not a salad? (Relevance)
  • Presentation — is the dish well-structured and logical on the plate? (Coherence)
  • Taste — is it smooth going down, no weird aftertaste? (Fluency)

You wouldn’t open a restaurant without tasting the food. Don’t deploy a GenAI app without evaluating it.

GenAI evaluation is the systematic process of measuring model output quality against defined criteria. Unlike traditional ML (where accuracy/F1 suffice), GenAI outputs are free-form text, requiring multi-dimensional quality assessment:

  • Groundedness — factual alignment with provided context
  • Relevance — semantic alignment with the user query
  • Coherence — logical structure and internal consistency
  • Fluency — grammatical correctness and naturalness

Azure AI Foundry provides built-in evaluators for all four metrics, scored on a 1-5 scale, and supports custom evaluators for domain-specific needs.

The four quality metrics

Every GenAI evaluation in Azure AI Foundry centres on four core metrics. Each is scored from 1 (worst) to 5 (best):

The four core GenAI quality metrics

Metric       | Score 1 (Bad)                             | Score 5 (Good)                                              | Most Critical For
Groundedness | Answer is fabricated / hallucinated       | Every claim is supported by the provided context            | RAG systems
Relevance    | Answer ignores the question entirely      | Answer directly and completely addresses the query          | All GenAI apps
Coherence    | Disjointed, contradictory, hard to follow | Logically structured, flows naturally, consistent           | Long-form generation
Fluency      | Broken grammar, unnatural phrasing        | Natural, grammatically correct, reads like a human wrote it | Customer-facing apps
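Under the hood, an evaluation run scores each test case 1-5 on each metric, then averages across the dataset. A minimal plain-Python sketch of that roll-up (illustrative only, not the SDK's implementation; the scores are made up):

```python
from statistics import mean

def aggregate_scores(rows: list[dict]) -> dict:
    """Average each 1-5 metric score across all test cases."""
    metrics = rows[0].keys()
    return {m: round(mean(r[m] for r in rows), 2) for m in metrics}

# Per-test-case scores from a (made-up) two-case run
row_scores = [
    {"groundedness": 4.0, "relevance": 4.6, "coherence": 4.7, "fluency": 4.9},
    {"groundedness": 3.6, "relevance": 4.4, "coherence": 4.7, "fluency": 4.7},
]
print(aggregate_scores(row_scores))
```

The single aggregate number per metric is what you compare against a quality threshold; the per-row scores are what you drill into when a metric fails.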

Groundedness — the hallucination detector

Groundedness checks whether the model’s answer is supported by the context you provided. This is the most critical metric for RAG systems because the whole point of RAG is grounding answers in real documents.

Example:

Context Provided                                                     | Question                           | Answer                                                               | Groundedness
"Azure OpenAI is available in East US, West Europe, and Japan East." | "Where is Azure OpenAI available?" | "Azure OpenAI is available in East US, West Europe, and Japan East." | 5 — fully grounded
"Azure OpenAI is available in East US, West Europe, and Japan East." | "Where is Azure OpenAI available?" | "Azure OpenAI is available in all Azure regions worldwide."          | 1 — hallucinated

Relevance — did you answer the question?

Relevance measures whether the response actually addresses what the user asked. A perfectly grounded, fluent, coherent answer can still score low on relevance if it answers the wrong question.

Coherence — does it make sense?

Coherence evaluates logical flow. Does the answer contradict itself? Is information presented in a sensible order? Does paragraph two follow from paragraph one?

Fluency — does it read well?

Fluency checks grammar, naturalness, and readability. A factually correct answer written in broken English still fails fluency.

💡 Exam tip: Groundedness is king for RAG

In exam questions about RAG evaluation, groundedness is almost always the most important metric. The primary risk of RAG systems is hallucination — the model generating plausible-sounding answers that aren’t supported by the retrieved documents. Relevance comes second (did retrieval find the right docs?), followed by coherence and fluency.

If a question asks “which metric best detects hallucination,” the answer is groundedness.

Creating evaluation datasets

An evaluation dataset is a collection of test cases with known inputs and expected outputs. Think of it as the answer key for your exam.

Each test case typically includes:

Field        | Purpose                           | Example
query        | The user's question               | "What is our refund policy?"
context      | The retrieved documents (for RAG) | "Refund policy: full refund within 30 days…"
response     | The model's actual output         | "You can get a full refund within 30 days."
ground_truth | The ideal/expected answer         | "Full refund within 30 days of purchase."

Creating a dataset in Python

import pandas as pd

# Build evaluation dataset
eval_data = pd.DataFrame([
    {
        "query": "What is the refund policy?",
        "context": "Our refund policy allows full refunds within 30 days of purchase. After 30 days, store credit only.",
        "response": "You can get a full refund within 30 days of purchase. After that, store credit is available.",
        "ground_truth": "Full refund within 30 days. Store credit after 30 days."
    },
    {
        "query": "How do I contact support?",
        "context": "Support is available via email at help@example.com or phone 0800-HELP between 9am-5pm NZST.",
        "response": "You can reach support by emailing help@example.com or calling 0800-HELP between 9am and 5pm NZST.",
        "ground_truth": "Email help@example.com or call 0800-HELP (9am-5pm NZST)."
    }
])

# Save as JSONL for the evaluation SDK
eval_data.to_json("eval_dataset.jsonl", orient="records", lines=True)

What’s happening:

  • Each DataFrame row is one test case with a query, context, response (the model’s actual output), and ground_truth
  • to_json(..., orient="records", lines=True) saves the data as JSONL (JSON Lines) — one JSON object per line, the format the evaluation SDK expects
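JSONL itself is nothing exotic: one complete JSON object per line of plain text. If you want to see the format without pandas, here is a stdlib-only sketch (the file name and test cases are illustrative):

```python
import json

test_cases = [
    {"query": "What is the refund policy?",
     "response": "Full refund within 30 days.",
     "ground_truth": "Full refund within 30 days."},
    {"query": "How do I contact support?",
     "response": "Email help@example.com.",
     "ground_truth": "Email help@example.com or call 0800-HELP."},
]

# Write: one compact JSON object per line
with open("eval_dataset.jsonl", "w", encoding="utf-8") as f:
    for case in test_cases:
        f.write(json.dumps(case) + "\n")

# Read it back, one line at a time
with open("eval_dataset.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

assert loaded == test_cases  # round-trips exactly
```

Because each line is independent, JSONL files can be streamed and appended to without re-parsing the whole file, which is why evaluation tooling favours it over a single JSON array.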

Data mapping

Data mapping tells the evaluator which columns in your dataset map to which evaluation inputs. This is critical when your column names don’t match the defaults.

# Column mapping: evaluator input name → your dataset's column name
column_mapping = {
    "query": "query",               # User question
    "context": "context",           # Retrieved documents
    "response": "response",         # Model output
    "ground_truth": "ground_truth"  # Expected answer
}

What’s happening:

  • The left side is the evaluator’s expected input name
  • The right side is your dataset’s column name
  • If your dataset uses “user_question” instead of “query,” you’d map “query” to “user_question”
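To make the mapping concrete, here is a small hypothetical helper (not part of the SDK; the function and variable names are invented for illustration) that renames a dataset row's columns to the evaluator's expected input names:

```python
def apply_column_mapping(row: dict, mapping: dict) -> dict:
    """Rename a dataset row's columns to the evaluator's expected input names.

    mapping: evaluator input name -> dataset column name
    """
    return {evaluator_input: row[dataset_column]
            for evaluator_input, dataset_column in mapping.items()}

# A dataset that uses its own column names
row = {"user_question": "What is the refund policy?",
       "docs": "Full refund within 30 days.",
       "bot_answer": "You can get a full refund within 30 days."}

mapping = {"query": "user_question", "context": "docs", "response": "bot_answer"}

print(apply_column_mapping(row, mapping))
# {'query': 'What is the refund policy?', 'context': 'Full refund within 30 days.', 'response': 'You can get a full refund within 30 days.'}
```

The SDK does the equivalent renaming for you when you pass a column_mapping; you never transform the dataset by hand.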

Running evaluations with the SDK

from azure.ai.evaluation import (
    evaluate,
    AzureOpenAIModelConfiguration,
    GroundednessEvaluator,
    RelevanceEvaluator,
    CoherenceEvaluator,
    FluencyEvaluator,
)

# Config for the AI judge model that scores each test case
# (fill in your own endpoint, key, and deployment name)
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    azure_deployment="gpt-4o",
)

# Run evaluation across your dataset
results = evaluate(
    data="eval_dataset.jsonl",
    evaluators={
        "groundedness": GroundednessEvaluator(model_config=model_config),
        "relevance": RelevanceEvaluator(model_config=model_config),
        "coherence": CoherenceEvaluator(model_config=model_config),
        "fluency": FluencyEvaluator(model_config=model_config),
    },
    evaluator_config={
        "groundedness": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${data.context}",
                "response": "${data.response}",
            }
        },
        "default": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}",
            }
        },
    },
)

# Check aggregate scores
print(results["metrics"])
# Example output: groundedness: 4.2, relevance: 4.5, coherence: 4.8, fluency: 4.9

What’s happening:

  • The four core evaluators come from the azure-ai-evaluation SDK and are instantiated inline, each given the judge model’s model_config
  • A single evaluate() call runs every evaluator against every row of the JSONL dataset
  • evaluator_config supplies a column_mapping per evaluator; "${data.query}" means “read the query column of the input data.” Groundedness needs query, context, and response, while the default mapping (used by relevance, coherence, and fluency) needs only query and response
  • results["metrics"] holds the aggregate scores across all test cases — one number per metric

Scenario: Zara evaluates Atlas's client-facing chatbot

Zara Okonkwo, GenAI engineer at Atlas Consulting, is preparing to launch their client-facing chatbot. Marcus Webb (her lead) sets the quality bar:

  • Groundedness must be at least 4.0 (no hallucinated legal advice)
  • Relevance must be at least 4.5 (clients expect precise answers)
  • Coherence must be at least 4.0
  • Fluency must be at least 4.5 (professional communication)

Zara creates a dataset of 200 test cases covering the top client questions, runs evaluation, and gets:

Metric       | Score | Threshold | Pass?
Groundedness | 3.8   | 4.0       | Fail
Relevance    | 4.6   | 4.5       | Pass
Coherence    | 4.7   | 4.0       | Pass
Fluency      | 4.8   | 4.5       | Pass

Groundedness failed — the chatbot occasionally cites information not in the retrieved documents. Zara investigates the failing test cases and finds the retrieval step is returning irrelevant chunks. She tunes the chunk size (next domain!) and re-runs evaluation until groundedness hits 4.2.
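Thresholds like Marcus's are easy to automate as a gate in a CI pipeline. A sketch, assuming the aggregate metrics arrive as a plain dict (the helper itself is illustrative, not SDK code):

```python
def check_quality_gate(metrics: dict, thresholds: dict) -> list:
    """Return the names of metrics that fall below their minimum."""
    return [name for name, minimum in thresholds.items()
            if metrics.get(name, 0.0) < minimum]

# Marcus's quality bar
thresholds = {"groundedness": 4.0, "relevance": 4.5,
              "coherence": 4.0, "fluency": 4.5}

# Zara's first evaluation run
metrics = {"groundedness": 3.8, "relevance": 4.6,
           "coherence": 4.7, "fluency": 4.8}

failures = check_quality_gate(metrics, thresholds)
print("PASS" if not failures else f"FAIL: {failures}")  # FAIL: ['groundedness']
```

In a real pipeline the FAIL branch would block the deployment step, which is exactly what "quality gate" means: no release until every metric clears its threshold.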

💡 Exam tip: What each evaluator needs

Different evaluators have different data requirements:

  • Groundedness needs context to check the response against. Without context, it cannot detect hallucinations.
  • Relevance, coherence, and fluency need the query and response but do NOT require ground truth.
  • Ground truth (the ideal answer) is useful for similarity-based comparisons but is not required by the built-in AI-judged evaluators.

The exam tests whether you understand these data dependencies — particularly that groundedness requires context.
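These data dependencies can be encoded as a quick pre-flight check before an evaluation run. The field names below mirror the evaluators' inputs, but the helper itself is a hypothetical sketch:

```python
# Required dataset fields per evaluator (context is mandatory
# for groundedness only; ground_truth is not required by any of them)
REQUIRED_FIELDS = {
    "groundedness": {"query", "context", "response"},
    "relevance":    {"query", "response"},
    "coherence":    {"query", "response"},
    "fluency":      {"query", "response"},
}

def missing_fields(evaluator: str, row: dict) -> set:
    """Fields the given evaluator needs that this test case lacks."""
    return REQUIRED_FIELDS[evaluator] - row.keys()

row = {"query": "Where is Azure OpenAI available?",
       "response": "East US, West Europe, and Japan East."}

print(missing_fields("relevance", row))     # set()
print(missing_fields("groundedness", row))  # {'context'}
```

Running a check like this before a long evaluation job catches the classic mistake of asking for groundedness scores from a dataset that never captured the retrieved context.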

Key terms flashcards

Question

What are the four core GenAI quality metrics?


Answer

Groundedness (is it based on provided context?), Relevance (does it answer the question?), Coherence (is it logically structured?), Fluency (is the language natural and correct?). Each scored 1-5.


Question

What is groundedness and why is it critical for RAG?


Answer

Groundedness measures whether the model's answer is supported by the provided context. It's the primary hallucination detector — critical for RAG because the whole point is grounding answers in retrieved documents.


Question

What fields does an evaluation dataset need?


Answer

Query (user question), Context (retrieved documents), Response (model output), Ground Truth (expected answer). Context is required for groundedness; ground truth is optional for the built-in AI-judged metrics and mainly useful for similarity-based comparisons.


Question

What is data mapping in GenAI evaluation?


Answer

Data mapping tells the evaluator which columns in your dataset correspond to which evaluation inputs (query, context, response, ground_truth). Required when your column names differ from the SDK defaults.


Knowledge check

  • Zara's chatbot evaluation shows a groundedness score of 2.1 but relevance of 4.8. What does this tell her about the system?

  • Dr. Fatima is setting up evaluation for Meridian's financial advice chatbot. She has a dataset with columns 'customer_question', 'retrieved_docs', 'bot_answer', and 'approved_response'. How should she map these to the groundedness evaluator?

🎬 Video coming soon


Next up: Safety Evaluations & Custom Metrics — because quality isn’t just about accuracy, it’s about safety.


Guided

I learn, I simplify, I share.


© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.