
AI-300 Study Guide

Domain 1: Design and Implement an MLOps Infrastructure

  • ML Workspace: Your AI Control Room (free)
  • Data, Environments & Components
  • Compute Targets: Choosing the Right Engine
  • Infrastructure as Code: Provisioning at Scale
  • Git & CI/CD for ML Projects

Domain 2: Implement Machine Learning Model Lifecycle and Operations

  • MLflow: Track Every Experiment (free)
  • AutoML & Hyperparameter Tuning
  • Training Pipelines: Automate Everything
  • Distributed Training: Scale to Big Data
  • Model Registration & Versioning
  • Model Approval & Responsible AI Gates
  • Deploying Models: Endpoints in Production
  • Drift, Monitoring & Retraining

Domain 3: Design and Implement a GenAIOps Infrastructure

  • Foundry: Hubs, Projects & Platform Setup (free)
  • Network Security & IaC for Foundry
  • Deploying Foundation Models
  • Model Versioning & Production Strategies
  • PromptOps: Design, Compare, Version & Ship

Domain 4: Implement Generative AI Quality Assurance and Observability

  • Evaluation: Datasets, Metrics & Quality Gates (free)
  • Safety Evaluations & Custom Metrics
  • Monitoring GenAI in Production
  • Cost Tracking, Logging & Debugging

Domain 5: Optimize Generative AI Systems and Model Performance

  • RAG Optimization: Better Retrieval, Better Answers (free)
  • Embeddings & Hybrid Search
  • Fine-Tuning: Methods, Data & Production

Domain 4: Implement Generative AI Quality Assurance and Observability (premium, ~13 min read)

Safety Evaluations & Custom Metrics

Quality isn't just about accuracy — it's about safety. Configure risk evaluations for harmful content, build custom metrics for domain-specific needs, and automate evaluation workflows.

Why safety evaluation?

☕ Simple explanation

Safety evaluation is like checking food for allergens, not just whether it tastes good.

A restaurant can serve the most delicious peanut curry in the world — but if a customer has a peanut allergy, delicious doesn’t matter. You need allergen checks SEPARATE from taste tests.

GenAI safety works the same way. Your chatbot might give accurate, relevant, fluent answers — but if one in a thousand responses contains harmful content, that’s a crisis. Safety evaluations catch the “allergens” that quality metrics miss.

Safety evaluations assess whether model outputs contain harmful, offensive, or dangerous content. These are separate from quality metrics (groundedness, relevance, coherence, fluency) and measure a different dimension:

  • Quality metrics ask: “Is this answer good?”
  • Safety metrics ask: “Is this answer harmful?”

An answer can score 5/5 on quality and still fail safety. Azure AI Foundry provides built-in safety evaluators for four risk categories, aligned with Microsoft’s Responsible AI principles.

Safety metric categories

Azure AI Foundry evaluates four categories of content risk:

| Category | What It Detects | Severity Levels | Example |
|---|---|---|---|
| Hate and unfairness | Discriminatory content targeting protected groups | Very Low, Low, Medium, High | Biased hiring recommendations based on ethnicity |
| Violence | Content promoting or describing violence | Very Low, Low, Medium, High | Instructions for causing physical harm |
| Self-harm | Content encouraging self-destructive behaviour | Very Low, Low, Medium, High | Romanticising or instructing harmful behaviour |
| Sexual content | Inappropriate sexual content | Very Low, Low, Medium, High | Explicit content in a professional chatbot |

Each category outputs a severity level (not a 1-5 score). Your quality gate defines which severity levels are acceptable for your application.

💡 Exam tip: Safety severity vs quality scores

Don’t confuse the two scoring systems:

  • Quality metrics (groundedness, relevance, coherence, fluency): scored 1-5
  • Safety metrics (hate, violence, self-harm, sexual): severity levels (Very Low / Low / Medium / High)

The exam may present a scenario mixing both. Remember: a response must pass BOTH quality AND safety thresholds to be acceptable.
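To make the two scoring systems concrete, here is a minimal sketch of a combined gate. The helper below is hypothetical (not part of the Azure SDK); it assumes quality metrics arrive as 1-5 scores and safety metrics as severity labels like the ones the evaluators emit.

```python
# Hypothetical gate helper (not part of the Azure SDK): a response
# passes only if every quality score clears a 1-5 threshold AND every
# safety label stays at or below a maximum severity level.

SEVERITY_ORDER = ["Very low", "Low", "Medium", "High"]

def passes_gate(quality: dict, safety: dict,
                min_quality: float = 4.0, max_severity: str = "Low") -> bool:
    """quality: metric -> 1-5 score; safety: category -> severity label."""
    quality_ok = all(score >= min_quality for score in quality.values())
    limit = SEVERITY_ORDER.index(max_severity)
    safety_ok = all(SEVERITY_ORDER.index(label) <= limit
                    for label in safety.values())
    return quality_ok and safety_ok

# A 5/5-quality answer still fails the gate on a Medium safety finding
print(passes_gate(quality={"groundedness": 5.0, "relevance": 5.0},
                  safety={"violence": "Very low", "hate_unfairness": "Medium"}))
# False
```

Keeping the two checks separate in code mirrors the exam's framing: neither threshold can compensate for the other.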

Configuring safety evaluators

from azure.ai.evaluation import (
    ViolenceEvaluator,
    SelfHarmEvaluator,
    HateUnfairnessEvaluator,
    SexualEvaluator,
)

# Each evaluator uses an AI judge model
violence_eval = ViolenceEvaluator(azure_ai_project=project_scope)
self_harm_eval = SelfHarmEvaluator(azure_ai_project=project_scope)
hate_eval = HateUnfairnessEvaluator(azure_ai_project=project_scope)
sexual_eval = SexualEvaluator(azure_ai_project=project_scope)

# Evaluate a single response
result = violence_eval(
    query="How do I handle a difficult customer?",
    response="Here's a professional de-escalation approach..."
)
# result: {"violence": "Very low", "violence_score": 0, "violence_reason": "..."}

What’s happening:

  • Lines 1-6: Import the four safety evaluator classes
  • Lines 9-12: Create evaluator instances pointing to your Azure AI Foundry project (which hosts the judge model)
  • Lines 15-18: Evaluate a single query-response pair
  • Line 19: Result includes severity label, numeric score (0-7), and reasoning
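The 0-7 numeric score and the severity label describe the same finding at different granularities. The band boundaries in the sketch below are an assumption based on the four documented levels; verify the exact cut-offs against Microsoft Learn.

```python
# Assumed mapping from the 0-7 safety score to the four severity bands
# (an illustration only; confirm the exact boundaries in the official docs).

def severity_label(score: int) -> str:
    if score <= 1:
        return "Very low"
    if score <= 3:
        return "Low"
    if score <= 5:
        return "Medium"
    return "High"

print(severity_label(0))  # Very low
print(severity_label(6))  # High
```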

Running safety evaluations at scale

from azure.ai.evaluation import evaluate

# Combine quality + safety evaluators in one run
results = evaluate(
    data="eval_dataset.jsonl",
    evaluators={
        "groundedness": groundedness_eval,
        "relevance": relevance_eval,
        "violence": violence_eval,
        "self_harm": self_harm_eval,
        "hate_unfairness": hate_eval,
        "sexual": sexual_eval,
    },
    evaluator_config={
        "violence": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}",
            }
        },
        "self_harm": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}",
            }
        },
        "hate_unfairness": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}",
            }
        },
        "sexual": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}",
            }
        },
    },
)

# Check for high-severity violence findings (repeat per category)
safety_flags = [r for r in results["rows"] if r.get("violence_score", 0) >= 5]

What’s happening:

  • Lines 4-40: Run quality AND safety evaluators together against the full dataset
  • Lines 14-39: Safety evaluators only need query and response (no context or ground truth)
  • Line 43: Filter for high-severity safety findings (scores of 5 or more on the 0-7 scale fall into the Medium and High bands)
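Since the filter above only inspects violence_score, a small post-processing helper can tally findings across all four categories. This is a hypothetical sketch shaped around the row format the evaluation run produces (one "<category>_score" key per evaluator).

```python
from collections import Counter

# Hypothetical helper: count rows per safety category whose
# "<category>_score" meets or exceeds a severity-score threshold.
CATEGORIES = ["violence", "self_harm", "hate_unfairness", "sexual"]

def safety_summary(rows: list, min_score: int = 4) -> Counter:
    flags = Counter()
    for row in rows:
        for cat in CATEGORIES:
            if row.get(f"{cat}_score", 0) >= min_score:
                flags[cat] += 1
    return flags

rows = [
    {"violence_score": 0, "sexual_score": 0},
    {"violence_score": 5, "hate_unfairness_score": 2},
]
print(safety_summary(rows))  # Counter({'violence': 1})
```

A per-category tally like this is what you would surface in a report or a CI log, rather than a single pass/fail bit.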

Built-in vs custom metrics

Built-in metrics vs custom metrics
| Feature | Scope | Setup Time | Maintenance | Best For |
|---|---|---|---|---|
| Built-in Metrics | General quality + safety (8 evaluators) | Minutes — import and configure | None — Microsoft maintains them | Standard GenAI quality gates |
| Custom Metrics | Domain-specific requirements | Hours — write evaluator logic | You maintain and update | Regulated industries, specialised domains |

Building custom evaluation metrics

Sometimes the built-in metrics aren’t enough. A financial chatbot needs to check regulatory compliance. A medical chatbot needs to verify drug interaction warnings. These need custom evaluators.

Custom evaluator as a callable class

from azure.ai.evaluation import evaluate

# Custom evaluator as a callable class
class ComplianceEvaluator:
    def __init__(self, model_config):
        self.model_config = model_config

    def __call__(self, *, query, response, **kwargs):
        # Custom logic to check financial compliance
        # Returns a dict with score and reasoning
        has_disclaimer = "not financial advice" in response.lower()
        mentions_risk = any(w in response.lower() for w in ["risk", "past performance"])
        score = 1.0 if (has_disclaimer and mentions_risk) else 0.0
        return {"compliance_score": score}

# Use in an evaluation run
results = evaluate(
    data="eval_dataset.jsonl",
    evaluators={
        "compliance": ComplianceEvaluator(model_config=model_config),
    },
    evaluator_config={
        "compliance": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}",
            }
        },
    },
)

What’s happening:

  • Lines 4-14: A callable class that implements custom evaluation logic. The __call__ method receives keyword arguments matching the column mapping.
  • The evaluator checks for required disclaimer phrases and returns a score (deterministic, no LLM needed)
  • Lines 17-30: The custom evaluator plugs into evaluate() just like built-in evaluators, with column_mapping in evaluator_config
  • This pattern lets you evaluate anything: tone, brand voice, regulatory compliance, clinical accuracy
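The same callable-class pattern also works for evaluators that delegate scoring to an LLM judge. The sketch below is hypothetical: the prompt, the metric name, and the injected judge callable are all illustrative, with a stub standing in for a real model deployment.

```python
# Hedged sketch of a prompt-based custom evaluator: __call__ builds a
# judging prompt and asks an LLM to return a 1-5 score. The judge is
# injected as a callable so a real model client can be swapped in.

JUDGE_PROMPT = (
    "Rate the regulatory tone of this financial answer from 1 (poor) "
    "to 5 (excellent). Reply with only the number.\n\nAnswer: {response}"
)

class RegulatoryToneEvaluator:
    def __init__(self, judge):
        self.judge = judge  # callable: prompt string -> model reply string

    def __call__(self, *, response, **kwargs):
        reply = self.judge(JUDGE_PROMPT.format(response=response))
        return {"regulatory_tone_score": int(reply.strip())}

# Stub judge for illustration; in practice this would call your deployment
fake_judge = lambda prompt: "4"
print(RegulatoryToneEvaluator(fake_judge)(response="Markets carry risk..."))
# {'regulatory_tone_score': 4}
```

Unlike the deterministic disclaimer check, an LLM judge can score subjective qualities (tone, nuance) at the cost of some run-to-run variability.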

Code-based custom evaluator

import re

def disclaimer_check(response: str, **kwargs) -> dict:
    """Check if required financial disclaimers are present."""
    required_phrases = [
        "not financial advice",
        "consult a qualified",
        "past performance",
    ]
    found = sum(1 for p in required_phrases if p.lower() in response.lower())
    score = round((found / len(required_phrases)) * 5)

    return {
        "disclaimer_score": score,
        "disclaimer_reason": f"Found {found}/{len(required_phrases)} required disclaimers"
    }

What’s happening:

  • Lines 3-16: A simple Python function that checks for required phrases in the response
  • Returns a score and reason, matching the evaluator output format
  • Code-based evaluators are deterministic — same input always gives same output (unlike LLM-based)

Scenario: Dr. Fatima builds a compliance evaluator

Dr. Fatima Al-Rashid at Meridian Financial needs their advice chatbot to meet banking regulations. James Chen (CISO) requires every financial response to include:

  1. A disclaimer that it’s not personalised financial advice
  2. A recommendation to consult a qualified advisor
  3. Risk warnings for any investment-related content

Dr. Fatima builds two custom evaluators:

  • Prompt-based: An LLM judge that scores overall regulatory tone (catches subtle compliance issues)
  • Code-based: A deterministic check for mandatory disclaimer phrases (guaranteed detection)

Both run alongside the standard quality and safety evaluators. The chatbot must pass ALL evaluators before deployment:

| Evaluator | Type | Threshold |
|---|---|---|
| Groundedness | Built-in quality | At least 4.0 |
| Relevance | Built-in quality | At least 4.5 |
| Violence | Built-in safety | Very Low only |
| Hate/Unfairness | Built-in safety | Very Low only |
| Compliance Score | Custom prompt-based | At least 4.0 |
| Disclaimer Check | Custom code-based | At least 4.0 |
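The table above can be expressed as a single gate function. This is an illustrative sketch using the scenario's thresholds; the metric keys are assumed to match the evaluator names used earlier in this lesson.

```python
# Sketch of the deployment gate from the scenario (illustrative):
# numeric metrics need a minimum score, safety labels must be "Very low".

NUMERIC_MIN = {
    "groundedness": 4.0,
    "relevance": 4.5,
    "compliance_score": 4.0,
    "disclaimer_score": 4.0,
}
SAFETY_VERY_LOW_ONLY = ("violence", "hate_unfairness")

def approve(metrics: dict) -> bool:
    numeric_ok = all(metrics.get(name, 0) >= minimum
                     for name, minimum in NUMERIC_MIN.items())
    safety_ok = all(metrics.get(cat) == "Very low"
                    for cat in SAFETY_VERY_LOW_ONLY)
    return numeric_ok and safety_ok

print(approve({
    "groundedness": 4.6, "relevance": 4.8,
    "compliance_score": 4.2, "disclaimer_score": 5,
    "violence": "Very low", "hate_unfairness": "Very low",
}))  # True
```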

Automated evaluation workflows

Evaluation shouldn’t be a manual step. Integrate it into your CI/CD pipeline so every model update is automatically evaluated before deployment.

# .github/workflows/evaluate-model.yml
name: GenAI Evaluation Gate
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'config/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run evaluation suite
        run: |
          python run_evaluation.py \
            --dataset eval_dataset.jsonl \
            --thresholds groundedness=4.0,relevance=4.5,safety=low

      - name: Gate check
        run: |
          python check_thresholds.py --results evaluation_results.json
          # Exits with code 1 if any threshold fails → blocks the PR

What’s happening:

  • Lines 3-7: Trigger evaluation whenever prompts or config files change in a PR
  • Lines 16-19: Run the full evaluation suite against the test dataset
  • Lines 21-24: A gate script checks results against thresholds — if anything fails, the PR is blocked
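The gate script itself can be very small. Below is a hypothetical sketch of what check_thresholds.py might do; the metric names and result shape are assumptions, and the real script would load the values from evaluation_results.json and call sys.exit() with the returned code.

```python
# Hypothetical sketch of a threshold gate script: compare aggregate
# metric values against minimums and return a process exit code
# (0 = pass, 1 = fail) so CI can block the PR.

THRESHOLDS = {"groundedness": 4.0, "relevance": 4.5}

def check(metrics: dict) -> int:
    failures = [name for name, minimum in THRESHOLDS.items()
                if metrics.get(name, 0) < minimum]
    for name in failures:
        print(f"FAIL: {name}={metrics.get(name)} (need >= {THRESHOLDS[name]})")
    return 1 if failures else 0

# In the real script: load metrics from evaluation_results.json and
# call sys.exit(check(metrics)) so any failure blocks the merge.
print(check({"groundedness": 4.3, "relevance": 4.6}))  # 0
```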

💡 Exam tip: Always run quality AND safety

The exam expects you to know that quality and safety evaluations are separate concerns that must both pass. A common trap is a scenario where only quality metrics are checked.

The correct evaluation pipeline always includes:

  1. Quality metrics (groundedness, relevance, coherence, fluency)
  2. Safety metrics (hate, violence, self-harm, sexual)
  3. Custom metrics (if domain requires them)

All three layers must pass before deployment.

Key terms flashcards

Question

What are the four safety evaluation categories?


Answer

Hate and unfairness, Violence, Self-harm, Sexual content. Each is rated by severity level (Very Low / Low / Medium / High), not a 1-5 quality score.


Question

Built-in vs custom evaluators — when to use each?


Answer

Built-in: standard quality (groundedness, relevance, coherence, fluency) and safety (4 categories). Custom: domain-specific needs like regulatory compliance, clinical accuracy, or brand voice. Use BOTH together.


Question

What are the two types of custom evaluators?


Answer

Class-based callable (Python class with __call__ — deterministic or LLM-powered, reusable, can hold config) and Function-based (simple Python function — deterministic, guaranteed detection, less nuanced). Best practice: use both together for comprehensive coverage.


Question

How do you automate evaluation in CI/CD?


Answer

Trigger evaluation on PR changes to prompts/config, run the evaluation suite against a test dataset, check results against thresholds, and block the PR if any threshold fails.


Knowledge check


Meridian's chatbot scores 4.8 on groundedness and 4.9 on relevance, but the safety evaluation flags a 'Medium' severity for hate/unfairness on 3 of 200 test cases. Should Dr. Fatima approve deployment?


Zara needs to verify that Atlas's chatbot always includes a confidentiality notice when discussing client projects. Which evaluation approach should she use?



Next up: Monitoring GenAI in Production — keeping your live system healthy after deployment.



© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.