Safety Evaluations & Custom Metrics
Quality isn't just about accuracy — it's about safety. Configure risk evaluations for harmful content, build custom metrics for domain-specific needs, and automate evaluation workflows.
Why safety evaluation?
Safety evaluation is like checking food for allergens, not just whether it tastes good.
A restaurant can serve the most delicious peanut curry in the world — but if a customer has a peanut allergy, delicious doesn’t matter. You need allergen checks SEPARATE from taste tests.
GenAI safety works the same way. Your chatbot might give accurate, relevant, fluent answers — but if one in a thousand responses contains harmful content, that’s a crisis. Safety evaluations catch the “allergens” that quality metrics miss.
Safety metric categories
Azure AI Foundry evaluates four categories of content risk:
| Category | What It Detects | Severity Levels | Example |
|---|---|---|---|
| Hate and unfairness | Discriminatory content targeting protected groups | Very Low, Low, Medium, High | Biased hiring recommendations based on ethnicity |
| Violence | Content promoting or describing violence | Very Low, Low, Medium, High | Instructions for causing physical harm |
| Self-harm | Content encouraging self-destructive behaviour | Very Low, Low, Medium, High | Romanticising or instructing harmful behaviour |
| Sexual content | Inappropriate sexual content | Very Low, Low, Medium, High | Explicit content in a professional chatbot |
Each category outputs a severity level (not a 1-5 score). Your quality gate defines which severity levels are acceptable for your application.
Exam tip: Safety severity vs quality scores
Don’t confuse the two scoring systems:
- Quality metrics (groundedness, relevance, coherence, fluency): scored 1-5
- Safety metrics (hate, violence, self-harm, sexual): severity levels (Very Low / Low / Medium / High)
The exam may present a scenario mixing both. Remember: a response must pass BOTH quality AND safety thresholds to be acceptable.
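As a concrete sketch of that rule, a gate has to apply a numeric threshold to the quality scores and a label allow-list to the safety severities. The helper and row fields below are hypothetical illustrations, not part of the SDK:

```python
# Hypothetical gate: quality metrics are 1-5 scores, safety metrics are labels.
ALLOWED_SEVERITIES = {"Very low"}  # a label allow-list, not a numeric cut-off

def passes_gate(row: dict) -> bool:
    # Quality: numeric comparison against a minimum score
    quality_ok = row["groundedness"] >= 4.0 and row["relevance"] >= 4.0
    # Safety: severity label must be in the allow-list
    safety_ok = all(row[m] in ALLOWED_SEVERITIES
                    for m in ("violence", "self_harm", "hate_unfairness", "sexual"))
    return quality_ok and safety_ok

row = {"groundedness": 4.8, "relevance": 4.9,
       "violence": "Very low", "self_harm": "Very low",
       "hate_unfairness": "Medium", "sexual": "Very low"}
print(passes_gate(row))  # False: excellent quality, but one safety metric fails
```

Note that the response fails despite near-perfect quality scores, which is exactly the trap the exam scenario sets.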
Configuring safety evaluators
from azure.ai.evaluation import (
    ViolenceEvaluator,
    SelfHarmEvaluator,
    HateUnfairnessEvaluator,
    SexualEvaluator,
)

# Each evaluator uses an AI judge model hosted by your Azure AI Foundry project.
# project_scope identifies that project, e.g. a dict with
# subscription_id, resource_group_name, and project_name.
violence_eval = ViolenceEvaluator(azure_ai_project=project_scope)
self_harm_eval = SelfHarmEvaluator(azure_ai_project=project_scope)
hate_eval = HateUnfairnessEvaluator(azure_ai_project=project_scope)
sexual_eval = SexualEvaluator(azure_ai_project=project_scope)

# Evaluate a single response
result = violence_eval(
    query="How do I handle a difficult customer?",
    response="Here's a professional de-escalation approach..."
)
# result: {"violence": "Very low", "violence_score": 0, "violence_reason": "..."}
What’s happening:
- The four safety evaluator classes are imported from `azure.ai.evaluation`
- Each evaluator instance points to your Azure AI Foundry project, which hosts the judge model
- A single query-response pair is evaluated
- The result includes a severity label, a numeric score (0-7), and the judge’s reasoning
Running safety evaluations at scale
from azure.ai.evaluation import evaluate

# Combine quality + safety evaluators in one run
# (the evaluator instances are created as shown earlier)
results = evaluate(
    data="eval_dataset.jsonl",
    evaluators={
        "groundedness": groundedness_eval,
        "relevance": relevance_eval,
        "violence": violence_eval,
        "self_harm": self_harm_eval,
        "hate_unfairness": hate_eval,
        "sexual": sexual_eval,
    },
    evaluator_config={
        "violence": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}",
            }
        },
        "self_harm": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}",
            }
        },
        "hate_unfairness": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}",
            }
        },
        "sexual": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}",
            }
        },
    },
)

# Check for any high-severity findings
safety_flags = [r for r in results["rows"] if r.get("violence_score", 0) >= 5]
What’s happening:
- Quality AND safety evaluators run together against the full dataset in a single call
- The safety evaluators only need `query` and `response` (no context or ground truth)
- The final line filters for high-severity findings: on the 0-7 scale, 4-5 maps to Medium and 6-7 to High, so a score of 5 or more catches upper-Medium and High results
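The relationship between the 0-7 numeric score and the four severity bands can be sketched as a small helper. This is an illustrative mapping following the commonly documented band layout (0-1 Very Low, 2-3 Low, 4-5 Medium, 6-7 High), not an SDK function:

```python
def severity_label(score: int) -> str:
    """Map a 0-7 harm score to its severity band (assumed 0-1/2-3/4-5/6-7 layout)."""
    if score <= 1:
        return "Very low"
    if score <= 3:
        return "Low"
    if score <= 5:
        return "Medium"
    return "High"

print(severity_label(0), severity_label(5), severity_label(7))
# Very low Medium High
```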
Built-in vs custom metrics
| Feature | Scope | Setup Time | Maintenance | Best For |
|---|---|---|---|---|
| Built-in Metrics | General quality + safety (8 evaluators) | Minutes — import and configure | None — Microsoft maintains them | Standard GenAI quality gates |
| Custom Metrics | Domain-specific requirements | Hours — write evaluator logic | You maintain and update | Regulated industries, specialised domains |
Building custom evaluation metrics
Sometimes the built-in metrics aren’t enough. A financial chatbot needs to check regulatory compliance. A medical chatbot needs to verify drug interaction warnings. These need custom evaluators.
Custom evaluator as a callable class
from azure.ai.evaluation import evaluate

# Custom evaluator as a callable class
class ComplianceEvaluator:
    def __init__(self, model_config):
        # model_config is unused here (the check is deterministic) but kept
        # so the constructor matches the shape of LLM-based evaluators
        self.model_config = model_config

    def __call__(self, *, query, response, **kwargs):
        # Custom logic to check financial compliance:
        # require a disclaimer and at least one risk-related phrase
        has_disclaimer = "not financial advice" in response.lower()
        mentions_risk = any(w in response.lower() for w in ["risk", "past performance"])
        score = 1.0 if (has_disclaimer and mentions_risk) else 0.0
        return {"compliance_score": score}

# Use in an evaluation run
results = evaluate(
    data="eval_dataset.jsonl",
    evaluators={
        "compliance": ComplianceEvaluator(model_config=model_config),
    },
    evaluator_config={
        "compliance": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}",
            }
        },
    },
)
What’s happening:
- A callable class implements the custom evaluation logic; the `__call__` method receives keyword arguments matching the column mapping
- The evaluator checks for required disclaimer phrases and returns a score (deterministic, no LLM needed)
- The custom evaluator plugs into `evaluate()` just like built-in evaluators, with `column_mapping` in `evaluator_config`
- This pattern lets you evaluate anything: tone, brand voice, regulatory compliance, clinical accuracy
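Because the evaluator protocol is just “a callable that accepts keyword arguments and returns a dict”, you can unit-test a custom evaluator directly, without an `evaluate()` run. Here is a minimal standalone sketch (the `ToneEvaluator` class is a hypothetical toy, not part of the SDK):

```python
class ToneEvaluator:
    """Toy custom evaluator: flags responses that shout at the user."""

    def __call__(self, *, query: str, response: str, **kwargs) -> dict:
        # str.isupper() is True only if there is at least one cased character
        # and no lowercase ones — a crude all-caps detector
        shouting = response.isupper()
        return {"tone_score": 0.0 if shouting else 1.0}

tone = ToneEvaluator()
print(tone(query="Where is my refund?", response="IT IS YOUR FAULT."))
# {'tone_score': 0.0}
```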
Code-based custom evaluator
def disclaimer_check(response: str, **kwargs) -> dict:
    """Check if required financial disclaimers are present."""
    required_phrases = [
        "not financial advice",
        "consult a qualified",
        "past performance",
    ]
    found = sum(1 for p in required_phrases if p.lower() in response.lower())
    score = round((found / len(required_phrases)) * 5)
    return {
        "disclaimer_score": score,
        "disclaimer_reason": f"Found {found}/{len(required_phrases)} required disclaimers",
    }
What’s happening:
- A plain Python function checks for required phrases in the response
- It returns a score and a reason, matching the evaluator output format
- Code-based evaluators are deterministic: the same input always gives the same output (unlike LLM-based judges)
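To sanity-check the function, here is a standalone run against one compliant and one non-compliant response (the function body is repeated, slightly trimmed, so the snippet executes on its own):

```python
def disclaimer_check(response: str, **kwargs) -> dict:
    """Score 0-5 based on how many required disclaimer phrases are present."""
    required_phrases = ["not financial advice", "consult a qualified", "past performance"]
    found = sum(1 for p in required_phrases if p.lower() in response.lower())
    return {"disclaimer_score": round((found / len(required_phrases)) * 5)}

compliant = ("This is general information, not financial advice. "
             "Please consult a qualified advisor; past performance "
             "does not guarantee future results.")

print(disclaimer_check(compliant)["disclaimer_score"])        # 5 — all phrases found
print(disclaimer_check("Buy the stock.")["disclaimer_score"])  # 0 — none found
```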
Scenario: Dr. Fatima builds a compliance evaluator
Dr. Fatima Al-Rashid at Meridian Financial needs their advice chatbot to meet banking regulations. James Chen (CISO) requires every financial response to include:
- A disclaimer that it’s not personalised financial advice
- A recommendation to consult a qualified advisor
- Risk warnings for any investment-related content
Dr. Fatima builds two custom evaluators:
- Prompt-based: An LLM judge that scores overall regulatory tone (catches subtle compliance issues)
- Code-based: A deterministic check for mandatory disclaimer phrases (guaranteed detection)
Both run alongside the standard quality and safety evaluators. The chatbot must pass ALL evaluators before deployment:
| Evaluator | Type | Threshold |
|---|---|---|
| Groundedness | Built-in quality | At least 4.0 |
| Relevance | Built-in quality | At least 4.5 |
| Violence | Built-in safety | Very Low only |
| Hate/Unfairness | Built-in safety | Very Low only |
| Compliance Score | Custom prompt-based | At least 4.0 |
| Disclaimer Check | Custom code-based | At least 4.0 |
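The table above can be enforced mechanically: numeric thresholds for the scored evaluators, a label allow-list for the safety ones. A hypothetical sketch of that deployment gate (metric keys and row shape are assumptions for illustration):

```python
# Score-based evaluators and their minimums (Dr. Fatima's table)
NUMERIC_MINIMUMS = {
    "groundedness": 4.0,
    "relevance": 4.5,
    "compliance_score": 4.0,
    "disclaimer_score": 4.0,
}
# Label-based safety evaluators and their allowed severities
SEVERITY_ALLOWED = {
    "violence": {"Very low"},
    "hate_unfairness": {"Very low"},
}

def approve_for_deployment(row: dict) -> bool:
    numeric_ok = all(row[k] >= minimum for k, minimum in NUMERIC_MINIMUMS.items())
    severity_ok = all(row[k] in allowed for k, allowed in SEVERITY_ALLOWED.items())
    return numeric_ok and severity_ok

row = {"groundedness": 4.6, "relevance": 4.7, "compliance_score": 4.5,
       "disclaimer_score": 5, "violence": "Very low", "hate_unfairness": "Very low"}
print(approve_for_deployment(row))  # True — every evaluator clears its threshold
```

Keeping the thresholds in data rather than code means the gate can be tightened later without touching the logic.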
Automated evaluation workflows
Evaluation shouldn’t be a manual step. Integrate it into your CI/CD pipeline so every model update is automatically evaluated before deployment.
# .github/workflows/evaluate-model.yml
name: GenAI Evaluation Gate

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'config/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run evaluation suite
        run: |
          python run_evaluation.py \
            --dataset eval_dataset.jsonl \
            --thresholds groundedness=4.0,relevance=4.5,safety=low
      - name: Gate check
        run: |
          python check_thresholds.py --results evaluation_results.json
          # Exits with code 1 if any threshold fails → blocks the PR
What’s happening:
- The workflow triggers whenever prompt or config files change in a pull request
- The evaluation suite runs against the test dataset with explicit thresholds
- A gate script checks the results against those thresholds; if anything fails, it exits non-zero and the PR is blocked
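The contents of `check_thresholds.py` aren't shown in the workflow; one possible implementation is simply "load results, compare to minimums, exit non-zero on failure". The JSON shape and metric names below are assumptions:

```python
import json

def failing_metrics(metrics: dict, thresholds: dict) -> list:
    """Return the names of metrics whose aggregate score falls below its minimum."""
    return [name for name, minimum in thresholds.items()
            if metrics.get(name, 0.0) < minimum]

def gate(results_path: str, thresholds: dict) -> int:
    """Exit code for the CI step: 0 when every threshold is met, 1 otherwise."""
    with open(results_path) as f:
        metrics = json.load(f)  # assumed shape: {"groundedness": 4.2, "relevance": 4.1}
    failed = failing_metrics(metrics, thresholds)
    if failed:
        print("Evaluation gate failed:", ", ".join(failed))
        return 1  # a non-zero exit status is what blocks the PR
    return 0

# In the actual script, the last line would be something like:
#   sys.exit(gate("evaluation_results.json", {"groundedness": 4.0, "relevance": 4.5}))
print(failing_metrics({"groundedness": 4.2, "relevance": 4.1},
                      {"groundedness": 4.0, "relevance": 4.5}))
# ['relevance']
```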
Exam tip: Always run quality AND safety
The exam expects you to know that quality and safety evaluations are separate concerns that must both pass. A common trap is a scenario where only quality metrics are checked.
The correct evaluation pipeline always includes:
- Quality metrics (groundedness, relevance, coherence, fluency)
- Safety metrics (hate, violence, self-harm, sexual)
- Custom metrics (if domain requires them)
All three layers must pass before deployment.
Knowledge check
Meridian's chatbot scores 4.8 on groundedness and 4.9 on relevance, but the safety evaluation flags a 'Medium' severity for hate/unfairness on 3 of 200 test cases. Should Dr. Fatima approve deployment?
Zara needs to verify that Atlas's chatbot always includes a confidentiality notice when discussing client projects. Which evaluation approach should she use?
Next up: Monitoring GenAI in Production — keeping your live system healthy after deployment.