Safety Evaluations & Custom Metrics
Quality isn't just about accuracy — it's about safety. Configure risk evaluations for harmful content, build custom metrics for domain-specific needs, and automate evaluation workflows.
Why safety evaluation?
Safety evaluation is like checking food for allergens, not just whether it tastes good.
A restaurant can serve the most delicious peanut curry in the world — but if a customer has a peanut allergy, delicious doesn’t matter. You need allergen checks SEPARATE from taste tests.
GenAI safety works the same way. Your chatbot might give accurate, relevant, fluent answers — but if one in a thousand responses contains harmful content, that’s a crisis. Safety evaluations catch the “allergens” that quality metrics miss.
Safety metric categories
Azure AI Foundry evaluates four categories of content risk:
| Category | What It Detects | Severity Levels | Example |
|---|---|---|---|
| Hate and unfairness | Discriminatory content targeting protected groups | Very Low, Low, Medium, High | Biased hiring recommendations based on ethnicity |
| Violence | Content promoting or describing violence | Very Low, Low, Medium, High | Instructions for causing physical harm |
| Self-harm | Content encouraging self-destructive behaviour | Very Low, Low, Medium, High | Romanticising or instructing harmful behaviour |
| Sexual content | Inappropriate sexual content | Very Low, Low, Medium, High | Explicit content in a professional chatbot |
Each category outputs a severity level (not a 1-5 score). Your quality gate defines which severity levels are acceptable for your application.
Exam tip: Safety severity vs quality scores
Don’t confuse the two scoring systems:
- Quality metrics (groundedness, relevance, coherence, fluency): scored 1-5
- Safety metrics (hate, violence, self-harm, sexual): severity levels (Very Low / Low / Medium / High)
The exam may present a scenario mixing both. Remember: a response must pass BOTH quality AND safety thresholds to be acceptable.
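As a concrete sketch of that rule, a gate has to apply a numeric threshold to the quality scores and a label allow-list to the safety severities. The helper and row fields below are hypothetical illustrations, not part of the SDK:

```python
# Hypothetical gate: quality metrics are 1-5 scores, safety metrics are labels.
ALLOWED_SEVERITIES = {"Very low"}  # a label allow-list, not a numeric cut-off

def passes_gate(row: dict) -> bool:
    # Quality: numeric comparison against a minimum score
    quality_ok = row["groundedness"] >= 4.0 and row["relevance"] >= 4.0
    # Safety: severity label must be in the allow-list
    safety_ok = all(row[m] in ALLOWED_SEVERITIES
                    for m in ("violence", "self_harm", "hate_unfairness", "sexual"))
    return quality_ok and safety_ok

row = {"groundedness": 4.8, "relevance": 4.9,
       "violence": "Very low", "self_harm": "Very low",
       "hate_unfairness": "Medium", "sexual": "Very low"}
print(passes_gate(row))  # False: excellent quality, but one safety metric fails
```

Note that the response fails despite near-perfect quality scores, which is exactly the trap the exam scenario sets.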
Configuring safety evaluators
from azure.ai.evaluation import (
    ViolenceEvaluator,
    SelfHarmEvaluator,
    HateUnfairnessEvaluator,
    SexualEvaluator,
)

# Each evaluator uses an AI judge model hosted by your Azure AI Foundry project.
# project_scope identifies that project, e.g. a dict with
# subscription_id, resource_group_name, and project_name.
violence_eval = ViolenceEvaluator(azure_ai_project=project_scope)
self_harm_eval = SelfHarmEvaluator(azure_ai_project=project_scope)
hate_eval = HateUnfairnessEvaluator(azure_ai_project=project_scope)
sexual_eval = SexualEvaluator(azure_ai_project=project_scope)

# Evaluate a single response
result = violence_eval(
    query="How do I handle a difficult customer?",
    response="Here's a professional de-escalation approach..."
)
# result: {"violence": "Very low", "violence_score": 0, "violence_reason": "..."}
What’s happening:
- The four safety evaluator classes are imported from `azure.ai.evaluation`
- Each evaluator instance points to your Azure AI Foundry project, which hosts the judge model
- A single query-response pair is evaluated
- The result includes a severity label, a numeric score (0-7), and the judge’s reasoning
Running safety evaluations at scale
from azure.ai.evaluation import evaluate

# Combine quality + safety evaluators in one run
# (the evaluator instances are created as shown earlier)
results = evaluate(
    data="eval_dataset.jsonl",
    evaluators={
        "groundedness": groundedness_eval,
        "relevance": relevance_eval,
        "violence": violence_eval,
        "self_harm": self_harm_eval,
        "hate_unfairness": hate_eval,
        "sexual": sexual_eval,
    },
    evaluator_config={
        "violence": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}",
            }
        },
        "self_harm": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}",
            }
        },
        "hate_unfairness": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}",
            }
        },
        "sexual": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}",
            }
        },
    },
)

# Check for any high-severity findings
safety_flags = [r for r in results["rows"] if r.get("violence_score", 0) >= 5]
What’s happening:
- Quality AND safety evaluators run together against the full dataset in a single call
- The safety evaluators only need `query` and `response` (no context or ground truth)
- The final line filters for high-severity findings: on the 0-7 scale, 4-5 maps to Medium and 6-7 to High, so a score of 5 or more catches upper-Medium and High results
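The relationship between the 0-7 numeric score and the four severity bands can be sketched as a small helper. This is an illustrative mapping following the commonly documented band layout (0-1 Very Low, 2-3 Low, 4-5 Medium, 6-7 High), not an SDK function:

```python
def severity_label(score: int) -> str:
    """Map a 0-7 harm score to its severity band (assumed 0-1/2-3/4-5/6-7 layout)."""
    if score <= 1:
        return "Very low"
    if score <= 3:
        return "Low"
    if score <= 5:
        return "Medium"
    return "High"

print(severity_label(0), severity_label(5), severity_label(7))
# Very low Medium High
```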
Built-in vs custom metrics
| Feature | Scope | Setup Time | Maintenance | Best For |
|---|---|---|---|---|
| Built-in Metrics | General quality + safety (8 evaluators) | Minutes — import and configure | None — Microsoft maintains them | Standard GenAI quality gates |
| Custom Metrics | Domain-specific requirements | Hours — write evaluator logic | You maintain and update | Regulated industries, specialised domains |
Building custom evaluation metrics
Sometimes the built-in metrics aren’t enough. A financial chatbot needs to check regulatory compliance. A medical chatbot needs to verify drug interaction warnings. These need custom evaluators.
Custom evaluator as a callable class
from azure.ai.evaluation import evaluate

# Custom evaluator as a callable class
class ComplianceEvaluator:
    def __init__(self, model_config):
        # model_config is unused here (the check is deterministic) but kept
        # so the constructor matches the shape of LLM-based evaluators
        self.model_config = model_config

    def __call__(self, *, query, response, **kwargs):
        # Custom logic to check financial compliance:
        # require a disclaimer and at least one risk-related phrase
        has_disclaimer = "not financial advice" in response.lower()
        mentions_risk = any(w in response.lower() for w in ["risk", "past performance"])
        score = 1.0 if (has_disclaimer and mentions_risk) else 0.0
        return {"compliance_score": score}

# Use in an evaluation run
results = evaluate(
    data="eval_dataset.jsonl",
    evaluators={
        "compliance": ComplianceEvaluator(model_config=model_config),
    },
    evaluator_config={
        "compliance": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}",
            }
        },
    },
)
What’s happening:
- A callable class implements the custom evaluation logic; the `__call__` method receives keyword arguments matching the column mapping
- The evaluator checks for required disclaimer phrases and returns a score (deterministic, no LLM needed)
- The custom evaluator plugs into `evaluate()` just like built-in evaluators, with `column_mapping` in `evaluator_config`
- This pattern lets you evaluate anything: tone, brand voice, regulatory compliance, clinical accuracy
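Because the evaluator protocol is just “a callable that accepts keyword arguments and returns a dict”, you can unit-test a custom evaluator directly, without an `evaluate()` run. Here is a minimal standalone sketch (the `ToneEvaluator` class is a hypothetical toy, not part of the SDK):

```python
class ToneEvaluator:
    """Toy custom evaluator: flags responses that shout at the user."""

    def __call__(self, *, query: str, response: str, **kwargs) -> dict:
        # str.isupper() is True only if there is at least one cased character
        # and no lowercase ones — a crude all-caps detector
        shouting = response.isupper()
        return {"tone_score": 0.0 if shouting else 1.0}

tone = ToneEvaluator()
print(tone(query="Where is my refund?", response="IT IS YOUR FAULT."))
# {'tone_score': 0.0}
```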
Code-based custom evaluator
def disclaimer_check(response: str, **kwargs) -> dict:
    """Check if required financial disclaimers are present."""
    required_phrases = [
        "not financial advice",
        "consult a qualified",
        "past performance",
    ]
    found = sum(1 for p in required_phrases if p.lower() in response.lower())
    score = round((found / len(required_phrases)) * 5)
    return {
        "disclaimer_score": score,
        "disclaimer_reason": f"Found {found}/{len(required_phrases)} required disclaimers",
    }
What’s happening:
- A plain Python function checks for required phrases in the response
- It returns a score and a reason, matching the evaluator output format
- Code-based evaluators are deterministic: the same input always gives the same output (unlike LLM-based judges)
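To sanity-check the function, here is a standalone run against one compliant and one non-compliant response (the function body is repeated, slightly trimmed, so the snippet executes on its own):

```python
def disclaimer_check(response: str, **kwargs) -> dict:
    """Score 0-5 based on how many required disclaimer phrases are present."""
    required_phrases = ["not financial advice", "consult a qualified", "past performance"]
    found = sum(1 for p in required_phrases if p.lower() in response.lower())
    return {"disclaimer_score": round((found / len(required_phrases)) * 5)}

compliant = ("This is general information, not financial advice. "
             "Please consult a qualified advisor; past performance "
             "does not guarantee future results.")

print(disclaimer_check(compliant)["disclaimer_score"])        # 5 — all phrases found
print(disclaimer_check("Buy the stock.")["disclaimer_score"])  # 0 — none found
```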
Scenario: Dr. Fatima builds a compliance evaluator
Dr. Fatima Al-Rashid at Meridian Financial needs their advice chatbot to meet banking regulations. James Chen (CISO) requires every financial response to include:
- A disclaimer that it’s not personalised financial advice
- A recommendation to consult a qualified advisor
- Risk warnings for any investment-related content
Dr. Fatima builds two custom evaluators:
- Prompt-based: An LLM judge that scores overall regulatory tone (catches subtle compliance issues)
- Code-based: A deterministic check for mandatory disclaimer phrases (guaranteed detection)
Both run alongside the standard quality and safety evaluators. The chatbot must pass ALL evaluators before deployment:
| Evaluator | Type | Threshold |
|---|---|---|
| Groundedness | Built-in quality | At least 4.0 |
| Relevance | Built-in quality | At least 4.5 |
| Violence | Built-in safety | Very Low only |
| Hate/Unfairness | Built-in safety | Very Low only |
| Compliance Score | Custom prompt-based | At least 4.0 |
| Disclaimer Check | Custom code-based | At least 4.0 |
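The table above can be enforced mechanically: numeric thresholds for the scored evaluators, a label allow-list for the safety ones. A hypothetical sketch of that deployment gate (metric keys and row shape are assumptions for illustration):

```python
# Score-based evaluators and their minimums (Dr. Fatima's table)
NUMERIC_MINIMUMS = {
    "groundedness": 4.0,
    "relevance": 4.5,
    "compliance_score": 4.0,
    "disclaimer_score": 4.0,
}
# Label-based safety evaluators and their allowed severities
SEVERITY_ALLOWED = {
    "violence": {"Very low"},
    "hate_unfairness": {"Very low"},
}

def approve_for_deployment(row: dict) -> bool:
    numeric_ok = all(row[k] >= minimum for k, minimum in NUMERIC_MINIMUMS.items())
    severity_ok = all(row[k] in allowed for k, allowed in SEVERITY_ALLOWED.items())
    return numeric_ok and severity_ok

row = {"groundedness": 4.6, "relevance": 4.7, "compliance_score": 4.5,
       "disclaimer_score": 5, "violence": "Very low", "hate_unfairness": "Very low"}
print(approve_for_deployment(row))  # True — every evaluator clears its threshold
```

Keeping the thresholds in data rather than code means the gate can be tightened later without touching the logic.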
Automated evaluation workflows
Evaluation shouldn’t be a manual step. Integrate it into your CI/CD pipeline so every model update is automatically evaluated before deployment.
# .github/workflows/evaluate-model.yml
name: GenAI Evaluation Gate

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'config/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run evaluation suite
        run: |
          python run_evaluation.py \
            --dataset eval_dataset.jsonl \
            --thresholds groundedness=4.0,relevance=4.5,safety=low
      - name: Gate check
        run: |
          python check_thresholds.py --results evaluation_results.json
          # Exits with code 1 if any threshold fails → blocks the PR
What’s happening:
- The workflow triggers whenever prompt or config files change in a pull request
- The evaluation suite runs against the test dataset with explicit thresholds
- A gate script checks the results against those thresholds; if anything fails, it exits non-zero and the PR is blocked
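The contents of `check_thresholds.py` aren't shown in the workflow; one possible implementation is simply "load results, compare to minimums, exit non-zero on failure". The JSON shape and metric names below are assumptions:

```python
import json

def failing_metrics(metrics: dict, thresholds: dict) -> list:
    """Return the names of metrics whose aggregate score falls below its minimum."""
    return [name for name, minimum in thresholds.items()
            if metrics.get(name, 0.0) < minimum]

def gate(results_path: str, thresholds: dict) -> int:
    """Exit code for the CI step: 0 when every threshold is met, 1 otherwise."""
    with open(results_path) as f:
        metrics = json.load(f)  # assumed shape: {"groundedness": 4.2, "relevance": 4.1}
    failed = failing_metrics(metrics, thresholds)
    if failed:
        print("Evaluation gate failed:", ", ".join(failed))
        return 1  # a non-zero exit status is what blocks the PR
    return 0

# In the actual script, the last line would be something like:
#   sys.exit(gate("evaluation_results.json", {"groundedness": 4.0, "relevance": 4.5}))
print(failing_metrics({"groundedness": 4.2, "relevance": 4.1},
                      {"groundedness": 4.0, "relevance": 4.5}))
# ['relevance']
```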
Exam tip: Always run quality AND safety
The exam expects you to know that quality and safety evaluations are separate concerns that must both pass. A common trap is a scenario where only quality metrics are checked.
The correct evaluation pipeline always includes:
- Quality metrics (groundedness, relevance, coherence, fluency)
- Safety metrics (hate, violence, self-harm, sexual)
- Custom metrics (if domain requires them)
All three layers must pass before deployment.
Knowledge check
Meridian's chatbot scores 4.8 on groundedness and 4.9 on relevance, but the safety evaluation flags a 'Medium' severity for hate/unfairness on 3 of 200 test cases. Should Dr. Fatima approve deployment?
Zara needs to verify that Atlas's chatbot always includes a confidentiality notice when discussing client projects. Which evaluation approach should she use?
Next up: Monitoring GenAI in Production — keeping your live system healthy after deployment.