Evaluating AI Models & Apps
How do you know if your AI is actually good? Learn how to evaluate models for fabrications, relevance, quality, and safety — and build evaluation into your development workflow.
Why evaluation matters
Evaluation is like a quality inspection for your AI — you wouldn’t ship a product without testing it, and you shouldn’t deploy an AI model without evaluating it.
AI models can fabricate information (hallucinate), give irrelevant answers, produce unsafe content, or simply be bad at the task. Evaluation tells you which of these problems exist and how severe they are — before your users find out.
Evaluation dimensions
| Dimension | What It Measures | Evaluator | Score Range |
|---|---|---|---|
| Groundedness | Is the response based on provided context? | GroundednessEvaluator | 1-5 |
| Relevance | Does it answer the actual question? | RelevanceEvaluator | 1-5 |
| Coherence | Is the response logical and well-structured? | CoherenceEvaluator | 1-5 |
| Fluency | Is the language natural and readable? | FluencyEvaluator | 1-5 |
| Safety | Is the content free from harmful material? | ContentSafetyEvaluator | Pass/Fail |
| F1 Score | Does extraction output match expected fields? | F1ScoreEvaluator | 0-1 |
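The F1 row above can be made concrete with a token-level sketch. This is a simplified stand-in for what an F1-style evaluator computes, not the actual `F1ScoreEvaluator` implementation; the helper name and whitespace tokenization are assumptions:

```python
from collections import Counter

def token_f1(response: str, ground_truth: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred = response.lower().split()
    gold = ground_truth.lower().split()
    common = Counter(pred) & Counter(gold)  # multiset intersection of tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)   # how much of the response is correct
    recall = overlap / len(gold)      # how much of the ground truth is covered
    return 2 * precision * recall / (precision + recall)

print(token_f1("refunds within 30 days", "full refunds within 30 days"))  # → 0.888…
```

A score of 1.0 means an exact token match; 0 means no overlap, which is why the table's range is 0-1.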
Detecting fabrications
Fabrication (hallucination) detection is the most exam-relevant evaluation:
| Type | What Happens | Example | Detection Method |
|---|---|---|---|
| Factual fabrication | Model invents facts | “The policy was updated in March 2025” (it wasn’t) | Groundedness evaluator against source docs |
| Citation fabrication | Model invents references | “According to Regulation 45.2.1” (doesn’t exist) | Provenance checking against index |
| Confident fabrication | Model states guesses as facts | “This will definitely work because…” | Calibration evaluation — certainty vs accuracy |
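Provenance checking for citation fabrication can be as simple as set membership against your citation index. A minimal sketch, assuming a regulation-number citation format and an in-memory index (the regex, function name, and index contents are all illustrative):

```python
import re

# Assumed index of citation IDs that actually exist in the source corpus
KNOWN_REGULATIONS = {"12.4", "12.5", "45.1"}

def find_fabricated_citations(response: str) -> list[str]:
    """Return cited regulation numbers that are absent from the index."""
    cited = re.findall(r"Regulation (\d+(?:\.\d+)*)", response)
    return [c for c in cited if c not in KNOWN_REGULATIONS]

print(find_fabricated_citations(
    "Per Regulation 12.4 and Regulation 45.2.1, refunds apply."
))  # → ['45.2.1']
```

Real provenance checking would resolve citations against the retrieval index itself, but the principle is the same: every reference the model emits must be traceable to a real source.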
Exam tip: Groundedness vs relevance
These are different evaluation dimensions:
- Groundedness = “Is the answer based on the retrieved data?” (factual accuracy)
- Relevance = “Does the answer address what the user asked?” (topical accuracy)
A response can be grounded (all facts from source docs) but irrelevant (answering a different question than what was asked). Both matter.
Evaluation workflows
| When | Method | Purpose |
|---|---|---|
| Development | Manual evaluation with test datasets | Iterate on prompts and RAG configuration |
| CI/CD pipeline | Automated evaluators on every PR | Gate deployments on quality thresholds |
| Pre-launch | Red teaming with adversarial inputs | Find safety gaps before users do |
| Production | Continuous monitoring with sampling | Detect drift and emerging quality issues |
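The CI/CD row above amounts to a threshold gate: run the evaluators, then block the deployment if any score falls below its floor. A minimal sketch (threshold values and function names are assumptions, not a specific pipeline API):

```python
# Assumed per-metric minimum scores on the 1-5 evaluator scale
THRESHOLDS = {"coherence": 4.0, "fluency": 4.0, "relevance": 4.0}

def gate_deployment(scores: dict[str, float], safety_passed: bool) -> bool:
    """Allow deployment only if safety passes and every metric meets its floor."""
    if not safety_passed:
        return False  # safety is pass/fail, never averaged away
    return all(scores.get(metric, 0.0) >= floor
               for metric, floor in THRESHOLDS.items())

print(gate_deployment({"coherence": 4.3, "fluency": 4.5, "relevance": 4.1}, True))   # → True
print(gate_deployment({"coherence": 3.8, "fluency": 4.5, "relevance": 4.1}, True))   # → False
```

In a real pipeline this check runs on every PR, and a `False` result fails the build.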
Building an evaluation dataset
| Component | What It Contains | Example |
|---|---|---|
| Input | User question | “What is the refund policy for damaged goods?” |
| Context | Retrieved documents (for RAG) | Refund policy document, Section 3.2 |
| Expected output | Ground truth answer | “Damaged goods can be returned within 30 days for a full refund” |
| Metadata | Difficulty, category, source | Difficulty: medium, Category: refund, Source: policy_v3.pdf |
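One record from such a dataset, using the example values from the table. The field names here are illustrative (schemas vary by evaluation tool); datasets like this are commonly stored as JSONL, one record per line:

```python
import json

# A single evaluation record; field names and values follow the table above
record = {
    "input": "What is the refund policy for damaged goods?",
    "context": "Refund policy document, Section 3.2",
    "expected_output": "Damaged goods can be returned within 30 days for a full refund",
    "metadata": {"difficulty": "medium", "category": "refund", "source": "policy_v3.pdf"},
}

# Serialize to one JSONL line, then read it back
line = json.dumps(record)
print(json.loads(line)["metadata"]["category"])  # → refund
```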
Real-world example: MediaForge's evaluation pipeline
MediaForge evaluates their content generation model before every deployment:
- Test dataset: 200 content briefs with expected outputs
- Evaluators: Coherence (threshold: 4.0), Fluency (4.0), Relevance (4.0), Safety (must pass)
- CI/CD gate: If any evaluator falls below threshold, deployment is blocked
- Red teaming: Monthly adversarial testing — prompt injection, brand-unsafe content generation
- Production monitoring: Sample 5% of responses daily, run all evaluators
Result: they caught a coherence regression when updating from GPT-4o to GPT-4.1 — the new model produced shorter responses that scored lower on completeness. They adjusted the prompt before deploying.
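The 5% production sampling step can be sketched as a simple random gate per response. A minimal sketch (function name and rate constant are assumptions; real monitoring would also stratify by category and log scores):

```python
import random

SAMPLE_RATE = 0.05  # evaluate 5% of production responses, as in the example

def should_evaluate(rng: random.Random) -> bool:
    """Decide whether to route this response through the evaluator suite."""
    return rng.random() < SAMPLE_RATE

# Over many responses, roughly 5% get sampled
rng = random.Random(42)
sampled = sum(should_evaluate(rng) for _ in range(10_000))
print(sampled)  # roughly 500 of 10,000
```

Sampling keeps evaluation cost bounded while still surfacing drift, like the coherence regression MediaForge caught, before it affects most users.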
Key terms
Knowledge check
NeuralMed's patient chatbot scores 4.5 on fluency and 4.2 on coherence, but only 2.1 on groundedness. What does this tell you?
Atlas Financial wants to ensure no AI model deployment happens without passing quality checks. Where should they add evaluation?