Evaluating AI Models & Apps
How do you know if your AI is actually good? Learn how to evaluate models for fabrications, relevance, quality, and safety — and build evaluation into your development workflow.
Why evaluation matters
Evaluation is like a quality inspection for your AI — you wouldn’t ship a product without testing it, and you shouldn’t deploy an AI model without evaluating it.
AI models can fabricate information (hallucinate), give irrelevant answers, produce unsafe content, or simply be bad at the task. Evaluation tells you which of these problems exist and how severe they are — before your users find out.
Evaluation dimensions
| Dimension | What It Measures | Evaluator | Score Range |
|---|---|---|---|
| Groundedness | Is the response based on provided context? | GroundednessEvaluator | 1-5 |
| Relevance | Does it answer the actual question? | RelevanceEvaluator | 1-5 |
| Coherence | Is the response logical and well-structured? | CoherenceEvaluator | 1-5 |
| Fluency | Is the language natural and readable? | FluencyEvaluator | 1-5 |
| Safety | Is the content free from harmful material? | ContentSafetyEvaluator | Pass/Fail |
| F1 Score | Does extraction output match expected fields? | F1ScoreEvaluator | 0-1 |
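The F1 row above can be made concrete with a token-level sketch. This is a simplified stand-in for what an F1-style evaluator computes, not the actual `F1ScoreEvaluator` implementation; the helper name and whitespace tokenization are assumptions:

```python
from collections import Counter

def token_f1(response: str, ground_truth: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred = response.lower().split()
    gold = ground_truth.lower().split()
    common = Counter(pred) & Counter(gold)  # multiset intersection of tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)   # how much of the response is correct
    recall = overlap / len(gold)      # how much of the ground truth is covered
    return 2 * precision * recall / (precision + recall)

print(token_f1("refunds within 30 days", "full refunds within 30 days"))  # → 0.888…
```

A score of 1.0 means an exact token match; 0 means no overlap, which is why the table's range is 0-1.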
Detecting fabrications
Fabrication (hallucination) detection is the most exam-relevant evaluation:
| Type | What Happens | Example | Detection Method |
|---|---|---|---|
| Factual fabrication | Model invents facts | “The policy was updated in March 2025” (it wasn’t) | Groundedness evaluator against source docs |
| Citation fabrication | Model invents references | “According to Regulation 45.2.1” (doesn’t exist) | Provenance checking against index |
| Confident fabrication | Model states guesses as facts | “This will definitely work because…” | Calibration evaluation — certainty vs accuracy |
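Provenance checking for citation fabrication can be as simple as set membership against your citation index. A minimal sketch, assuming a regulation-number citation format and an in-memory index (the regex, function name, and index contents are all illustrative):

```python
import re

# Assumed index of citation IDs that actually exist in the source corpus
KNOWN_REGULATIONS = {"12.4", "12.5", "45.1"}

def find_fabricated_citations(response: str) -> list[str]:
    """Return cited regulation numbers that are absent from the index."""
    cited = re.findall(r"Regulation (\d+(?:\.\d+)*)", response)
    return [c for c in cited if c not in KNOWN_REGULATIONS]

print(find_fabricated_citations(
    "Per Regulation 12.4 and Regulation 45.2.1, refunds apply."
))  # → ['45.2.1']
```

Real provenance checking would resolve citations against the retrieval index itself, but the principle is the same: every reference the model emits must be traceable to a real source.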
Exam tip: Groundedness vs relevance
These are different evaluation dimensions:
- Groundedness = “Is the answer based on the retrieved data?” (factual accuracy)
- Relevance = “Does the answer address what the user asked?” (topical accuracy)
A response can be grounded (all facts from source docs) but irrelevant (answering a different question than what was asked). Both matter.
Evaluation workflows
| When | Method | Purpose |
|---|---|---|
| Development | Manual evaluation with test datasets | Iterate on prompts and RAG configuration |
| CI/CD pipeline | Automated evaluators on every PR | Gate deployments on quality thresholds |
| Pre-launch | Red teaming with adversarial inputs | Find safety gaps before users do |
| Production | Continuous monitoring with sampling | Detect drift and emerging quality issues |
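The CI/CD row above amounts to a threshold gate: run the evaluators, then block the deployment if any score falls below its floor. A minimal sketch (threshold values and function names are assumptions, not a specific pipeline API):

```python
# Assumed per-metric minimum scores on the 1-5 evaluator scale
THRESHOLDS = {"coherence": 4.0, "fluency": 4.0, "relevance": 4.0}

def gate_deployment(scores: dict[str, float], safety_passed: bool) -> bool:
    """Allow deployment only if safety passes and every metric meets its floor."""
    if not safety_passed:
        return False  # safety is pass/fail, never averaged away
    return all(scores.get(metric, 0.0) >= floor
               for metric, floor in THRESHOLDS.items())

print(gate_deployment({"coherence": 4.3, "fluency": 4.5, "relevance": 4.1}, True))   # → True
print(gate_deployment({"coherence": 3.8, "fluency": 4.5, "relevance": 4.1}, True))   # → False
```

In a real pipeline this check runs on every PR, and a `False` result fails the build.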
Building an evaluation dataset
| Component | What It Contains | Example |
|---|---|---|
| Input | User question | “What is the refund policy for damaged goods?” |
| Context | Retrieved documents (for RAG) | Refund policy document, Section 3.2 |
| Expected output | Ground truth answer | “Damaged goods can be returned within 30 days for a full refund” |
| Metadata | Difficulty, category, source | Difficulty: medium, Category: refund, Source: policy_v3.pdf |
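One record from such a dataset, using the example values from the table. The field names here are illustrative (schemas vary by evaluation tool); datasets like this are commonly stored as JSONL, one record per line:

```python
import json

# A single evaluation record; field names and values follow the table above
record = {
    "input": "What is the refund policy for damaged goods?",
    "context": "Refund policy document, Section 3.2",
    "expected_output": "Damaged goods can be returned within 30 days for a full refund",
    "metadata": {"difficulty": "medium", "category": "refund", "source": "policy_v3.pdf"},
}

# Serialize to one JSONL line, then read it back
line = json.dumps(record)
print(json.loads(line)["metadata"]["category"])  # → refund
```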
Real-world example: MediaForge's evaluation pipeline
MediaForge evaluates their content generation model before every deployment:
- Test dataset: 200 content briefs with expected outputs
- Evaluators: Coherence (threshold: 4.0), Fluency (4.0), Relevance (4.0), Safety (must pass)
- CI/CD gate: If any evaluator falls below threshold, deployment is blocked
- Red teaming: Monthly adversarial testing — prompt injection, brand-unsafe content generation
- Production monitoring: Sample 5% of responses daily, run all evaluators
Result: they caught a coherence regression when updating from GPT-4o to GPT-4.1 — the new model produced shorter responses that scored lower on completeness. They adjusted the prompt before deploying.
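The 5% production sampling step can be sketched as a simple random gate per response. A minimal sketch (function name and rate constant are assumptions; real monitoring would also stratify by category and log scores):

```python
import random

SAMPLE_RATE = 0.05  # evaluate 5% of production responses, as in the example

def should_evaluate(rng: random.Random) -> bool:
    """Decide whether to route this response through the evaluator suite."""
    return rng.random() < SAMPLE_RATE

# Over many responses, roughly 5% get sampled
rng = random.Random(42)
sampled = sum(should_evaluate(rng) for _ in range(10_000))
print(sampled)  # roughly 500 of 10,000
```

Sampling keeps evaluation cost bounded while still surfacing drift, like the coherence regression MediaForge caught, before it affects most users.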
Key terms
Knowledge check
NeuralMed's patient chatbot scores 4.5 on fluency and 4.2 on coherence, but only 2.1 on groundedness. What does this tell you?
Atlas Financial wants to ensure no AI model deployment happens without passing quality checks. Where should they add evaluation?