Reviewing Results & Tuning Performance
Interpret evaluation results, identify failure patterns, and iteratively improve agent quality through data-driven tuning.
From Scores to Action
Running a test set gives you numbers. Numbers alone don’t fix anything — you need to interpret them, spot patterns, and take targeted action. This module is about turning raw evaluation results into agent improvements.
🤖 Lena’s scenario: Lena is the AI engineer responsible for a healthcare agent that helps clinic staff check drug interactions, dosage guidelines, and formulary status. After her latest evaluation run, the agent scored 91% on response quality but only 68% on accuracy. Something is wrong — and she needs to find out what.
Reading the Results Dashboard
When an evaluation completes, you see several layers of information:
Aggregate scores
The top-level view shows overall pass rates and average scores for each evaluation method you selected. Think of these as your agent’s GPA — they tell you the big picture but not the details.
Lena’s dashboard shows:
- Accuracy: 68% (below her 85% threshold)
- Grounding quality: 82% (acceptable but not great)
- Response quality: 91% (strong)
The response quality score is misleading on its own — the agent sounds confident and helpful even when giving wrong answers. This is a classic pattern: fluency masks factual errors.
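This gap is easy to reproduce in miniature. Below is a minimal sketch, in Python, using hypothetical per-case records (the field names and sample cases are illustrative, not from any real evaluation API), showing how averaging per-case scores can produce a high fluency number alongside a low accuracy number:

```python
# Hypothetical per-case results; field names and values are illustrative only.
results = [
    {"case": "warfarin + ibuprofen interaction", "accuracy": 0.20, "response_quality": 0.95},
    {"case": "metformin dosage guideline",       "accuracy": 0.90, "response_quality": 0.92},
    {"case": "statin formulary status",          "accuracy": 0.95, "response_quality": 0.88},
]

def aggregate(results, metric):
    """Average one metric across all test cases (the 'GPA' view)."""
    return sum(r[metric] for r in results) / len(results)

print(round(aggregate(results, "accuracy"), 2))          # → 0.68
print(round(aggregate(results, "response_quality"), 2))  # → 0.92
```

Whenever a fluency-style metric runs well ahead of accuracy like this, treat the high score as suspect and drill into the failing cases rather than trusting the average.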
Per-test-case breakdown
Drill into individual test cases to see exactly which questions the agent got wrong. Each failed case shows the input, the expected output, the actual response, and the score per method.
Lena filters to accuracy failures and sorts by score (lowest first). A cluster of failures jumps out — 12 of the 15 accuracy failures involve drug interaction queries.
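Lena's filter-and-sort step is a simple projection over the per-case records. A minimal sketch, assuming each case carries an id, a category tag, and an accuracy score (all names hypothetical):

```python
# Illustrative case records; the 0.85 threshold mirrors Lena's target.
THRESHOLD = 0.85
cases = [
    {"id": 1, "tag": "interactions", "accuracy": 0.10},
    {"id": 2, "tag": "dosage",       "accuracy": 0.95},
    {"id": 3, "tag": "interactions", "accuracy": 0.30},
    {"id": 4, "tag": "formulary",    "accuracy": 0.90},
]

# Filter to accuracy failures, then sort lowest score first
# so the worst cases surface at the top of the review queue.
failures = sorted(
    (c for c in cases if c["accuracy"] < THRESHOLD),
    key=lambda c: c["accuracy"],
)
print([c["id"] for c in failures])  # → [1, 3]
```

Sorting lowest-first matters: the worst failures usually share a root cause, so reviewing them first reveals the cluster fastest.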
Domain-level grouping
If your test cases are tagged by category or topic, you can see which domains perform well and which struggle. This is enormously helpful for large test sets where scanning individual cases would take hours.
Lena’s test cases are tagged by query type: dosage, formulary, interactions, and general. The interaction category has a 45% accuracy rate while everything else is above 85%. The problem is localized.
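Once cases carry category labels, domain-level grouping takes only a few lines. A sketch using the same hypothetical record shape as above:

```python
from collections import defaultdict

# Hypothetical tagged results; values chosen to mirror Lena's breakdown.
cases = [
    {"tag": "interactions", "accuracy": 0.40},
    {"tag": "interactions", "accuracy": 0.50},
    {"tag": "dosage",       "accuracy": 0.90},
    {"tag": "formulary",    "accuracy": 0.95},
]

def accuracy_by_tag(cases):
    """Bucket per-case scores by tag and average each bucket."""
    buckets = defaultdict(list)
    for c in cases:
        buckets[c["tag"]].append(c["accuracy"])
    return {tag: sum(scores) / len(scores) for tag, scores in buckets.items()}

print(accuracy_by_tag(cases))
# The interactions bucket averages 0.45 while every other tag stays above 0.85.
```

A single low bucket next to healthy ones is the signal that the problem is localized, not systemic.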
Common Failure Patterns
Across hundreds of agent evaluations, certain failure patterns appear again and again. Recognizing these patterns lets you skip straight to the likely root cause.
Wrong topic routing
The agent sends the query to the wrong topic, producing a response that’s on-topic for something else entirely. Symptom: low topic matching scores on specific query types.
Fix: Adjust trigger phrases on the affected topics. Add more example phrases to the correct topic and add negative examples (phrases that should NOT trigger it) to the misrouted topic.
Poor grounding (hallucination)
The agent generates an answer that sounds plausible but isn’t supported by any knowledge source. Symptom: low grounding quality scores even when accuracy might be partially correct.
Fix: Check if the relevant knowledge source is actually uploaded and indexed. Verify the agent’s instructions tell it to use knowledge sources. Consider adding explicit instructions like “only answer drug interaction questions based on the uploaded formulary document.”
Outdated knowledge
The agent references information that was correct at one point but has since changed. Symptom: accuracy failures clustered around a specific data domain.
This is exactly Lena’s problem. The drug interaction data in the knowledge base is from six months ago. Three medications have been added to the formulary since then, and two interactions have been reclassified. The agent is grounding correctly (using the uploaded data) but the data itself is stale.
Fix: Update the knowledge source. Set a recurring reminder to refresh data on a schedule.
Missing knowledge coverage
The agent simply doesn’t have the information needed to answer certain questions, so it either refuses (“I don’t know”) or hallucinates. Symptom: failures concentrated in a specific domain with either empty responses or low grounding.
Fix: Upload additional knowledge sources covering the gap. Alternatively, add a topic that gracefully handles out-of-scope queries with a helpful redirect.
The Iterative Improvement Workflow
Improving an agent isn’t a one-shot fix. It’s a cycle:
- Review — Analyze evaluation results, identify the lowest-scoring areas
- Identify — Classify the failure pattern (wrong topic, poor grounding, outdated knowledge, missing coverage)
- Fix — Apply the targeted remedy for that pattern
- Re-evaluate — Run the same test set again and compare to the previous snapshot
- Repeat — If scores improved but haven’t reached threshold, continue the cycle
The key discipline is changing only one thing at a time. If you update knowledge sources AND rewrite instructions AND change topic triggers simultaneously, you won’t know which change helped (or hurt).
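The re-evaluate step boils down to diffing two snapshots of the same metrics. A small sketch, assuming each snapshot is a metric-to-score dictionary (the `compare` helper is illustrative, not a platform API):

```python
def compare(before, after, threshold=0.85):
    """Report the per-metric delta between two eval runs and threshold status."""
    report = {}
    for metric in before:
        delta = after[metric] - before[metric]
        status = "met" if after[metric] >= threshold else "below threshold"
        report[metric] = (round(delta, 2), status)
    return report

# Lena's cycle 1: accuracy improves but hasn't reached the 85% threshold yet.
print(compare({"accuracy": 0.68}, {"accuracy": 0.79}))
# → {'accuracy': (0.11, 'below threshold')}
```

Because only one variable changed between the snapshots, the delta can be attributed to that fix with confidence.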
Lena’s improvement cycle
Cycle 1: Lena updates the formulary document with current drug interaction data. Re-evaluation: accuracy jumps from 68% to 79%. Progress, but still below the 85% threshold.
Cycle 2: She examines the remaining failures. Five cases involve newly approved medications that aren’t in any uploaded document yet. She adds a supplementary knowledge file covering the new approvals. Re-evaluation: accuracy reaches 84%.
Cycle 3: One last cluster of failures — the agent provides generic interaction warnings instead of severity-specific ones. Lena adds an instruction: “When reporting drug interactions, always include the severity level (major, moderate, minor) from the formulary.” Re-evaluation: accuracy hits 88%. Threshold met.
Three cycles. Three targeted fixes. Each one addressing a specific failure pattern identified from the data.
| Failure Pattern | Key Symptom | Root Cause | Fix |
|---|---|---|---|
| Wrong topic routing | Low topic matching on specific queries | Insufficient or ambiguous trigger phrases | Add/refine trigger phrases and negative examples |
| Hallucination | Low grounding quality scores | Missing knowledge or weak grounding instructions | Upload relevant data, strengthen grounding instructions |
| Outdated knowledge | Low accuracy with good grounding | Stale data in knowledge sources | Update knowledge, set refresh schedule |
| Missing coverage | Failures in a specific domain | No knowledge source for that area | Upload new knowledge or add graceful fallback topic |
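The signature table above can be read as a rough decision procedure. A heuristic sketch, with purely illustrative thresholds and a hypothetical `has_knowledge` flag standing in for "a knowledge source covers this domain":

```python
def classify_failure(topic_match, grounding, accuracy, has_knowledge=True):
    """Triage a failed case against the signature table; thresholds are illustrative."""
    if topic_match < 0.5:
        return "wrong topic routing"    # query landed on the wrong topic
    if not has_knowledge:
        return "missing coverage"       # no source exists for this domain
    if grounding < 0.5:
        return "hallucination"          # answer isn't supported by any source
    if accuracy < 0.5:
        return "outdated knowledge"     # grounded correctly, but on stale data
    return "no clear pattern"

# Lena's case: good topic match, decent grounding, poor accuracy.
print(classify_failure(topic_match=0.9, grounding=0.82, accuracy=0.3))
# → outdated knowledge
```

A heuristic like this won't replace reading the failed cases, but it speeds up the Identify step when a test set has hundreds of failures to triage.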
Knowledge Check
- Lena’s healthcare agent scores 82% on grounding quality but only 68% on accuracy for drug interaction queries. The knowledge source was last updated six months ago. What is the most likely failure pattern?
- After updating knowledge sources, Lena re-runs her test set and accuracy improves from 68% to 79%. What should she do next?
- Which combination of evaluation scores would MOST strongly suggest a hallucination problem?
Key Takeaways
- Aggregate scores show the big picture; per-case breakdowns reveal exactly what failed and why
- Four common failure patterns: wrong topic routing, hallucination, outdated knowledge, and missing coverage
- Each pattern has a distinct signature in evaluation scores — learn to read the pattern, not just the number
- The iterative improvement cycle (review, identify, fix, re-evaluate, repeat) is the disciplined path to quality
- Change one variable at a time to attribute improvements to specific fixes