Reviewing Results & Tuning Performance
Interpret evaluation results, identify failure patterns, and iteratively improve agent quality through data-driven tuning.
From Scores to Action
Running a test set gives you numbers. Numbers alone don’t fix anything — you need to interpret them, spot patterns, and take targeted action. This module is about turning raw evaluation results into agent improvements.
🤖 Lena’s scenario: Lena is the AI engineer responsible for a healthcare agent that helps clinic staff check drug interactions, dosage guidelines, and formulary status. After her latest evaluation run, the agent scored 91% on response quality but only 68% on accuracy. Something is wrong — and she needs to find out what.
Reading the Results Dashboard
When an evaluation completes, you see several layers of information:
Aggregate scores
The top-level view shows overall pass rates and average scores for each evaluation method you selected. Think of these as your agent’s GPA — they tell you the big picture but not the details.
Lena’s dashboard shows:
- Accuracy: 68% (below her 85% threshold)
- Grounding quality: 82% (acceptable but not great)
- Response quality: 91% (strong)
The response quality score is misleading on its own — the agent sounds confident and helpful even when giving wrong answers. This is a classic pattern: fluency masks factual errors.
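This gap is easy to reproduce in miniature. Below is a minimal sketch, in Python, using hypothetical per-case records (the field names and sample cases are illustrative, not from any real evaluation API), showing how averaging per-case scores can produce a high fluency number alongside a low accuracy number:

```python
# Hypothetical per-case results; field names and values are illustrative only.
results = [
    {"case": "warfarin + ibuprofen interaction", "accuracy": 0.20, "response_quality": 0.95},
    {"case": "metformin dosage guideline",       "accuracy": 0.90, "response_quality": 0.92},
    {"case": "statin formulary status",          "accuracy": 0.95, "response_quality": 0.88},
]

def aggregate(results, metric):
    """Average one metric across all test cases (the 'GPA' view)."""
    return sum(r[metric] for r in results) / len(results)

print(round(aggregate(results, "accuracy"), 2))          # → 0.68
print(round(aggregate(results, "response_quality"), 2))  # → 0.92
```

Whenever a fluency-style metric runs well ahead of accuracy like this, treat the high score as suspect and drill into the failing cases rather than trusting the average.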
Per-test-case breakdown
Drill into individual test cases to see exactly which questions the agent got wrong. Each failed case shows the input, the expected output, the actual response, and the score per method.
Lena filters to accuracy failures and sorts by score (lowest first). A cluster of failures jumps out — 12 of the 15 accuracy failures involve drug interaction queries.
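Lena's filter-and-sort step is a simple projection over the per-case records. A minimal sketch, assuming each case carries an id, a category tag, and an accuracy score (all names hypothetical):

```python
# Illustrative case records; the 0.85 threshold mirrors Lena's target.
THRESHOLD = 0.85
cases = [
    {"id": 1, "tag": "interactions", "accuracy": 0.10},
    {"id": 2, "tag": "dosage",       "accuracy": 0.95},
    {"id": 3, "tag": "interactions", "accuracy": 0.30},
    {"id": 4, "tag": "formulary",    "accuracy": 0.90},
]

# Filter to accuracy failures, then sort lowest score first
# so the worst cases surface at the top of the review queue.
failures = sorted(
    (c for c in cases if c["accuracy"] < THRESHOLD),
    key=lambda c: c["accuracy"],
)
print([c["id"] for c in failures])  # → [1, 3]
```

Sorting lowest-first matters: the worst failures usually share a root cause, so reviewing them first reveals the cluster fastest.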
Domain-level grouping
If your test cases are tagged by category or topic, you can see which domains perform well and which struggle. This is enormously helpful for large test sets where scanning individual cases would take hours.
Lena’s test cases are tagged by query type: dosage, formulary, interactions, and general. The interaction category has a 45% accuracy rate while everything else is above 85%. The problem is localized.
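Once cases carry category labels, domain-level grouping takes only a few lines. A sketch using the same hypothetical record shape as above:

```python
from collections import defaultdict

# Hypothetical tagged results; values chosen to mirror Lena's breakdown.
cases = [
    {"tag": "interactions", "accuracy": 0.40},
    {"tag": "interactions", "accuracy": 0.50},
    {"tag": "dosage",       "accuracy": 0.90},
    {"tag": "formulary",    "accuracy": 0.95},
]

def accuracy_by_tag(cases):
    """Bucket per-case scores by tag and average each bucket."""
    buckets = defaultdict(list)
    for c in cases:
        buckets[c["tag"]].append(c["accuracy"])
    return {tag: sum(scores) / len(scores) for tag, scores in buckets.items()}

print(accuracy_by_tag(cases))
# The interactions bucket averages 0.45 while every other tag stays above 0.85.
```

A single low bucket next to healthy ones is the signal that the problem is localized, not systemic.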
Common Failure Patterns
Across hundreds of agent evaluations, certain failure patterns appear again and again. Recognizing these patterns lets you skip straight to the likely root cause.
Wrong topic routing
The agent sends the query to the wrong topic, producing a response that’s on-topic for something else entirely. Symptom: low topic matching scores on specific query types.
Fix: Adjust trigger phrases on the affected topics. Add more example phrases to the correct topic and add negative examples (phrases that should NOT trigger it) to the misrouted topic.
Poor grounding (hallucination)
The agent generates an answer that sounds plausible but isn’t supported by any knowledge source. Symptom: low grounding quality scores even when accuracy might be partially correct.
Fix: Check if the relevant knowledge source is actually uploaded and indexed. Verify the agent’s instructions tell it to use knowledge sources. Consider adding explicit instructions like “only answer drug interaction questions based on the uploaded formulary document.”
Outdated knowledge
The agent references information that was correct at one point but has since changed. Symptom: accuracy failures clustered around a specific data domain.
This is exactly Lena’s problem. The drug interaction data in the knowledge base is from six months ago. Three medications have been added to the formulary since then, and two interactions have been reclassified. The agent is grounding correctly (using the uploaded data) but the data itself is stale.
Fix: Update the knowledge source. Set a recurring reminder to refresh data on a schedule.
Missing knowledge coverage
The agent simply doesn’t have the information needed to answer certain questions, so it either refuses (“I don’t know”) or hallucinates. Symptom: failures concentrated in a specific domain with either empty responses or low grounding.
Fix: Upload additional knowledge sources covering the gap. Alternatively, add a topic that gracefully handles out-of-scope queries with a helpful redirect.
The Iterative Improvement Workflow
Improving an agent isn’t a one-shot fix. It’s a cycle:
- Review — Analyze evaluation results, identify the lowest-scoring areas
- Identify — Classify the failure pattern (wrong topic, poor grounding, outdated knowledge, missing coverage)
- Fix — Apply the targeted remedy for that pattern
- Re-evaluate — Run the same test set again and compare to the previous snapshot
- Repeat — If scores improved but haven’t reached threshold, continue the cycle
The key discipline is changing only one thing at a time. If you update knowledge sources AND rewrite instructions AND change topic triggers simultaneously, you won’t know which change helped (or hurt).
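The re-evaluate step boils down to diffing two snapshots of the same metrics. A small sketch, assuming each snapshot is a metric-to-score dictionary (the `compare` helper is illustrative, not a platform API):

```python
def compare(before, after, threshold=0.85):
    """Report the per-metric delta between two eval runs and threshold status."""
    report = {}
    for metric in before:
        delta = after[metric] - before[metric]
        status = "met" if after[metric] >= threshold else "below threshold"
        report[metric] = (round(delta, 2), status)
    return report

# Lena's cycle 1: accuracy improves but hasn't reached the 85% threshold yet.
print(compare({"accuracy": 0.68}, {"accuracy": 0.79}))
# → {'accuracy': (0.11, 'below threshold')}
```

Because only one variable changed between the snapshots, the delta can be attributed to that fix with confidence.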
Lena’s improvement cycle
Cycle 1: Lena updates the formulary document with current drug interaction data. Re-evaluation: accuracy jumps from 68% to 79%. Progress, but still below the 85% threshold.
Cycle 2: She examines the remaining failures. Five cases involve newly approved medications that aren’t in any uploaded document yet. She adds a supplementary knowledge file covering the new approvals. Re-evaluation: accuracy reaches 84%.
Cycle 3: One last cluster of failures — the agent provides generic interaction warnings instead of severity-specific ones. Lena adds an instruction: “When reporting drug interactions, always include the severity level (major, moderate, minor) from the formulary.” Re-evaluation: accuracy hits 88%. Threshold met.
Three cycles. Three targeted fixes. Each one addressing a specific failure pattern identified from the data.
| Failure Pattern | Key Symptom | Root Cause | Fix |
|---|---|---|---|
| Wrong topic routing | Low topic matching on specific queries | Insufficient or ambiguous trigger phrases | Add/refine trigger phrases and negative examples |
| Hallucination | Low grounding quality scores | Missing knowledge or weak grounding instructions | Upload relevant data, strengthen grounding instructions |
| Outdated knowledge | Low accuracy with good grounding | Stale data in knowledge sources | Update knowledge, set refresh schedule |
| Missing coverage | Failures in a specific domain | No knowledge source for that area | Upload new knowledge or add graceful fallback topic |
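The signature table above can be read as a rough decision procedure. A heuristic sketch, with purely illustrative thresholds and a hypothetical `has_knowledge` flag standing in for "a knowledge source covers this domain":

```python
def classify_failure(topic_match, grounding, accuracy, has_knowledge=True):
    """Triage a failed case against the signature table; thresholds are illustrative."""
    if topic_match < 0.5:
        return "wrong topic routing"    # query landed on the wrong topic
    if not has_knowledge:
        return "missing coverage"       # no source exists for this domain
    if grounding < 0.5:
        return "hallucination"          # answer isn't supported by any source
    if accuracy < 0.5:
        return "outdated knowledge"     # grounded correctly, but on stale data
    return "no clear pattern"

# Lena's case: good topic match, decent grounding, poor accuracy.
print(classify_failure(topic_match=0.9, grounding=0.82, accuracy=0.3))
# → outdated knowledge
```

A heuristic like this won't replace reading the failed cases, but it speeds up the Identify step when a test set has hundreds of failures to triage.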
Knowledge Check
- Lena’s healthcare agent scores 82% on grounding quality but only 68% on accuracy for drug interaction queries. The knowledge source was last updated six months ago. What is the most likely failure pattern?
- After updating knowledge sources, Lena re-runs her test set and accuracy improves from 68% to 79%. What should she do next?
- Which combination of evaluation scores would MOST strongly suggest a hallucination problem?
Key Takeaways
- Aggregate scores show the big picture; per-case breakdowns reveal exactly what failed and why
- Four common failure patterns: wrong topic routing, hallucination, outdated knowledge, and missing coverage
- Each pattern has a distinct signature in evaluation scores — learn to read the pattern, not just the number
- The iterative improvement cycle (review, identify, fix, re-evaluate, repeat) is the disciplined path to quality
- Change one variable at a time to attribute improvements to specific fixes