
AB-620 Study Guide

Domain 1: Plan and Configure Agent Solutions

  • Getting Started: Copilot Studio for Developers Free
  • Planning Enterprise Integration and Reusable Components Free
  • Identity Strategy for Agents Free
  • Channels, Deployment and Audience Design Free
  • Responsible AI and Security Governance Free
  • Agent Flows: Build, Monitor and Handle Errors Free
  • Human-in-the-Loop Agent Flows Free
  • Topics, Tools and Variables Free
  • Advanced Responses: Custom Prompts and Generative Answers Free
  • API Calls, HTTP Requests and Adaptive Cards Free

Domain 2: Integrate and Extend Agents in Copilot Studio

  • Enterprise Knowledge Sources: The Big Picture
  • Copilot Connectors and Power Platform Connectors
  • Azure AI Search as a Knowledge Source
  • Adding Tools: Custom Connectors and REST APIs
  • MCP Tools: Model Context Protocol in Action
  • Computer Use: Agent-Driven UI Automation
  • Multi-Agent Solutions: Design and Agent Reuse
  • Integrating Foundry Agents
  • Fabric Data Agents: Analytics Meets AI
  • A2A Protocol: Cross-Platform Agent Collaboration
  • Grounded Answers: Azure AI Search with Foundry
  • Foundry Model Catalog and Application Insights

Domain 3: Test and Manage Agents

  • Test Sets & Evaluation Methods
  • Reviewing Results & Tuning Performance
  • Solutions & Environment Variables
  • Power Platform Pipelines for Agent ALM
  • Agent Lifecycle: From Dev to Production
  • Exam Prep: Diagnostic Review

Domain 3: Test and Manage Agents

Reviewing Results & Tuning Performance

Interpret evaluation results, identify failure patterns, and iteratively improve agent quality through data-driven tuning.

☕ Simple explanation

From Scores to Action

Running a test set gives you numbers. Numbers alone don’t fix anything — you need to interpret them, spot patterns, and take targeted action. This module is about turning raw evaluation results into agent improvements.

🤖 Lena’s scenario: Lena is the AI engineer responsible for a healthcare agent that helps clinic staff check drug interactions, dosage guidelines, and formulary status. After her latest evaluation run, the agent scored 91% on response quality but only 68% on accuracy. Something is wrong — and she needs to find out what.

Reading the Results Dashboard

When an evaluation completes, you see several layers of information:

Aggregate scores

The top-level view shows overall pass rates and average scores for each evaluation method you selected. Think of these as your agent’s GPA — they tell you the big picture but not the details.

Lena’s dashboard shows:

  • Accuracy: 68% (below her 85% threshold)
  • Grounding quality: 82% (acceptable but not great)
  • Response quality: 91% (strong)

The response quality score is misleading on its own — the agent sounds confident and helpful even when giving wrong answers. This is a classic pattern: fluency masks factual errors.
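Copilot Studio surfaces these aggregates in its dashboard, but the arithmetic behind them is simple enough to sketch. A minimal illustration (the result records and threshold values below are hypothetical, not an official export format or API):

```python
# Hypothetical per-test-case results: each case holds one score per evaluation method.
results = [
    {"accuracy": 0.9, "grounding": 0.80, "quality": 0.95},
    {"accuracy": 0.4, "grounding": 0.90, "quality": 0.92},
    {"accuracy": 0.7, "grounding": 0.75, "quality": 0.88},
]
thresholds = {"accuracy": 0.85, "grounding": 0.80, "quality": 0.80}

def aggregate(results, thresholds):
    """Average each method's score across cases and flag methods below threshold."""
    methods = thresholds.keys()
    averages = {m: sum(r[m] for r in results) / len(results) for m in methods}
    flagged = [m for m in methods if averages[m] < thresholds[m]]
    return averages, flagged

averages, flagged = aggregate(results, thresholds)
# Here only "accuracy" falls below its threshold -- the GPA view in miniature.
```

The point of the sketch is the last line: an aggregate only tells you *which* method is underwater, never *why*, which is what the per-case breakdown is for.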

Per-test-case breakdown

Drill into individual test cases to see exactly which questions the agent got wrong. Each failed case shows the input, the expected output, the actual response, and the score per method.

Lena filters to accuracy failures and sorts by score (lowest first). A cluster of failures jumps out — 12 of the 15 accuracy failures involve drug interaction queries.

Domain-level grouping

If your test cases are tagged by category or topic, you can see which domains perform well and which struggle. This is enormously helpful for large test sets where scanning individual cases would take hours.

Lena’s test cases are tagged by query type: dosage, formulary, interactions, and general. The interaction category has a 45% accuracy rate while everything else is above 85%. The problem is localized.
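If you can export per-case results, the same triage is easy to script. A sketch assuming hypothetical tagged records (this is not a Copilot Studio export schema; field names are illustrative):

```python
from collections import defaultdict

# Hypothetical evaluation records: one per test case, tagged by query type.
cases = [
    {"tag": "interactions", "accuracy": 0.30, "passed": False},
    {"tag": "interactions", "accuracy": 0.50, "passed": False},
    {"tag": "interactions", "accuracy": 0.90, "passed": True},
    {"tag": "dosage",       "accuracy": 0.90, "passed": True},
    {"tag": "formulary",    "accuracy": 0.95, "passed": True},
]

# Filter to failures and sort worst-first -- the triage view Lena uses.
failures = sorted((c for c in cases if not c["passed"]), key=lambda c: c["accuracy"])

def pass_rate_by_tag(cases):
    """Group pass/fail flags by tag to localize which domain is struggling."""
    buckets = defaultdict(list)
    for c in cases:
        buckets[c["tag"]].append(c["passed"])
    return {tag: sum(flags) / len(flags) for tag, flags in buckets.items()}

rates = pass_rate_by_tag(cases)
# "interactions" sits far below the other tags -- the problem is localized.
```

Sorting failures lowest-first and grouping by tag is the two-step pattern that turned Lena's 68% aggregate into "12 of 15 failures are interaction queries."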

Question

Why can a high 'response quality' score be misleading when 'accuracy' is low?


Answer

Because response quality measures how helpful, clear, and well-structured the answer sounds — not whether it's factually correct. An agent can give a confident, well-formatted wrong answer and score high on quality but low on accuracy. Always check accuracy alongside quality.


Common Failure Patterns

After reviewing hundreds of agent evaluations, certain failure patterns appear again and again. Recognizing these patterns lets you skip straight to the likely root cause.

Wrong topic routing

The agent sends the query to the wrong topic, producing a response that’s on-topic for something else entirely. Symptom: low topic matching scores on specific query types.

Fix: Adjust trigger phrases on the affected topics. Add more example phrases to the correct topic and add negative examples (phrases that should NOT trigger it) to the misrouted topic.

Poor grounding (hallucination)

The agent generates an answer that sounds plausible but isn’t supported by any knowledge source. Symptom: low grounding quality scores, even when accuracy is partially correct.

Fix: Check if the relevant knowledge source is actually uploaded and indexed. Verify the agent’s instructions tell it to use knowledge sources. Consider adding explicit instructions like “only answer drug interaction questions based on the uploaded formulary document.”

Outdated knowledge

The agent references information that was correct at one point but has since changed. Symptom: accuracy failures clustered around a specific data domain.

This is exactly Lena’s problem. The drug interaction data in the knowledge base is from six months ago. Three medications have been added to the formulary since then, and two interactions have been reclassified. The agent is grounding correctly (using the uploaded data) but the data itself is stale.

Fix: Update the knowledge source. Set a recurring reminder to refresh data on a schedule.

Missing knowledge coverage

The agent simply doesn’t have information to answer certain questions, so it either refuses (“I don’t know”) or hallucinates. Symptom: failures concentrated in a specific domain with either empty responses or low grounding.

Fix: Upload additional knowledge sources covering the gap. Alternatively, add a topic that gracefully handles out-of-scope queries with a helpful redirect.

Question

What failure pattern causes correct grounding scores but low accuracy scores?


Answer

Outdated knowledge. The agent is correctly grounding its answers in the uploaded data (high grounding score), but the data itself is stale or incorrect, leading to factually wrong answers (low accuracy). The fix is to update the knowledge source.


Question

An agent routes 'What is the refund policy?' to the shipping topic instead of the returns topic. What is this failure pattern, and how do you fix it?


Answer

Wrong topic routing. Fix by adjusting trigger phrases — add 'refund policy' examples to the returns topic's triggers, and consider adding it as a negative example on the shipping topic to prevent misrouting.


The Iterative Improvement Workflow

Improving an agent isn’t a one-shot fix. It’s a cycle:

  1. Review — Analyze evaluation results, identify the lowest-scoring areas
  2. Identify — Classify the failure pattern (wrong topic, poor grounding, outdated knowledge, missing coverage)
  3. Fix — Apply the targeted remedy for that pattern
  4. Re-evaluate — Run the same test set again and compare to the previous snapshot
  5. Repeat — If scores improved but haven’t reached threshold, continue the cycle

The key discipline is changing only one thing at a time. If you update knowledge sources AND rewrite instructions AND change topic triggers simultaneously, you won’t know which change helped (or hurt).
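Step 4 is easiest when you can diff a run against the previous snapshot. A minimal sketch, assuming each snapshot is just a method-to-score mapping (Lena's cycle 1 numbers are used as sample data):

```python
def compare_snapshots(previous, current):
    """Return the per-method score delta between two evaluation runs."""
    return {m: round(current[m] - previous[m], 4) for m in previous}

baseline = {"accuracy": 0.68, "grounding": 0.82, "quality": 0.91}
after_fix = {"accuracy": 0.79, "grounding": 0.83, "quality": 0.91}

delta = compare_snapshots(baseline, after_fix)
# Because only the knowledge source changed between runs, the +0.11
# accuracy delta can be attributed to that single fix.
```

This is why the one-change-at-a-time discipline matters: the delta dictionary is only meaningful if exactly one variable moved between the two snapshots.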

Lena’s improvement cycle

Cycle 1: Lena updates the formulary document with current drug interaction data. Re-evaluation: accuracy jumps from 68% to 79%. Progress, but still below the 85% threshold.

Cycle 2: She examines the remaining failures. Five cases involve newly approved medications that aren’t in any uploaded document yet. She adds a supplementary knowledge file covering the new approvals. Re-evaluation: accuracy reaches 84%.

Cycle 3: One last cluster of failures — the agent provides generic interaction warnings instead of severity-specific ones. Lena adds an instruction: “When reporting drug interactions, always include the severity level (major, moderate, minor) from the formulary.” Re-evaluation: accuracy hits 88%. Threshold met.

Three cycles. Three targeted fixes. Each one addressing a specific failure pattern identified from the data.

Question

Why should you change only one thing at a time during the iterative improvement cycle?


Answer

Because if you change multiple things simultaneously (knowledge, instructions, topics), you won't know which change caused the improvement or regression. Single-variable changes let you attribute results to specific fixes and build reliable understanding of what works.


Failure Pattern      | Key Symptom                             | Root Cause                                    | Fix
Wrong topic routing  | Low topic matching on specific queries  | Insufficient or ambiguous trigger phrases     | Add/refine trigger phrases and negative examples
Hallucination        | Low grounding quality scores            | Missing knowledge or weak grounding instructions | Upload relevant data, strengthen grounding instructions
Outdated knowledge   | Low accuracy with good grounding        | Stale data in knowledge sources               | Update knowledge, set refresh schedule
Missing coverage     | Failures in a specific domain           | No knowledge source for that area             | Upload new knowledge or add graceful fallback topic

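The signature table lends itself to a simple triage helper. A sketch with illustrative cutoffs (the 0.7 thresholds and the `has_answer` flag are assumptions for the example, not official guidance):

```python
def diagnose(topic_match, grounding, accuracy, has_answer=True):
    """Map an evaluation-score signature to the most likely failure pattern.

    Thresholds are illustrative -- tune them to your own evaluation rubric.
    """
    if topic_match < 0.7:
        return "wrong topic routing"       # query landed on the wrong topic
    if grounding < 0.7:
        # Ungrounded but answering anyway = hallucination;
        # ungrounded and refusing = nothing to ground on.
        return "hallucination" if has_answer else "missing coverage"
    if accuracy < 0.7:
        return "outdated knowledge"        # grounding OK, facts wrong -> stale data
    return "no clear pattern"

# Lena's signature: routing fine, grounding acceptable, accuracy low.
print(diagnose(topic_match=0.9, grounding=0.82, accuracy=0.68))
```

Running it on Lena's numbers prints "outdated knowledge", matching the diagnosis she reached by hand.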
Knowledge Check

Lena's healthcare agent scores 82% on grounding quality but only 68% on accuracy for drug interaction queries. The knowledge source was last updated six months ago. What is the most likely failure pattern?

Knowledge Check

After updating knowledge sources, Lena re-runs her test set and accuracy improves from 68% to 79%. What should she do next?

Knowledge Check

Which combination of evaluation scores would MOST strongly suggest a hallucination problem?

Question

What are the five steps of the iterative improvement workflow?


Answer

1) Review — analyze evaluation results and find lowest-scoring areas. 2) Identify — classify the failure pattern. 3) Fix — apply the targeted remedy. 4) Re-evaluate — run the same test set and compare snapshots. 5) Repeat — continue until scores meet your threshold.


Key Takeaways

  • Aggregate scores show the big picture; per-case breakdowns reveal exactly what failed and why
  • Four common failure patterns: wrong topic routing, hallucination, outdated knowledge, and missing coverage
  • Each pattern has a distinct signature in evaluation scores — learn to read the pattern, not just the number
  • The iterative improvement cycle (review, identify, fix, re-evaluate, repeat) is the disciplined path to quality
  • Change one variable at a time to attribute improvements to specific fixes


© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.