
AB-620 Study Guide

Domain 1: Plan and Configure Agent Solutions

  • Getting Started: Copilot Studio for Developers Free
  • Planning Enterprise Integration and Reusable Components Free
  • Identity Strategy for Agents Free
  • Channels, Deployment and Audience Design Free
  • Responsible AI and Security Governance Free
  • Agent Flows: Build, Monitor and Handle Errors Free
  • Human-in-the-Loop Agent Flows Free
  • Topics, Tools and Variables Free
  • Advanced Responses: Custom Prompts and Generative Answers Free
  • API Calls, HTTP Requests and Adaptive Cards Free

Domain 2: Integrate and Extend Agents in Copilot Studio

  • Enterprise Knowledge Sources: The Big Picture
  • Copilot Connectors and Power Platform Connectors
  • Azure AI Search as a Knowledge Source
  • Adding Tools: Custom Connectors and REST APIs
  • MCP Tools: Model Context Protocol in Action
  • Computer Use: Agent-Driven UI Automation
  • Multi-Agent Solutions: Design and Agent Reuse
  • Integrating Foundry Agents
  • Fabric Data Agents: Analytics Meets AI
  • A2A Protocol: Cross-Platform Agent Collaboration
  • Grounded Answers: Azure AI Search with Foundry
  • Foundry Model Catalog and Application Insights

Domain 3: Test and Manage Agents

  • Test Sets & Evaluation Methods
  • Reviewing Results & Tuning Performance
  • Solutions & Environment Variables
  • Power Platform Pipelines for Agent ALM
  • Agent Lifecycle: From Dev to Production
  • Exam Prep: Diagnostic Review

Domain 3: Test and Manage Agents

Test Sets & Evaluation Methods

Create test sets for your agents, choose the right evaluation method, and systematically measure agent quality before shipping to production.

☕ Simple explanation

Why Test Sets Matter

🏢 AgentForge scenario: Priya’s team has built a recruitment agent for three clients. Before every release, QA lead Mira needs proof the agent handles real-world queries correctly. “We can’t ship and hope,” Mira says. “We need systematic evidence.”

That evidence comes from test sets — curated collections of test cases, each with an input (what a user might ask) and an expected outcome (how the agent should respond). Think of it as a standardized exam for your agent.

Without test sets, you’re relying on gut feeling. With them, you get repeatable, measurable quality scores every time you change an instruction, add knowledge, or update a topic.

Anatomy of a Test Case

Every test case has three parts:

  1. Input — The user message or question (e.g., “What’s the application deadline for the senior developer role?”)
  2. Expected output — The ideal response or key facts that must appear in the answer
  3. Context (optional) — Additional grounding data the agent should reference when answering

When you run the test set, the system sends each input to your agent, captures the response, and scores it against the expected output using evaluation methods we’ll cover shortly.
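The three-part anatomy and the scoring loop can be modeled as a simple record plus a scorer. This is an illustrative Python sketch, not a Copilot Studio API: the `TestCase` type and the keyword-overlap scorer are hypothetical stand-ins for the platform's built-in evaluation methods.

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    """One entry in a test set: input, expected output, optional context."""
    input: str        # what a user might ask
    expected: str     # key facts the answer must contain
    context: str = "" # optional grounding data the agent should reference

def keyword_score(response: str, expected: str) -> float:
    """Naive stand-in for a real evaluator: the fraction of expected
    keywords that appear somewhere in the agent's response."""
    keywords = expected.lower().split()
    hits = sum(1 for k in keywords if k in response.lower())
    return hits / len(keywords) if keywords else 0.0

case = TestCase(
    input="What's the application deadline for the senior developer role?",
    expected="March 31 deadline",
)
print(keyword_score("The deadline for that role is March 31.", case.expected))  # → 1.0
```

A real evaluator is far more sophisticated (semantic matching, not substring checks), but the shape is the same: send the input, capture the response, score it against the expectation.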

Creating Test Sets

Copilot Studio gives you three ways to build test sets:

Manual creation

Write test cases by hand. Best for edge cases, known failure scenarios, and domain-specific queries that only a subject matter expert would know to test.

Mira starts here — she adds five tricky recruitment questions that previously confused the agent during client demos.

Auto-generate from knowledge and instructions

Copilot Studio can analyze your agent’s knowledge sources and system instructions to automatically generate test cases. The system reads your uploaded documents and creates question-answer pairs based on the content.

This is fast — you can generate dozens of cases in minutes — but always review them. Auto-generated cases sometimes miss nuance or create unrealistic scenarios that real users would never ask.

Import from conversation logs

If your agent has been running in production or UAT, you can import real user conversations as test cases. This gives you the most realistic coverage because these are actual questions people asked.

Mira pulls the last 200 conversations from the staging environment, filters for the 40 most diverse queries, and imports them as a test set baseline.
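Mira's "filter for the most diverse queries" step can be sketched as a greedy diversity filter: keep a query only if it is not too similar to one already kept. Everything here is a hypothetical illustration (the import tooling handles selection for you); the word-level Jaccard similarity and the 0.6 threshold are assumptions.

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two queries."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def diverse_subset(queries, limit=40, threshold=0.6):
    """Greedily keep queries that differ enough from those already kept."""
    kept = []
    for q in queries:
        if all(jaccard(q, k) < threshold for k in kept):
            kept.append(q)
        if len(kept) == limit:
            break
    return kept

logs = [
    "what is the salary range",
    "what is the salary range?",   # near-duplicate, gets filtered out
    "do you sponsor visas",
    "is remote work allowed",
]
print(diverse_subset(logs))
# → ['what is the salary range', 'do you sponsor visas', 'is remote work allowed']
```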

Question

What are the three ways to create test sets in Copilot Studio?


Answer

1) Manual creation — hand-write test cases for edge cases and domain-specific scenarios. 2) Auto-generate — system creates cases from your knowledge sources and instructions. 3) Import from logs — pull real conversations from production or UAT for the most realistic coverage.


Evaluation Methods

Once you have a test set, you need to decide how to score the results. Copilot Studio offers several evaluation methods, each measuring a different dimension of quality.

  • Accuracy: does the response contain the correct factual answer? Use for knowledge-heavy agents where getting the right answer matters most.
  • Grounding quality: is the response grounded in provided knowledge sources rather than fabricated? Use for agents with uploaded documents, to ensure answers come from your data, not hallucination.
  • Topic matching: did the agent route to the correct topic for the given input? Use for multi-topic agents where misrouting causes completely wrong responses.
  • Response quality: is the response helpful, clear, and well-structured overall? Use for customer-facing agents where tone, clarity, and professionalism matter.

Combining methods

In practice, you use multiple evaluation methods together. Mira configures AgentForge’s recruitment agent tests with accuracy (are job details correct?), grounding quality (are answers from the client’s job postings, not made up?), and topic matching (does a benefits question go to the benefits topic, not the application topic?).

Each method produces a score. Together they give you a multi-dimensional picture of agent health — like a medical checkup that tests blood pressure, heart rate, and cholesterol, not just one number.
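The multi-method "checkup" can be sketched as a gate over several scores. The method names mirror the list above, but the thresholds and the `health_report` helper are invented for illustration, not a real Copilot Studio API.

```python
# Hypothetical per-method thresholds; real evaluators are built into
# the platform. These names just mirror the evaluation methods above.
THRESHOLDS = {"accuracy": 0.8, "grounding": 0.9, "topic_match": 1.0}

def health_report(scores: dict) -> dict:
    """Compare each dimension against its threshold, like a checkup
    that reports several vitals instead of one number."""
    passed = {m: scores[m] >= t for m, t in THRESHOLDS.items()}
    return {"passed": all(passed.values()), "by_method": passed}

print(health_report({"accuracy": 0.85, "grounding": 0.95, "topic_match": 1.0}))
print(health_report({"accuracy": 0.85, "grounding": 0.70, "topic_match": 1.0}))
```

The point of the sketch: one strong score can hide a weak one, so the overall verdict only passes when every dimension clears its bar.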

Question

What does 'grounding quality' measure in agent evaluation?


Answer

Grounding quality measures whether the agent's response is derived from the provided knowledge sources rather than fabricated (hallucinated) content. A high grounding score means answers come from your actual data, not the model's imagination.


Question

When would you prioritize 'topic matching' evaluation over other methods?


Answer

When your agent has multiple topics and misrouting causes incorrect responses. Topic matching ensures the agent directs each query to the correct topic handler — critical in multi-domain agents where a wrong route means a completely wrong answer.


Running an Evaluation

The workflow is straightforward:

  1. Select your test set — choose which collection of test cases to run
  2. Choose evaluation methods — pick one or more scoring dimensions
  3. Run the evaluation — the system sends each input to the agent and captures responses
  4. Review aggregate results — see overall pass rates, average scores, and per-method breakdowns
  5. Drill into failures — examine individual test cases that scored below threshold

Each evaluation run creates a snapshot. This means you can compare results over time — did your latest instruction change improve accuracy or make it worse? Snapshots are your before-and-after evidence.
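The before-and-after comparison can be sketched in a few lines. The `Snapshot` type and its fields are invented for illustration; Copilot Studio stores evaluation results for you.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Snapshot:
    """Aggregate result of one evaluation run (illustrative only)."""
    run_id: str
    pass_rate: float      # fraction of test cases above threshold
    avg_accuracy: float

def compare(before: Snapshot, after: Snapshot) -> str:
    """Before-and-after evidence: did the latest change help or hurt?"""
    delta = after.pass_rate - before.pass_rate
    if delta > 0:
        return f"improved by {delta:.0%}"
    if delta < 0:
        return f"regressed by {-delta:.0%}"
    return "unchanged"

v1 = Snapshot("run-001", pass_rate=0.80, avg_accuracy=0.78)
v2 = Snapshot("run-002", pass_rate=0.90, avg_accuracy=0.84)
print(compare(v1, v2))  # → improved by 10%
```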

Question

Why does each evaluation run create a snapshot?


Answer

Snapshots enable comparison over time. When you change instructions, knowledge, or topics, you can re-run the same test set and compare scores to see if quality improved or regressed. Without snapshots, you have no baseline for comparison.


Mira’s Test Strategy

🏢 Back at AgentForge, Mira builds a layered testing strategy for the recruitment agent:

  • Baseline test set (auto-generated, 50 cases) — covers general knowledge from job postings across all three clients
  • Edge case test set (manual, 15 cases) — tricky questions about visa sponsorship, remote work policies, and salary range inquiries
  • Regression test set (imported from logs, 40 cases) — real questions that previously caused issues in staging

She runs all three after every change to the agent’s instructions or knowledge sources. The baseline catches broad regressions. The edge cases catch specific known pitfalls. The regression set ensures old bugs don’t return.

“Think of it like a safety net with three layers,” Mira tells Priya. “If something slips through one net, the others catch it.”

This layered approach is exactly what the exam expects you to understand — test sets are not one-and-done. They’re living artifacts that grow alongside your agent.
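Mira's run-everything-after-every-change habit amounts to a release gate over all three layers. A minimal sketch, assuming invented set names and a 95% bar:

```python
# Hypothetical CI-style gate: ship only if every layer of the safety
# net passes. The set names and pass rates are invented for illustration.
def release_gate(results: dict, minimum: float = 0.95) -> bool:
    """All test-set layers must clear the bar before deploying."""
    return all(rate >= minimum for rate in results.values())

run = {"baseline": 0.98, "edge_cases": 0.93, "regression": 1.00}
print(release_gate(run))  # → False: edge cases fell below 95%
```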

Question

What are the three layers in Mira's test strategy, and why does each layer exist?


Answer

1) Baseline test set (auto-generated) — broad coverage of general knowledge. 2) Edge case test set (manual) — catches specific known pitfalls that auto-generation misses. 3) Regression test set (from logs) — prevents old bugs from returning. Together they form a multi-layer safety net.


Knowledge Check

Mira wants to ensure the recruitment agent's answers come from uploaded job postings, not fabricated content. Which evaluation method should she prioritize?

Knowledge Check

Which test set creation method gives the MOST realistic coverage of actual user behavior?

Knowledge Check

After changing an agent's system instructions, what is the BEST practice before deploying to production?

Key Takeaways

  • Test sets are collections of input-expected output pairs that let you systematically evaluate agent quality
  • Create test sets three ways: manually (edge cases), auto-generate (broad coverage), or import from logs (real usage)
  • Evaluation methods each measure a different dimension: accuracy, grounding quality, topic matching, and response quality
  • Every evaluation run creates a snapshot, enabling before-and-after comparison when you change your agent
  • Layer multiple test sets for comprehensive coverage — baseline, edge case, and regression sets together form a robust safety net

🎬 Video coming soon

Test Sets & Evaluation Methods — Walkthrough



© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.