Test Sets & Evaluation Methods
Create test sets for your agents, choose the right evaluation method, and systematically measure agent quality before shipping to production.
Why Test Sets Matter
🏢 AgentForge scenario: Priya’s team has built a recruitment agent for three clients. Before every release, QA lead Mira needs proof the agent handles real-world queries correctly. “We can’t ship and hope,” Mira says. “We need systematic evidence.”
That evidence comes from test sets — curated collections of test cases, each with an input (what a user might ask) and an expected outcome (how the agent should respond). Think of it as a standardized exam for your agent.
Without test sets, you’re relying on gut feeling. With them, you get repeatable, measurable quality scores every time you change an instruction, add knowledge, or update a topic.
Anatomy of a Test Case
Every test case has three parts:
- Input — The user message or question (e.g., “What’s the application deadline for the senior developer role?”)
- Expected output — The ideal response or key facts that must appear in the answer
- Context (optional) — Additional grounding data the agent should reference when answering
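In code, a test case might be modeled as a small record like the minimal sketch below. The `TestCase` class and its field names are illustrative, not a Copilot Studio API — they just mirror the three parts above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TestCase:
    """One test case: input, expected output, optional grounding context."""
    input: str                     # the user message or question
    expected: str                  # ideal response or key facts that must appear
    context: Optional[str] = None  # optional grounding data the agent should reference

# hypothetical example case; the expected answer is invented for illustration
case = TestCase(
    input="What's the application deadline for the senior developer role?",
    expected="Applications for the senior developer role close on 30 June.",
)
```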
When you run the test set, the system sends each input to your agent, captures the response, and scores it against the expected output using evaluation methods we’ll cover shortly.
Creating Test Sets
Copilot Studio gives you three ways to build test sets:
Manual creation
Write test cases by hand. Best for edge cases, known failure scenarios, and domain-specific queries that only a subject matter expert would know to test.
Mira starts here — she adds five tricky recruitment questions that previously confused the agent during client demos.
Auto-generate from knowledge and instructions
Copilot Studio can analyze your agent’s knowledge sources and system instructions to automatically generate test cases. The system reads your uploaded documents and creates question-answer pairs based on the content.
This is fast — you can generate dozens of cases in minutes — but always review them. Auto-generated cases sometimes miss nuance or create unrealistic scenarios that real users would never ask.
Import from conversation logs
If your agent has been running in production or UAT, you can import real user conversations as test cases. This gives you the most realistic coverage because these are actual questions people asked.
Mira pulls the last 200 conversations from the staging environment, filters for the 40 most diverse queries, and imports them as a test set baseline.
Evaluation Methods
Once you have a test set, you need to decide how to score the results. Copilot Studio offers several evaluation methods, each measuring a different dimension of quality.
| Method | What It Measures | When to Use |
|---|---|---|
| Accuracy | Does the response contain the correct factual answer? | Knowledge-heavy agents where getting the right answer matters most |
| Grounding quality | Is the response grounded in provided knowledge sources rather than fabricated? | Agents with uploaded documents — ensures answers come from your data, not hallucination |
| Topic matching | Did the agent route to the correct topic for the given input? | Multi-topic agents where misrouting causes completely wrong responses |
| Response quality | Is the response helpful, clear, and well-structured overall? | Customer-facing agents where tone, clarity, and professionalism matter |
Combining methods
In practice, you use multiple evaluation methods together. Mira configures AgentForge’s recruitment agent tests with accuracy (are job details correct?), grounding quality (are answers from the client’s job postings, not made up?), and topic matching (does a benefits question go to the benefits topic, not the application topic?).
Each method produces a score. Together they give you a multi-dimensional picture of agent health — like a medical checkup that tests blood pressure, heart rate, and cholesterol, not just one number.
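The "one score per method" idea can be sketched as a simple aggregation. The method names, score ranges, and per-case rows below are assumptions for illustration, not the product's actual result schema.

```python
def aggregate(case_results, methods):
    """Average each method's score across all test cases (illustrative only)."""
    return {m: sum(r[m] for r in case_results) / len(case_results) for m in methods}

# two hypothetical per-case score rows, one dict per test case
case_results = [
    {"accuracy": 1.0, "grounding": 0.9, "topic_match": 1.0},
    {"accuracy": 0.5, "grounding": 0.8, "topic_match": 1.0},
]
report = aggregate(case_results, ["accuracy", "grounding", "topic_match"])
# report["accuracy"] == 0.75 and report["topic_match"] == 1.0 —
# each method keeps its own dimension instead of collapsing to one number
```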
Running an Evaluation
The workflow is straightforward:
1. Select your test set — choose which collection of test cases to run
2. Choose evaluation methods — pick one or more scoring dimensions
3. Run the evaluation — the system sends each input to the agent and captures responses
4. Review aggregate results — see overall pass rates, average scores, and per-method breakdowns
5. Drill into failures — examine individual test cases that scored below threshold
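Conceptually, the run, score, aggregate, and drill-down steps above boil down to a loop like this hypothetical sketch — `agent`, `keyword_score`, and the 0.7 threshold are all stand-ins, not real platform calls.

```python
def run_evaluation(agent, test_set, score_fn, threshold=0.7):
    """Send each input to the agent, score the response, and collect failures."""
    scores, failures = [], []
    for case in test_set:
        response = agent(case["input"])               # run the case
        score = score_fn(response, case["expected"])  # score against expected output
        scores.append(score)
        if score < threshold:                         # below-threshold cases for drill-down
            failures.append((case["input"], response, score))
    return sum(scores) / len(scores), failures        # aggregate result + failure list

# crude keyword-overlap scorer: fraction of expected words found in the response
def keyword_score(response, expected):
    words = expected.lower().split()
    return sum(w in response.lower() for w in words) / len(words)

# stand-in agent with one canned answer
agent = lambda q: "The deadline is 30 June."
avg, failures = run_evaluation(agent, [{"input": "Deadline?", "expected": "30 June"}], keyword_score)
```

Keyword overlap is a deliberately naive scorer; real evaluation methods are far more sophisticated, but the control flow is the same.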
Each evaluation run creates a snapshot. This means you can compare results over time — did your latest instruction change improve accuracy or make it worse? Snapshots are your before-and-after evidence.
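Comparing two snapshots can be as simple as diffing per-method scores. A sketch, with scores and method names invented for illustration:

```python
def compare_snapshots(before, after):
    """Per-method score deltas between two evaluation runs."""
    return {m: round(after[m] - before[m], 3) for m in before}

before = {"accuracy": 0.82, "grounding": 0.90}  # snapshot before the instruction change
after  = {"accuracy": 0.88, "grounding": 0.87}  # snapshot after the change
delta = compare_snapshots(before, after)
# accuracy improved (+0.06) while grounding slipped (-0.03) —
# exactly the trade-off a single overall score would hide
```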
Mira’s Test Strategy
🏢 Back at AgentForge, Mira builds a layered testing strategy for the recruitment agent:
- Baseline test set (auto-generated, 50 cases) — covers general knowledge from job postings across all three clients
- Edge case test set (manual, 15 cases) — tricky questions about visa sponsorship, remote work policies, and salary range inquiries
- Regression test set (imported from logs, 40 cases) — real questions that previously caused issues in staging
She runs all three after every change to the agent’s instructions or knowledge sources. The baseline catches broad regressions. The edge cases catch specific known pitfalls. The regression set ensures old bugs don’t return.
“Think of it like a safety net with three layers,” Mira tells Priya. “If something slips through one net, the others catch it.”
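Mira's layered suite could be driven by a loop like the following sketch — the toy `agent`, the suite structure, and the 0.8 pass-rate threshold are hypothetical.

```python
def run_layered_suite(agent, suites, threshold=0.8):
    """Run every test set; report any suite whose pass rate falls below threshold."""
    flagged = []
    for name, cases in suites.items():
        passed = sum(1 for c in cases if agent(c["input"]) == c["expected"])
        rate = passed / len(cases)
        if rate < threshold:
            flagged.append((name, rate))
    return flagged

# toy agent with canned answers, standing in for the real agent endpoint
answers = {"Q1": "A1", "Q2": "A2"}
agent = lambda q: answers.get(q, "")
suites = {
    "baseline":   [{"input": "Q1", "expected": "A1"}],
    "edge_cases": [{"input": "Q2", "expected": "a different answer"}],
}
print(run_layered_suite(agent, suites))  # [('edge_cases', 0.0)]
```

Flagging whole suites, not just individual cases, is what makes the safety-net metaphor work: each layer reports independently.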
This layered approach is exactly what the exam expects you to understand — test sets are not one-and-done. They’re living artifacts that grow alongside your agent.
Check Your Understanding
- Mira wants to ensure the recruitment agent's answers come from uploaded job postings, not fabricated content. Which evaluation method should she prioritize?
- Which test set creation method gives the MOST realistic coverage of actual user behavior?
- After changing an agent's system instructions, what is the BEST practice before deploying to production?
Key Takeaways
- Test sets are collections of input-expected output pairs that let you systematically evaluate agent quality
- Create test sets three ways: manually (edge cases), auto-generate (broad coverage), or import from logs (real usage)
- Evaluation methods each measure a different dimension: accuracy, grounding quality, topic matching, and response quality
- Every evaluation run creates a snapshot, enabling before-and-after comparison when you change your agent
- Layer multiple test sets for comprehensive coverage — baseline, edge case, and regression sets together form a robust safety net
🎬 Video coming soon
Test Sets & Evaluation Methods — Walkthrough