Test Sets & Evaluation Methods
Create test sets for your agents, choose the right evaluation method, and systematically measure agent quality before shipping to production.
Why Test Sets Matter
🏢 AgentForge scenario: Priya’s team has built a recruitment agent for three clients. Before every release, QA lead Mira needs proof the agent handles real-world queries correctly. “We can’t ship and hope,” Mira says. “We need systematic evidence.”
That evidence comes from test sets — curated collections of test cases, each with an input (what a user might ask) and an expected outcome (how the agent should respond). Think of it as a standardized exam for your agent.
Without test sets, you’re relying on gut feeling. With them, you get repeatable, measurable quality scores every time you change an instruction, add knowledge, or update a topic.
Anatomy of a Test Case
Every test case has three parts:
- Input — The user message or question (e.g., “What’s the application deadline for the senior developer role?”)
- Expected output — The ideal response or key facts that must appear in the answer
- Context (optional) — Additional grounding data the agent should reference when answering
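In code, a test case might be modeled as a small record like the minimal sketch below. The `TestCase` class and its field names are illustrative, not a Copilot Studio API — they just mirror the three parts above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TestCase:
    """One test case: input, expected output, optional grounding context."""
    input: str                     # the user message or question
    expected: str                  # ideal response or key facts that must appear
    context: Optional[str] = None  # optional grounding data the agent should reference

# hypothetical example case; the expected answer is invented for illustration
case = TestCase(
    input="What's the application deadline for the senior developer role?",
    expected="Applications for the senior developer role close on 30 June.",
)
```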
When you run the test set, the system sends each input to your agent, captures the response, and scores it against the expected output using evaluation methods we’ll cover shortly.
Creating Test Sets
Copilot Studio gives you three ways to build test sets:
Manual creation
Write test cases by hand. Best for edge cases, known failure scenarios, and domain-specific queries that only a subject matter expert would know to test.
Mira starts here — she adds five tricky recruitment questions that previously confused the agent during client demos.
Auto-generate from knowledge and instructions
Copilot Studio can analyze your agent’s knowledge sources and system instructions to automatically generate test cases. The system reads your uploaded documents and creates question-answer pairs based on the content.
This is fast — you can generate dozens of cases in minutes — but always review them. Auto-generated cases sometimes miss nuance or create unrealistic scenarios that real users would never ask.
Import from conversation logs
If your agent has been running in production or UAT, you can import real user conversations as test cases. This gives you the most realistic coverage because these are actual questions people asked.
Mira pulls the last 200 conversations from the staging environment, filters for the 40 most diverse queries, and imports them as a test set baseline.
Evaluation Methods
Once you have a test set, you need to decide how to score the results. Copilot Studio offers several evaluation methods, each measuring a different dimension of quality.
| Method | What It Measures | When to Use |
|---|---|---|
| Accuracy | Does the response contain the correct factual answer? | Knowledge-heavy agents where getting the right answer matters most |
| Grounding quality | Is the response grounded in provided knowledge sources rather than fabricated? | Agents with uploaded documents — ensures answers come from your data, not hallucination |
| Topic matching | Did the agent route to the correct topic for the given input? | Multi-topic agents where misrouting causes completely wrong responses |
| Response quality | Is the response helpful, clear, and well-structured overall? | Customer-facing agents where tone, clarity, and professionalism matter |
Combining methods
In practice, you use multiple evaluation methods together. Mira configures AgentForge’s recruitment agent tests with accuracy (are job details correct?), grounding quality (are answers from the client’s job postings, not made up?), and topic matching (does a benefits question go to the benefits topic, not the application topic?).
Each method produces a score. Together they give you a multi-dimensional picture of agent health — like a medical checkup that tests blood pressure, heart rate, and cholesterol, not just one number.
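The "one score per method" idea can be sketched as a simple aggregation. The method names, score ranges, and per-case rows below are assumptions for illustration, not the product's actual result schema.

```python
def aggregate(case_results, methods):
    """Average each method's score across all test cases (illustrative only)."""
    return {m: sum(r[m] for r in case_results) / len(case_results) for m in methods}

# two hypothetical per-case score rows, one dict per test case
case_results = [
    {"accuracy": 1.0, "grounding": 0.9, "topic_match": 1.0},
    {"accuracy": 0.5, "grounding": 0.8, "topic_match": 1.0},
]
report = aggregate(case_results, ["accuracy", "grounding", "topic_match"])
# report["accuracy"] == 0.75 and report["topic_match"] == 1.0 —
# each method keeps its own dimension instead of collapsing to one number
```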
Running an Evaluation
The workflow is straightforward:
1. Select your test set — choose which collection of test cases to run
2. Choose evaluation methods — pick one or more scoring dimensions
3. Run the evaluation — the system sends each input to the agent and captures responses
4. Review aggregate results — see overall pass rates, average scores, and per-method breakdowns
5. Drill into failures — examine individual test cases that scored below threshold
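Conceptually, the run, score, aggregate, and drill-down steps above boil down to a loop like this hypothetical sketch — `agent`, `keyword_score`, and the 0.7 threshold are all stand-ins, not real platform calls.

```python
def run_evaluation(agent, test_set, score_fn, threshold=0.7):
    """Send each input to the agent, score the response, and collect failures."""
    scores, failures = [], []
    for case in test_set:
        response = agent(case["input"])               # run the case
        score = score_fn(response, case["expected"])  # score against expected output
        scores.append(score)
        if score < threshold:                         # below-threshold cases for drill-down
            failures.append((case["input"], response, score))
    return sum(scores) / len(scores), failures        # aggregate result + failure list

# crude keyword-overlap scorer: fraction of expected words found in the response
def keyword_score(response, expected):
    words = expected.lower().split()
    return sum(w in response.lower() for w in words) / len(words)

# stand-in agent with one canned answer
agent = lambda q: "The deadline is 30 June."
avg, failures = run_evaluation(agent, [{"input": "Deadline?", "expected": "30 June"}], keyword_score)
```

Keyword overlap is a deliberately naive scorer; real evaluation methods are far more sophisticated, but the control flow is the same.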
Each evaluation run creates a snapshot. This means you can compare results over time — did your latest instruction change improve accuracy or make it worse? Snapshots are your before-and-after evidence.
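Comparing two snapshots can be as simple as diffing per-method scores. A sketch, with scores and method names invented for illustration:

```python
def compare_snapshots(before, after):
    """Per-method score deltas between two evaluation runs."""
    return {m: round(after[m] - before[m], 3) for m in before}

before = {"accuracy": 0.82, "grounding": 0.90}  # snapshot before the instruction change
after  = {"accuracy": 0.88, "grounding": 0.87}  # snapshot after the change
delta = compare_snapshots(before, after)
# accuracy improved (+0.06) while grounding slipped (-0.03) —
# exactly the trade-off a single overall score would hide
```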
Mira’s Test Strategy
🏢 Back at AgentForge, Mira builds a layered testing strategy for the recruitment agent:
- Baseline test set (auto-generated, 50 cases) — covers general knowledge from job postings across all three clients
- Edge case test set (manual, 15 cases) — tricky questions about visa sponsorship, remote work policies, and salary range inquiries
- Regression test set (imported from logs, 40 cases) — real questions that previously caused issues in staging
She runs all three after every change to the agent’s instructions or knowledge sources. The baseline catches broad regressions. The edge cases catch specific known pitfalls. The regression set ensures old bugs don’t return.
“Think of it like a safety net with three layers,” Mira tells Priya. “If something slips through one net, the others catch it.”
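Mira's layered suite could be driven by a loop like the following sketch — the toy `agent`, the suite structure, and the 0.8 pass-rate threshold are hypothetical.

```python
def run_layered_suite(agent, suites, threshold=0.8):
    """Run every test set; report any suite whose pass rate falls below threshold."""
    flagged = []
    for name, cases in suites.items():
        passed = sum(1 for c in cases if agent(c["input"]) == c["expected"])
        rate = passed / len(cases)
        if rate < threshold:
            flagged.append((name, rate))
    return flagged

# toy agent with canned answers, standing in for the real agent endpoint
answers = {"Q1": "A1", "Q2": "A2"}
agent = lambda q: answers.get(q, "")
suites = {
    "baseline":   [{"input": "Q1", "expected": "A1"}],
    "edge_cases": [{"input": "Q2", "expected": "a different answer"}],
}
print(run_layered_suite(agent, suites))  # [('edge_cases', 0.0)]
```

Flagging whole suites, not just individual cases, is what makes the safety-net metaphor work: each layer reports independently.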
This layered approach is exactly what the exam expects you to understand — test sets are not one-and-done. They’re living artifacts that grow alongside your agent.
Check Your Understanding
- Mira wants to ensure the recruitment agent's answers come from uploaded job postings, not fabricated content. Which evaluation method should she prioritize?
- Which test set creation method gives the MOST realistic coverage of actual user behavior?
- After changing an agent's system instructions, what is the BEST practice before deploying to production?
Key Takeaways
- Test sets are collections of input-expected output pairs that let you systematically evaluate agent quality
- Create test sets three ways: manually (edge cases), auto-generate (broad coverage), or import from logs (real usage)
- Evaluation methods each measure a different dimension: accuracy, grounding quality, topic matching, and response quality
- Every evaluation run creates a snapshot, enabling before-and-after comparison when you change your agent
- Layer multiple test sets for comprehensive coverage — baseline, edge case, and regression sets together form a robust safety net
🎬 Video coming soon
Test Sets & Evaluation Methods — Walkthrough