Telemetry Interpretation and Agent Tuning
Interpret telemetry data, analyse user feedback backlogs, and apply AI-based tools to identify issues and tune agent performance.
Imagine you own a restaurant. You’ve got security cameras, receipt data, and comment cards. The cameras show you what’s happening — long queues, empty tables, confused waiters. The receipts show you what’s selling. The comment cards tell you how people feel.
But raw footage and receipts don’t fix anything. You need to interpret them. Why are queues forming at 6 PM? Because the kitchen is slow on pasta orders. Now you can fix it — retrain the pasta chef, simplify the menu, or add a second station.
Telemetry is your agent’s security camera, receipts, and comment cards. Tuning is the fix you apply after you understand the problem.
The Scenario
🤖 Jordan Reeves reviews the monitoring dashboards Sam built last sprint. The patient scheduling agent has a 72 percent resolution rate — below the 75 percent target. Worse, Jordan notices a pattern: conversations containing the word “reschedule” fail 30 percent of the time, compared to 8 percent for new appointments.
Jordan needs to dig into telemetry, figure out why, and tune the agent.
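Jordan's pattern can be found programmatically. The sketch below computes failure rates for conversations containing a keyword versus the rest; the record fields (`transcript`, `resolved`) are illustrative assumptions, not the actual Copilot Studio Analytics export schema.

```python
# Hypothetical exported conversation log records. Field names are assumptions,
# not the real Copilot Studio Analytics schema.
conversations = [
    {"transcript": "I need to reschedule my appointment", "resolved": False},
    {"transcript": "Book me a new appointment on Friday", "resolved": True},
    {"transcript": "Can I reschedule to next week?", "resolved": True},
    {"transcript": "New appointment for a checkup please", "resolved": True},
]

def failure_rate_by_keyword(logs, keyword):
    """Compare failure rates for conversations with and without `keyword`."""
    with_kw = [c for c in logs if keyword in c["transcript"].lower()]
    without = [c for c in logs if keyword not in c["transcript"].lower()]

    def rate(group):
        return sum(1 for c in group if not c["resolved"]) / len(group) if group else 0.0

    return rate(with_kw), rate(without)

kw_rate, other_rate = failure_rate_by_keyword(conversations, "reschedule")
print(f"'reschedule' failure rate: {kw_rate:.0%}, others: {other_rate:.0%}")
```

Running this over a real log export is how a gap like "30 percent versus 8 percent" surfaces before anyone reads a single transcript.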
Telemetry Data Types
Every agent generates multiple telemetry streams. Understanding what each one tells you is the foundation of effective tuning:
| Data Type | What It Contains | What It Reveals | Where to Find It |
|---|---|---|---|
| Conversation logs | Full transcript of user-agent exchanges | Exact failure points, misunderstood intents, topic gaps | Copilot Studio Analytics |
| Completion metrics | Token usage, response confidence scores, topic match scores | Whether the agent is confident in its responses or guessing | Application Insights custom events |
| Latency traces | End-to-end timing for each conversation turn | Bottlenecks in knowledge retrieval, API calls, or model inference | Application Insights dependency tracking |
| Error logs | System errors, connector failures, timeout events | Infrastructure issues vs logic issues | Application Insights exceptions |
| User satisfaction scores | CSAT ratings, thumbs up/down, free-text feedback | How users feel about the experience, independent of resolution | Copilot Studio Analytics and custom surveys |
The Four Tuning Levers
When telemetry reveals a problem, you need to pick the right fix. Using the wrong lever wastes time and may not solve the issue:
| Aspect | Prompt Tuning | Knowledge Source Tuning | Model Fine-Tuning | Flow Redesign |
|---|---|---|---|---|
| What You Change | System instructions, few-shot examples, guardrails | Documents, data sources, grounding content | The underlying model with custom training data | Conversation flow structure, branching logic, escalation rules |
| When to Use | Agent misunderstands intent or gives wrong tone | Agent lacks information or cites outdated data | Agent consistently fails on domain-specific language | Agent follows the wrong path or loops in conversation |
| Effort Level | Low — minutes to hours | Low to Medium — update and re-index | High — requires labelled data and compute | Medium — requires flow testing after changes |
| Risk Level | Low — easy to revert | Low — content swap | Medium — may affect other scenarios | Medium — may break existing paths |
| Example Fix | Add instruction: treat reschedule as a modification, not a cancellation | Add rescheduling policy document to knowledge base | Fine-tune on 5,000 healthcare scheduling conversations | Add explicit reschedule branch before the general booking flow |
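The symptom-to-lever mapping in the table can be captured as a small lookup, which is handy in triage tooling. The symptom labels below are illustrative, not an official taxonomy.

```python
# Minimal sketch of the lever-selection logic from the table above.
TUNING_LEVERS = {
    "misunderstood_intent": "Prompt tuning",
    "wrong_tone": "Prompt tuning",
    "missing_information": "Knowledge source tuning",
    "outdated_information": "Knowledge source tuning",
    "domain_language_failure": "Model fine-tuning",
    "wrong_conversation_path": "Flow redesign",
}

def pick_lever(symptom: str) -> str:
    # Default to more analysis rather than guessing a fix.
    return TUNING_LEVERS.get(symptom, "Analyse telemetry further")

print(pick_lever("outdated_information"))  # Knowledge source tuning
```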
Feedback Backlog Analysis
User feedback is gold — but only if you mine it systematically. Here’s how Jordan processes the feedback backlog:
Step 1: Collect and Centralise
Pull feedback from all sources into one place: CSAT scores from Copilot Studio, thumbs-down transcripts, support tickets that mention the agent, and direct user comments. Jordan uses a Power Automate flow to pipe all of these into a Dataverse table.
Step 2: Categorise by Issue Type
Jordan tags each piece of feedback:
- Intent misunderstanding — the agent didn’t understand what the user wanted
- Incorrect information — the agent gave a wrong answer
- Slow response — the agent took too long
- Tone or style — the response was technically correct but felt robotic or unhelpful
- Missing capability — the user wanted something the agent can’t do yet
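A first-pass version of this tagging can be rule-based before any AI classifier is involved. The keyword patterns below are hypothetical examples, not a production rule set.

```python
import re

# Hypothetical keyword rules for first-pass feedback triage.
CATEGORY_RULES = {
    "intent_misunderstanding": r"didn't understand|misunderstood|wrong thing",
    "incorrect_information": r"wrong answer|incorrect|outdated",
    "slow_response": r"slow|took forever|waiting",
    "tone_or_style": r"robotic|rude|unfriendly",
    "missing_capability": r"can't do|doesn't support|wish it could",
}

def categorise(comment: str) -> str:
    for category, pattern in CATEGORY_RULES.items():
        if re.search(pattern, comment, re.IGNORECASE):
            return category
    return "uncategorised"

print(categorise("The bot gave me the wrong answer about clinic hours"))
```

Anything that lands in `uncategorised` goes to manual review, which is also where new rules come from.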
Step 3: Identify Patterns
Sorting by category, Jordan sees that 45 percent of negative feedback falls under “intent misunderstanding,” and 60 percent of those involve rescheduling. The pattern is clear.
Step 4: Prioritise Fixes
Not every issue is worth fixing immediately. Jordan uses a simple impact matrix: frequency of the issue multiplied by severity. Rescheduling failures are high-frequency and high-severity (patients miss appointments), so they go to the top of the list.
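Jordan's impact matrix reduces to a one-line score per issue. The frequencies and severity weights below are made-up numbers for illustration.

```python
# Priority = frequency x severity, per Jordan's impact matrix.
# All figures below are illustrative.
issues = [
    {"name": "Rescheduling fails", "frequency": 120, "severity": 5},
    {"name": "Robotic tone", "frequency": 80, "severity": 2},
    {"name": "Slow knowledge lookups", "frequency": 40, "severity": 3},
]

for issue in issues:
    issue["priority"] = issue["frequency"] * issue["severity"]

ranked = sorted(issues, key=lambda i: i["priority"], reverse=True)
for issue in ranked:
    print(f"{issue['priority']:>4}  {issue['name']}")
```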
Exam Tip: The exam tests your ability to describe the process for using telemetry, not just name the tools. A common question pattern: “A solution architect notices that agent accuracy has declined. What should they do FIRST?” The answer is almost always “analyse telemetry to identify the root cause” — not “retrain the model” or “rewrite the prompt.”
AI-Based Analysis Tools
You don’t have to do all the analysis manually. Several AI-powered tools can accelerate the process:
Azure AI Foundry Evaluation — Run evaluation pipelines against conversation logs. Define metrics like groundedness (did the response stick to source material?), relevance (did it answer the question asked?), and coherence (did it make sense?). Foundry scores each conversation and flags outliers.
Automated Regression Testing — After every tuning change, run the same set of test conversations through the agent. Compare results to the baseline. If the rescheduling fix improved rescheduling but broke new appointments, you catch it immediately.
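A minimal regression harness just replays the same scenarios and compares per-category accuracy to the baseline. In the sketch below, `run_agent` is a stand-in for whatever actually invokes the agent; here it is stubbed so the comparison has something to show.

```python
def run_agent(scenario):
    # Stub: pretend the tuned agent passes everything except one
    # new-appointment case. A real harness would call the agent here.
    return scenario["id"] != "new-07"

# Pre-change accuracy per category (illustrative numbers).
baseline = {"reschedule": 0.70, "new_appointment": 0.92}

scenarios = (
    [{"id": f"res-{i:02d}", "category": "reschedule"} for i in range(10)]
    + [{"id": f"new-{i:02d}", "category": "new_appointment"} for i in range(10)]
)

results = {}
for category in baseline:
    group = [s for s in scenarios if s["category"] == category]
    passed = sum(run_agent(s) for s in group)
    results[category] = passed / len(group)

for category, score in results.items():
    delta = score - baseline[category]
    flag = "REGRESSION" if delta < -0.02 else "ok"
    print(f"{category}: {score:.0%} (baseline {baseline[category]:.0%}) {flag}")
```

The key design choice is testing both the fixed category and the untouched ones, so a fix that quietly breaks a neighbouring flow gets flagged in the same run.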
Anomaly Detection on Metrics — Azure Monitor can detect unusual patterns in time-series data. If error rate suddenly doubles at 2 AM, anomaly detection flags it even if it’s below your static alert threshold. This catches gradual degradation that static thresholds miss.
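Conceptually, this kind of detection compares each new data point against the recent norm rather than a fixed threshold. A simplified z-score sketch (not Azure Monitor's actual algorithm):

```python
import statistics

def detect_anomalies(series, window=6, z_threshold=3.0):
    """Flag points that deviate sharply from the trailing window's mean."""
    anomalies = []
    for i in range(window, len(series)):
        recent = series[i - window:i]
        mean = statistics.mean(recent)
        stdev = statistics.stdev(recent)
        if stdev and abs(series[i] - mean) / stdev > z_threshold:
            anomalies.append(i)
    return anomalies

# Hourly error rates: steady around 2%, then a sudden doubling at index 8.
# 4.5% might still sit below a static 5% alert threshold, yet it is clearly
# abnormal relative to recent history.
error_rates = [0.020, 0.021, 0.019, 0.022, 0.020, 0.021, 0.019, 0.020, 0.045]
print(detect_anomalies(error_rates))
```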
Deep Dive: Foundry evaluation uses a “judge” model to score responses. The judge model compares the agent’s response against the grounding source and the user’s question. Scores range from 1 to 5 for each dimension. A common tuning workflow: run evaluation, filter for scores below 3, review those conversations manually, then apply the appropriate tuning lever.
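The "filter for scores below 3" step of that workflow is easy to sketch. The record shape below mimics per-dimension 1-to-5 judge scores; it is not the actual Azure AI Foundry result schema.

```python
# Hypothetical evaluation output with 1-5 judge scores per dimension.
evaluations = [
    {"id": "conv-001", "groundedness": 5, "relevance": 4, "coherence": 5},
    {"id": "conv-002", "groundedness": 2, "relevance": 3, "coherence": 4},
    {"id": "conv-003", "groundedness": 4, "relevance": 2, "coherence": 5},
]

def needs_review(record, threshold=3):
    """Flag any conversation where at least one dimension scores below threshold."""
    dims = ("groundedness", "relevance", "coherence")
    return any(record[d] < threshold for d in dims)

review_queue = [r["id"] for r in evaluations if needs_review(r)]
print(review_queue)  # conversations to inspect manually
```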
Applying the Fix
Jordan determines the rescheduling issue is an intent misunderstanding — the agent treats “reschedule” as a synonym for “cancel.” The fix requires two tuning levers:
- Prompt tuning — Add an explicit instruction: “When a user says reschedule, modify, change, or move their appointment, treat this as a modification request, not a cancellation. Confirm the existing appointment details before offering new time slots.”
- Flow redesign — Add a dedicated “Reschedule” topic in Copilot Studio that triggers on reschedule-related phrases, separate from the cancellation flow.
After deploying the fix, Jordan runs a regression test with 50 rescheduling scenarios and 50 new-appointment scenarios. Rescheduling accuracy jumps from 70 percent to 93 percent. New appointments remain at 92 percent. The fix is validated.
Knowledge Check
Jordan discovers that the scheduling agent gives outdated clinic hours — it still shows pre-COVID hours from 2019. Conversation logs confirm the agent is confident in its responses but the information is wrong. Which tuning lever should Jordan use?
An agent's resolution rate has been steadily declining over 6 weeks. No code or prompt changes have been made. Which analysis approach is MOST appropriate?
What is the PRIMARY purpose of running automated regression tests after tuning an agent?
Next up: Testing Strategy — build a comprehensive test framework for AI agents, including how to use Copilot to generate test cases.