Telemetry Interpretation and Agent Tuning
Interpret telemetry data, analyse user feedback backlogs, and apply AI-based tools to identify issues and tune agent performance.
Imagine you own a restaurant. You’ve got security cameras, receipt data, and comment cards. The cameras show you what’s happening — long queues, empty tables, confused waiters. The receipts show you what’s selling. The comment cards tell you how people feel.
But raw footage and receipts don’t fix anything. You need to interpret them. Why are queues forming at 6 PM? Because the kitchen is slow on pasta orders. Now you can fix it — retrain the pasta chef, simplify the menu, or add a second station.
Telemetry is your agent’s security camera, receipts, and comment cards. Tuning is the fix you apply after you understand the problem.
The Scenario
🤖 Jordan Reeves reviews the monitoring dashboards Sam built last sprint. The patient scheduling agent has a 72 percent resolution rate — below the 75 percent target. Worse, Jordan notices a pattern: conversations containing the word “reschedule” fail 30 percent of the time, compared to 8 percent for new appointments.
Jordan needs to dig into telemetry, figure out why, and tune the agent.
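Jordan's pattern can be found programmatically. The sketch below computes failure rates for conversations containing a keyword versus the rest; the record fields (`transcript`, `resolved`) are illustrative assumptions, not the actual Copilot Studio Analytics export schema.

```python
# Hypothetical exported conversation log records. Field names are assumptions,
# not the real Copilot Studio Analytics schema.
conversations = [
    {"transcript": "I need to reschedule my appointment", "resolved": False},
    {"transcript": "Book me a new appointment on Friday", "resolved": True},
    {"transcript": "Can I reschedule to next week?", "resolved": True},
    {"transcript": "New appointment for a checkup please", "resolved": True},
]

def failure_rate_by_keyword(logs, keyword):
    """Compare failure rates for conversations with and without `keyword`."""
    with_kw = [c for c in logs if keyword in c["transcript"].lower()]
    without = [c for c in logs if keyword not in c["transcript"].lower()]

    def rate(group):
        return sum(1 for c in group if not c["resolved"]) / len(group) if group else 0.0

    return rate(with_kw), rate(without)

kw_rate, other_rate = failure_rate_by_keyword(conversations, "reschedule")
print(f"'reschedule' failure rate: {kw_rate:.0%}, others: {other_rate:.0%}")
```

Running this over a real log export is how a gap like "30 percent versus 8 percent" surfaces before anyone reads a single transcript.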
Telemetry Data Types
Every agent generates multiple telemetry streams. Understanding what each one tells you is the foundation of effective tuning:
| Data Type | What It Contains | What It Reveals | Where to Find It |
|---|---|---|---|
| Conversation logs | Full transcript of user-agent exchanges | Exact failure points, misunderstood intents, topic gaps | Copilot Studio Analytics |
| Completion metrics | Token usage, response confidence scores, topic match scores | Whether the agent is confident in its responses or guessing | Application Insights custom events |
| Latency traces | End-to-end timing for each conversation turn | Bottlenecks in knowledge retrieval, API calls, or model inference | Application Insights dependency tracking |
| Error logs | System errors, connector failures, timeout events | Infrastructure issues vs logic issues | Application Insights exceptions |
| User satisfaction scores | CSAT ratings, thumbs up/down, free-text feedback | How users feel about the experience, independent of resolution | Copilot Studio Analytics and custom surveys |
The Four Tuning Levers
When telemetry reveals a problem, you need to pick the right fix. Using the wrong lever wastes time and may not solve the issue:
| Aspect | Prompt Tuning | Knowledge Source Tuning | Model Fine-Tuning | Flow Redesign |
|---|---|---|---|---|
| What You Change | System instructions, few-shot examples, guardrails | Documents, data sources, grounding content | The underlying model with custom training data | Conversation flow structure, branching logic, escalation rules |
| When to Use | Agent misunderstands intent or gives wrong tone | Agent lacks information or cites outdated data | Agent consistently fails on domain-specific language | Agent follows the wrong path or loops in conversation |
| Effort Level | Low — minutes to hours | Low to Medium — update and re-index | High — requires labelled data and compute | Medium — requires flow testing after changes |
| Risk Level | Low — easy to revert | Low — content swap | Medium — may affect other scenarios | Medium — may break existing paths |
| Example Fix | Add instruction: treat reschedule as a modification, not a cancellation | Add rescheduling policy document to knowledge base | Fine-tune on 5,000 healthcare scheduling conversations | Add explicit reschedule branch before the general booking flow |
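The symptom-to-lever mapping in the table can be captured as a small lookup, which is handy in triage tooling. The symptom labels below are illustrative, not an official taxonomy.

```python
# Minimal sketch of the lever-selection logic from the table above.
TUNING_LEVERS = {
    "misunderstood_intent": "Prompt tuning",
    "wrong_tone": "Prompt tuning",
    "missing_information": "Knowledge source tuning",
    "outdated_information": "Knowledge source tuning",
    "domain_language_failure": "Model fine-tuning",
    "wrong_conversation_path": "Flow redesign",
}

def pick_lever(symptom: str) -> str:
    # Default to more analysis rather than guessing a fix.
    return TUNING_LEVERS.get(symptom, "Analyse telemetry further")

print(pick_lever("outdated_information"))  # Knowledge source tuning
```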
Feedback Backlog Analysis
User feedback is gold — but only if you mine it systematically. Here’s how Jordan processes the feedback backlog:
Step 1: Collect and Centralise
Pull feedback from all sources into one place: CSAT scores from Copilot Studio, thumbs-down transcripts, support tickets that mention the agent, and direct user comments. Jordan uses a Power Automate flow to pipe all of these into a Dataverse table.
Step 2: Categorise by Issue Type
Jordan tags each piece of feedback:
- Intent misunderstanding — the agent didn’t understand what the user wanted
- Incorrect information — the agent gave a wrong answer
- Slow response — the agent took too long
- Tone or style — the response was technically correct but felt robotic or unhelpful
- Missing capability — the user wanted something the agent can’t do yet
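A first-pass version of this tagging can be rule-based before any AI classifier is involved. The keyword patterns below are hypothetical examples, not a production rule set.

```python
import re

# Hypothetical keyword rules for first-pass feedback triage.
CATEGORY_RULES = {
    "intent_misunderstanding": r"didn't understand|misunderstood|wrong thing",
    "incorrect_information": r"wrong answer|incorrect|outdated",
    "slow_response": r"slow|took forever|waiting",
    "tone_or_style": r"robotic|rude|unfriendly",
    "missing_capability": r"can't do|doesn't support|wish it could",
}

def categorise(comment: str) -> str:
    for category, pattern in CATEGORY_RULES.items():
        if re.search(pattern, comment, re.IGNORECASE):
            return category
    return "uncategorised"

print(categorise("The bot gave me the wrong answer about clinic hours"))
```

Anything that lands in `uncategorised` goes to manual review, which is also where new rules come from.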
Step 3: Identify Patterns
Sorting by category, Jordan sees that 45 percent of negative feedback falls under “intent misunderstanding,” and 60 percent of those involve rescheduling. The pattern is clear.
Step 4: Prioritise Fixes
Not every issue is worth fixing immediately. Jordan uses a simple impact matrix: frequency of the issue multiplied by severity. Rescheduling failures are high-frequency and high-severity (patients miss appointments), so they go to the top of the list.
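Jordan's impact matrix reduces to a one-line score per issue. The frequencies and severity weights below are made-up numbers for illustration.

```python
# Priority = frequency x severity, per Jordan's impact matrix.
# All figures below are illustrative.
issues = [
    {"name": "Rescheduling fails", "frequency": 120, "severity": 5},
    {"name": "Robotic tone", "frequency": 80, "severity": 2},
    {"name": "Slow knowledge lookups", "frequency": 40, "severity": 3},
]

for issue in issues:
    issue["priority"] = issue["frequency"] * issue["severity"]

ranked = sorted(issues, key=lambda i: i["priority"], reverse=True)
for issue in ranked:
    print(f"{issue['priority']:>4}  {issue['name']}")
```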
Exam Tip: The exam tests your ability to describe the process for using telemetry, not just name the tools. A common question pattern: “A solution architect notices that agent accuracy has declined. What should they do FIRST?” The answer is almost always “analyse telemetry to identify the root cause” — not “retrain the model” or “rewrite the prompt.”
AI-Based Analysis Tools
You don’t have to do all the analysis manually. Several AI-powered tools can accelerate the process:
Azure AI Foundry Evaluation — Run evaluation pipelines against conversation logs. Define metrics like groundedness (did the response stick to source material?), relevance (did it answer the question asked?), and coherence (did it make sense?). Foundry scores each conversation and flags outliers.
Automated Regression Testing — After every tuning change, run the same set of test conversations through the agent. Compare results to the baseline. If the rescheduling fix improved rescheduling but broke new appointments, you catch it immediately.
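A minimal regression harness just replays the same scenarios and compares per-category accuracy to the baseline. In the sketch below, `run_agent` is a stand-in for whatever actually invokes the agent; here it is stubbed so the comparison has something to show.

```python
def run_agent(scenario):
    # Stub: pretend the tuned agent passes everything except one
    # new-appointment case. A real harness would call the agent here.
    return scenario["id"] != "new-07"

# Pre-change accuracy per category (illustrative numbers).
baseline = {"reschedule": 0.70, "new_appointment": 0.92}

scenarios = (
    [{"id": f"res-{i:02d}", "category": "reschedule"} for i in range(10)]
    + [{"id": f"new-{i:02d}", "category": "new_appointment"} for i in range(10)]
)

results = {}
for category in baseline:
    group = [s for s in scenarios if s["category"] == category]
    passed = sum(run_agent(s) for s in group)
    results[category] = passed / len(group)

for category, score in results.items():
    delta = score - baseline[category]
    flag = "REGRESSION" if delta < -0.02 else "ok"
    print(f"{category}: {score:.0%} (baseline {baseline[category]:.0%}) {flag}")
```

The key design choice is testing both the fixed category and the untouched ones, so a fix that quietly breaks a neighbouring flow gets flagged in the same run.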
Anomaly Detection on Metrics — Azure Monitor can detect unusual patterns in time-series data. If error rate suddenly doubles at 2 AM, anomaly detection flags it even if it’s below your static alert threshold. This catches gradual degradation that static thresholds miss.
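Conceptually, this kind of detection compares each new data point against the recent norm rather than a fixed threshold. A simplified z-score sketch (not Azure Monitor's actual algorithm):

```python
import statistics

def detect_anomalies(series, window=6, z_threshold=3.0):
    """Flag points that deviate sharply from the trailing window's mean."""
    anomalies = []
    for i in range(window, len(series)):
        recent = series[i - window:i]
        mean = statistics.mean(recent)
        stdev = statistics.stdev(recent)
        if stdev and abs(series[i] - mean) / stdev > z_threshold:
            anomalies.append(i)
    return anomalies

# Hourly error rates: steady around 2%, then a sudden doubling at index 8.
# 4.5% might still sit below a static 5% alert threshold, yet it is clearly
# abnormal relative to recent history.
error_rates = [0.020, 0.021, 0.019, 0.022, 0.020, 0.021, 0.019, 0.020, 0.045]
print(detect_anomalies(error_rates))
```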
Deep Dive: Foundry evaluation uses a “judge” model to score responses. The judge model compares the agent’s response against the grounding source and the user’s question. Scores range from 1 to 5 for each dimension. A common tuning workflow: run evaluation, filter for scores below 3, review those conversations manually, then apply the appropriate tuning lever.
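The "filter for scores below 3" step of that workflow is easy to sketch. The record shape below mimics per-dimension 1-to-5 judge scores; it is not the actual Azure AI Foundry result schema.

```python
# Hypothetical evaluation output with 1-5 judge scores per dimension.
evaluations = [
    {"id": "conv-001", "groundedness": 5, "relevance": 4, "coherence": 5},
    {"id": "conv-002", "groundedness": 2, "relevance": 3, "coherence": 4},
    {"id": "conv-003", "groundedness": 4, "relevance": 2, "coherence": 5},
]

def needs_review(record, threshold=3):
    """Flag any conversation where at least one dimension scores below threshold."""
    dims = ("groundedness", "relevance", "coherence")
    return any(record[d] < threshold for d in dims)

review_queue = [r["id"] for r in evaluations if needs_review(r)]
print(review_queue)  # conversations to inspect manually
```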
Applying the Fix
Jordan determines the rescheduling issue is an intent misunderstanding — the agent treats “reschedule” as a synonym for “cancel.” The fix requires two tuning levers:
- Prompt tuning — Add an explicit instruction: “When a user says reschedule, modify, change, or move their appointment, treat this as a modification request, not a cancellation. Confirm the existing appointment details before offering new time slots.”
- Flow redesign — Add a dedicated “Reschedule” topic in Copilot Studio that triggers on reschedule-related phrases, separate from the cancellation flow.
After deploying the fix, Jordan runs a regression test with 50 rescheduling scenarios and 50 new-appointment scenarios. Rescheduling accuracy jumps from 70 percent to 93 percent. New appointments remain at 92 percent. The fix is validated.
Knowledge Check
Jordan discovers that the scheduling agent gives outdated clinic hours — it still shows pre-COVID hours from 2019. Conversation logs confirm the agent is confident in its responses but the information is wrong. Which tuning lever should Jordan use?
An agent's resolution rate has been steadily declining over 6 weeks. No code or prompt changes have been made. Which analysis approach is MOST appropriate?
What is the PRIMARY purpose of running automated regression tests after tuning an agent?
Next up: Testing Strategy — build a comprehensive test framework for AI agents, including how to use Copilot to generate test cases.