Agent Monitoring & Error Analysis
Deployed agents need ongoing supervision. Learn how to integrate monitoring, evaluate agent behaviour in production, and perform error analysis when things go wrong.
Why monitor agents?
A deployed agent is like a new employee — competent but unpredictable. You wouldn’t hire someone and never check their work.
Monitoring tells you: Is the agent accomplishing its goals? Is it using tools correctly? Is it staying within boundaries? When it fails, why did it fail? Without monitoring, problems compound silently until users complain.
Agent monitoring metrics
| Category | Metric | What to Watch For |
|---|---|---|
| Performance | Response latency (P50, P95, P99) | Latency spikes indicate tool failures or model issues |
| Reliability | Success rate (% of requests completed) | A drop below 95% signals a systemic problem |
| Tool usage | Tool call frequency and success rate | Tool failures cascade into agent failures |
| Quality | Groundedness, relevance, safety scores | Quality declining without code changes = drift |
| Cost | Tokens per request, cost per conversation | Unexpected cost increases signal inefficiency |
| Safety | Content filter trigger rate | Increasing triggers may indicate misuse or drift |
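To make the latency and reliability rows concrete, here is a minimal sketch of how you might compute P50/P95/P99 latency and a success-rate alert from raw request logs. The function names and the 95% threshold are illustrative, not tied to any specific monitoring product.

```python
import statistics

def latency_percentiles(latencies_ms):
    """Return P50, P95, and P99 for a list of per-request latencies (ms)."""
    qs = statistics.quantiles(latencies_ms, n=100)  # 99 cut points
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

def success_rate_ok(completed, total, threshold=0.95):
    """Flag a systemic problem when the success rate drops below threshold."""
    rate = completed / total if total else 1.0
    return rate >= threshold
```

In a real deployment these numbers would come from your tracing/telemetry backend rather than an in-memory list, but the percentile math and the threshold check are the same.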
Error analysis framework
When an agent fails, follow this investigation order:
| Step | What to Check | Common Findings |
|---|---|---|
| 1. Trace the request | Follow the full request through Foundry tracing | Identifies which step failed |
| 2. Check tool calls | Did tools execute correctly? | API timeouts, malformed parameters, auth failures |
| 3. Check retrieval | Was the right context retrieved? | Stale index, poor search relevance |
| 4. Check reasoning | Did the model reason correctly? | Wrong tool selection, poor planning |
| 5. Check safety | Did content filters block the response? | False positive on legitimate content |
| 6. Check context | Was conversation history too long/corrupted? | Context window overflow, memory issues |
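The six-step order above can be encoded as a simple triage checklist. This is a hypothetical sketch — the step names and the `findings` mapping are stand-ins for whatever your tracing tool actually reports — but it captures the idea of stopping at the first unhealthy layer.

```python
# Investigation order: outermost layer (tracing/tools) first, model last.
INVESTIGATION_STEPS = [
    ("trace", "Follow the full request through tracing"),
    ("tools", "Did tools execute correctly?"),
    ("retrieval", "Was the right context retrieved?"),
    ("reasoning", "Did the model reason correctly?"),
    ("safety", "Did content filters block the response?"),
    ("context", "Was conversation history too long or corrupted?"),
]

def triage(findings):
    """Return the first failing step, or None if all checks look healthy.

    `findings` maps step name -> bool (True = that layer looks healthy).
    """
    for name, question in INVESTIGATION_STEPS:
        if not findings.get(name, True):
            return name, question
    return None
```

Because the list is ordered, a tool failure is surfaced before anyone starts second-guessing the model's reasoning — which matches the exam tip later in this section.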
Common symptoms and where to look:
| Symptom | Category | Likely Cause | Investigation Step |
|---|---|---|---|
| Agent returns 'I cannot help with that' | Unexpected response | Content filter false positive or missing tool | Check safety filters and tool availability |
| Agent gives wrong information | Quality issue | Stale index or poor retrieval | Check search index health and relevance |
| Agent takes wrong action | Reasoning error | Ambiguous tool schemas or instructions | Review tool schemas and system prompt |
| Agent times out | Performance issue | Tool API timeout or overloaded model | Check tool latency and model capacity |
| Agent loops endlessly | Planning failure | Circular tool calls or missing termination condition | Review orchestration logic and add iteration limits |
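The "agent loops endlessly" row recommends adding iteration limits. Here is a hedged sketch of what that guard can look like: an agent loop with a hard iteration cap plus detection of repeated tool calls. The `step_fn` contract and the action tuples are assumptions for illustration, not a real framework API.

```python
def run_agent(step_fn, max_iterations=10):
    """Run an agent loop, stopping on a final answer, the iteration cap,
    or a repeated tool call with identical arguments (a circular plan).

    `step_fn(i)` is a hypothetical stand-in that returns either
    ("final", text) or ("tool", tool_name, args_tuple).
    """
    seen_calls = set()
    for i in range(max_iterations):
        action = step_fn(i)
        if action[0] == "final":
            return action[1]
        call_sig = (action[1], action[2])  # (tool name, args)
        if call_sig in seen_calls:         # circular tool call detected
            raise RuntimeError(f"loop detected on tool call {call_sig}")
        seen_calls.add(call_sig)
    raise RuntimeError("iteration limit reached without a final answer")
```

Real orchestration frameworks usually expose this as a configuration option (a max-turns or max-iterations setting); the point is that some termination condition must exist outside the model's own planning.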
Real-world example: Kai's debugging session
Kai gets a report that the shipping assistant is giving wrong delivery estimates. Investigation:
- Trace: finds the agent is calling the `estimate_delivery` tool correctly
- Tool check: the tool is returning data from last month's rate table
- Root cause: The rate table API was updated but the agent’s tool wasn’t reconfigured to point to the new endpoint
- Fix: Update the tool’s API endpoint, add a monitoring alert for rate table freshness
- Prevention: Add an automated test that verifies tool responses match expected schema
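Kai's prevention step — a test that verifies tool responses match an expected schema — might look like the following sketch. The field names are hypothetical; in practice you would derive the schema from the tool's documented contract (or use a library such as `jsonschema`).

```python
# Assumed schema for the delivery-estimate tool's response (illustrative).
EXPECTED_SCHEMA = {"estimate_days": int, "rate_table_version": str}

def validate_tool_response(response):
    """Return a list of schema violations; an empty list means valid."""
    errors = []
    for field, ftype in EXPECTED_SCHEMA.items():
        if field not in response:
            errors.append(f"missing field: {field}")
        elif not isinstance(response[field], ftype):
            errors.append(
                f"wrong type for {field}: {type(response[field]).__name__}"
            )
    return errors
```

Run as part of CI (or a scheduled synthetic check against the live endpoint), this catches a silently changed API contract before users see wrong delivery estimates.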
Total investigation time: 20 minutes using Foundry tracing. Without tracing, this could have taken days.
Exam tip: Error analysis order
The exam may ask “where should you start investigating?” Common pattern:
- Tools first — most agent failures are tool failures (API down, wrong params, auth expired)
- Retrieval second — stale data is the second most common cause
- Model reasoning third — the model itself is usually not the problem
Start from the outside (tools) and work inward (model). Don’t blame the model first.
Knowledge check
Atlas Financial's compliance agent suddenly starts returning 'I cannot assist with that request' for legitimate compliance queries. No code changes were deployed. What should they investigate first?
NeuralMed notices their patient agent's average tokens per conversation has doubled over the past week, increasing costs. Usage patterns haven't changed. What's the most likely cause?