Agent Monitoring & Error Analysis
Deployed agents need ongoing supervision. Learn how to integrate monitoring, evaluate agent behaviour in production, and perform error analysis when things go wrong.
Why monitor agents?
A deployed agent is like a new employee — competent but unpredictable. You wouldn’t hire someone and never check their work.
Monitoring tells you: Is the agent accomplishing its goals? Is it using tools correctly? Is it staying within boundaries? When it fails, why did it fail? Without monitoring, problems compound silently until users complain.
Agent monitoring metrics
| Category | Metric | What to Watch For |
|---|---|---|
| Performance | Response latency (P50, P95, P99) | Latency spikes indicate tool failures or model issues |
| Reliability | Success rate (% of requests completed) | A drop below 95% signals a systemic problem |
| Tool usage | Tool call frequency and success rate | Tool failures cascade into agent failures |
| Quality | Groundedness, relevance, safety scores | Quality declining without code changes = drift |
| Cost | Tokens per request, cost per conversation | Unexpected cost increases signal inefficiency |
| Safety | Content filter trigger rate | Increasing triggers may indicate misuse or drift |
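To make the latency and reliability rows concrete, here is a minimal sketch of how you might compute P50/P95/P99 latency and a success-rate alert from raw request logs. The function names and the 95% threshold are illustrative, not tied to any specific monitoring product.

```python
import statistics

def latency_percentiles(latencies_ms):
    """Return P50, P95, and P99 for a list of per-request latencies (ms)."""
    qs = statistics.quantiles(latencies_ms, n=100)  # 99 cut points
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

def success_rate_ok(completed, total, threshold=0.95):
    """Flag a systemic problem when the success rate drops below threshold."""
    rate = completed / total if total else 1.0
    return rate >= threshold
```

In a real deployment these numbers would come from your tracing/telemetry backend rather than an in-memory list, but the percentile math and the threshold check are the same.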
Error analysis framework
When an agent fails, follow this investigation order:
| Step | What to Check | Common Findings |
|---|---|---|
| 1. Trace the request | Follow the full request through Foundry tracing | Identifies which step failed |
| 2. Check tool calls | Did tools execute correctly? | API timeouts, malformed parameters, auth failures |
| 3. Check retrieval | Was the right context retrieved? | Stale index, poor search relevance |
| 4. Check reasoning | Did the model reason correctly? | Wrong tool selection, poor planning |
| 5. Check safety | Did content filters block the response? | False positive on legitimate content |
| 6. Check context | Was conversation history too long/corrupted? | Context window overflow, memory issues |
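The six-step order above can be encoded as a simple triage checklist. This is a hypothetical sketch — the step names and the `findings` mapping are stand-ins for whatever your tracing tool actually reports — but it captures the idea of stopping at the first unhealthy layer.

```python
# Investigation order: outermost layer (tracing/tools) first, model last.
INVESTIGATION_STEPS = [
    ("trace", "Follow the full request through tracing"),
    ("tools", "Did tools execute correctly?"),
    ("retrieval", "Was the right context retrieved?"),
    ("reasoning", "Did the model reason correctly?"),
    ("safety", "Did content filters block the response?"),
    ("context", "Was conversation history too long or corrupted?"),
]

def triage(findings):
    """Return the first failing step, or None if all checks look healthy.

    `findings` maps step name -> bool (True = that layer looks healthy).
    """
    for name, question in INVESTIGATION_STEPS:
        if not findings.get(name, True):
            return name, question
    return None
```

Because the list is ordered, a tool failure is surfaced before anyone starts second-guessing the model's reasoning — which matches the exam tip later in this section.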
Common symptoms and where to look:
| Symptom | Category | Likely Cause | Investigation Step |
|---|---|---|---|
| Agent returns 'I cannot help with that' | Unexpected response | Content filter false positive or missing tool | Check safety filters and tool availability |
| Agent gives wrong information | Quality issue | Stale index or poor retrieval | Check search index health and relevance |
| Agent takes wrong action | Reasoning error | Ambiguous tool schemas or instructions | Review tool schemas and system prompt |
| Agent times out | Performance issue | Tool API timeout or overloaded model | Check tool latency and model capacity |
| Agent loops endlessly | Planning failure | Circular tool calls or missing termination condition | Review orchestration logic and add iteration limits |
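The "agent loops endlessly" row recommends adding iteration limits. Here is a hedged sketch of what that guard can look like: an agent loop with a hard iteration cap plus detection of repeated tool calls. The `step_fn` contract and the action tuples are assumptions for illustration, not a real framework API.

```python
def run_agent(step_fn, max_iterations=10):
    """Run an agent loop, stopping on a final answer, the iteration cap,
    or a repeated tool call with identical arguments (a circular plan).

    `step_fn(i)` is a hypothetical stand-in that returns either
    ("final", text) or ("tool", tool_name, args_tuple).
    """
    seen_calls = set()
    for i in range(max_iterations):
        action = step_fn(i)
        if action[0] == "final":
            return action[1]
        call_sig = (action[1], action[2])  # (tool name, args)
        if call_sig in seen_calls:         # circular tool call detected
            raise RuntimeError(f"loop detected on tool call {call_sig}")
        seen_calls.add(call_sig)
    raise RuntimeError("iteration limit reached without a final answer")
```

Real orchestration frameworks usually expose this as a configuration option (a max-turns or max-iterations setting); the point is that some termination condition must exist outside the model's own planning.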
Real-world example: Kai's debugging session
Kai gets a report that the shipping assistant is giving wrong delivery estimates. Investigation:
- Trace: finds the agent is calling the `estimate_delivery` tool correctly
- Tool check: the tool is returning data from last month's rate table
- Root cause: The rate table API was updated but the agent’s tool wasn’t reconfigured to point to the new endpoint
- Fix: Update the tool’s API endpoint, add a monitoring alert for rate table freshness
- Prevention: Add an automated test that verifies tool responses match expected schema
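Kai's prevention step — a test that verifies tool responses match an expected schema — might look like the following sketch. The field names are hypothetical; in practice you would derive the schema from the tool's documented contract (or use a library such as `jsonschema`).

```python
# Assumed schema for the delivery-estimate tool's response (illustrative).
EXPECTED_SCHEMA = {"estimate_days": int, "rate_table_version": str}

def validate_tool_response(response):
    """Return a list of schema violations; an empty list means valid."""
    errors = []
    for field, ftype in EXPECTED_SCHEMA.items():
        if field not in response:
            errors.append(f"missing field: {field}")
        elif not isinstance(response[field], ftype):
            errors.append(
                f"wrong type for {field}: {type(response[field]).__name__}"
            )
    return errors
```

Run as part of CI (or a scheduled synthetic check against the live endpoint), this catches a silently changed API contract before users see wrong delivery estimates.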
Total investigation time: 20 minutes using Foundry tracing. Without tracing, this could have taken days.
Exam tip: Error analysis order
The exam may ask “where should you start investigating?” Common pattern:
- Tools first — most agent failures are tool failures (API down, wrong params, auth expired)
- Retrieval second — stale data is the second most common cause
- Model reasoning third — the model itself is usually not the problem
Start from the outside (tools) and work inward (model). Don’t blame the model first.
Knowledge check
Atlas Financial's compliance agent suddenly starts returning 'I cannot assist with that request' for legitimate compliance queries. No code changes were deployed. What should they investigate first?
NeuralMed notices their patient agent's average tokens per conversation has doubled over the past week, increasing costs. Usage patterns haven't changed. What's the most likely cause?