
AI-103 Study Guide

Domain 1: Plan and Manage an Azure AI Solution

  • Choosing the Right AI Model Free
  • Foundry Services: Your AI Toolkit Free
  • Retrieval, Indexing & Agent Memory
  • Designing AI Infrastructure
  • Deploying Models & CI/CD
  • Quotas, Scaling & Cost
  • Monitoring & Security
  • Responsible AI: Filters, Auditing & Governance

Domain 2: Implement Generative AI and Agentic Solutions

  • Connecting Your App to Foundry Free
  • Building RAG Applications
  • Workflows & Reasoning Pipelines
  • Evaluating AI Models & Apps
  • Agent Fundamentals: Roles, Goals & Tools Free
  • Building Agents with Retrieval & Memory
  • Agent Tools & Knowledge Integration
  • Multi-Agent Orchestration & Safeguards
  • Agent Monitoring & Error Analysis
  • Prompt Engineering & Model Tuning
  • Observability & Production Operations

Domain 3: Implement Computer Vision Solutions

  • Image & Video Generation
  • Multimodal Visual Understanding
  • Responsible AI for Visual Content

Domain 4: Implement Text Analysis Solutions

  • Text Analysis with Language Models
  • Speech, Translation & Voice Agents

Domain 5: Implement Information Extraction Solutions

  • Ingestion, Indexing & Grounding Pipelines
  • Extracting Content with Content Understanding
  • Exam Prep: Putting It All Together

Domain 2: Implement Generative AI and Agentic Solutions ⏱ ~12 min read

Agent Monitoring & Error Analysis

Deployed agents need ongoing supervision. Learn how to integrate monitoring, evaluate agent behaviour in production, and perform error analysis when things go wrong.

Why monitor agents?

☕ Simple explanation

A deployed agent is like a new employee — competent but unpredictable. You wouldn’t hire someone and never check their work.

Monitoring tells you: Is the agent accomplishing its goals? Is it using tools correctly? Is it staying within boundaries? When it fails, why did it fail? Without monitoring, problems compound silently until users complain.

Agent monitoring covers three distinct activities:

  • Runtime monitoring — real-time metrics on agent performance, tool calls, and errors
  • Behaviour evaluation — periodic assessment of response quality, groundedness, and goal achievement
  • Error analysis — investigating failures to understand root causes and prevent recurrence

Foundry provides built-in tracing and evaluation tools, integrated with Azure Monitor for dashboards and alerting.
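To make the idea of tracing concrete, here is a minimal stdlib-only sketch of what a tracing layer records for each agent step. This is a homegrown stand-in for illustration, not the Foundry tracing API: the decorator name, record fields, and the `estimate_delivery` tool are all invented for this example.

```python
import functools
import time
import uuid

TRACE_LOG = []  # in-memory sink; a real system would export to Azure Monitor


def traced(step_name):
    """Record identity, duration, and success of each agent step."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            record = {"trace_id": uuid.uuid4().hex, "step": step_name}
            start = time.perf_counter()
            try:
                result = fn(*args, **kwargs)
                record["success"] = True
                return result
            except Exception as exc:
                record["success"] = False
                record["error"] = repr(exc)
                raise
            finally:
                record["latency_ms"] = (time.perf_counter() - start) * 1000
                TRACE_LOG.append(record)
        return inner
    return wrap


@traced("tool.estimate_delivery")
def estimate_delivery(postcode):
    return {"eta_days": 3}  # placeholder for the real tool call


estimate_delivery("SW1A 1AA")
```

Because every step appends a record whether it succeeds or fails, a failed request leaves a trail showing exactly which step broke and how long it took — the property that makes the error-analysis framework below workable.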

Agent monitoring metrics

| Category | Metric | What to Watch For |
| --- | --- | --- |
| Performance | Response latency (P50, P95, P99) | Latency spikes indicate tool failures or model issues |
| Reliability | Success rate (% of requests completed) | A drop below 95% signals a systemic problem |
| Tool usage | Tool call frequency and success rate | Tool failures cascade into agent failures |
| Quality | Groundedness, relevance, safety scores | Quality declining without code changes suggests drift |
| Cost | Tokens per request, cost per conversation | Unexpected cost increases signal inefficiency |
| Safety | Content filter trigger rate | Increasing triggers may indicate misuse or drift |
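The latency percentiles and success rate above can be computed directly from request logs. A minimal sketch, assuming each log entry carries an illustrative `latency_ms` value and an `ok` flag:

```python
import statistics

# Illustrative request log; in production this would come from telemetry.
requests = [
    {"latency_ms": 220, "ok": True},
    {"latency_ms": 180, "ok": True},
    {"latency_ms": 2500, "ok": False},  # a slow, failed request
    {"latency_ms": 310, "ok": True},
    {"latency_ms": 205, "ok": True},
]

latencies = sorted(r["latency_ms"] for r in requests)
# statistics.quantiles with n=100 returns 99 cut points;
# indices 49, 94, and 98 correspond to P50, P95, and P99.
cuts = statistics.quantiles(latencies, n=100)
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

success_rate = sum(r["ok"] for r in requests) / len(requests)
print(f"P50={p50:.0f}ms P95={p95:.0f}ms P99={p99:.0f}ms success={success_rate:.0%}")
if success_rate < 0.95:
    print("ALERT: success rate below 95% threshold")
```

Note how one slow outlier barely moves P50 but dominates P95/P99 — which is exactly why SLAs are written against the higher percentiles.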

Error analysis framework

When an agent fails, follow this investigation order:

| Step | What to Check | Common Findings |
| --- | --- | --- |
| 1. Trace the request | Follow the full request through Foundry tracing | Identifies which step failed |
| 2. Check tool calls | Did tools execute correctly? | API timeouts, malformed parameters, auth failures |
| 3. Check retrieval | Was the right context retrieved? | Stale index, poor search relevance |
| 4. Check reasoning | Did the model reason correctly? | Wrong tool selection, poor planning |
| 5. Check safety | Did content filters block the response? | False positive on legitimate content |
| 6. Check context | Was conversation history too long or corrupted? | Context window overflow, memory issues |
Common agent failures and their causes

| Failure | Symptom | Likely Cause | Investigation Step |
| --- | --- | --- | --- |
| Agent returns "I cannot help with that" | Unexpected response | Content filter false positive or missing tool | Check safety filters and tool availability |
| Agent gives wrong information | Quality issue | Stale index or poor retrieval | Check search index health and relevance |
| Agent takes wrong action | Reasoning error | Ambiguous tool schemas or instructions | Review tool schemas and system prompt |
| Agent times out | Performance issue | Tool API timeout or overloaded model | Check tool latency and model capacity |
| Agent loops endlessly | Planning failure | Circular tool calls or missing termination condition | Review orchestration logic and add iteration limits |
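The "loops endlessly" row deserves a concrete guard. A minimal sketch of a plan-act loop with a hard iteration cap — the function names, cap value, and the deliberately circular planner are all illustrative:

```python
MAX_ITERATIONS = 10  # hard cap; tune per agent


def run_agent(task, pick_next_tool, execute):
    """pick_next_tool returns a tool name, or None when the task is done."""
    history = []
    for step in range(MAX_ITERATIONS):
        tool = pick_next_tool(task, history)
        if tool is None:  # termination condition reached
            return {"done": True, "steps": step, "history": history}
        history.append((tool, execute(tool)))
    # Cap hit: surface an explicit failure instead of spinning forever.
    return {"done": False, "steps": MAX_ITERATIONS, "history": history}


# A circular planner that never terminates on its own:
looping = run_agent(
    "track parcel",
    pick_next_tool=lambda task, hist: "lookup_status",
    execute=lambda tool: "in transit",
)
print(looping["done"], looping["steps"])  # False 10 — the loop was caught
```

The important design choice is that hitting the cap is reported as a distinct failure mode, so monitoring can count capped runs separately from genuine errors.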
ℹ️ Real-world example: Kai's debugging session

Kai gets a report that the shipping assistant is giving wrong delivery estimates. Investigation:

  1. Trace: Finds the agent is calling estimate_delivery tool correctly
  2. Tool check: Tool is returning data from last month’s rate table
  3. Root cause: The rate table API was updated but the agent’s tool wasn’t reconfigured to point to the new endpoint
  4. Fix: Update the tool’s API endpoint, add a monitoring alert for rate table freshness
  5. Prevention: Add an automated test that verifies tool responses match expected schema

Total investigation time: 20 minutes using Foundry tracing. Without tracing, this could have taken days.
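Kai's prevention step — verifying that tool responses match an expected schema — can be sketched as a simple check. The field names and types here are invented for the shipping example:

```python
# Hypothetical expected shape of the estimate_delivery tool's response.
EXPECTED_SCHEMA = {"eta_days": int, "carrier": str}


def validate_tool_response(response: dict, schema: dict) -> list:
    """Return a list of schema violations; an empty list means valid."""
    errors = []
    for key, typ in schema.items():
        if key not in response:
            errors.append(f"missing field: {key}")
        elif not isinstance(response[key], typ):
            errors.append(
                f"{key}: expected {typ.__name__}, "
                f"got {type(response[key]).__name__}"
            )
    return errors


good = {"eta_days": 3, "carrier": "FastPost"}
stale = {"eta_days": "3-5 days"}  # wrong type, and carrier is missing

assert validate_tool_response(good, EXPECTED_SCHEMA) == []
assert len(validate_tool_response(stale, EXPECTED_SCHEMA)) == 2
```

Run as part of CI or as a scheduled probe, a check like this catches a silently changed upstream API before users report wrong answers.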

💡 Exam tip: Error analysis order

The exam may ask “where should you start investigating?” Common pattern:

  1. Tools first — most agent failures are tool failures (API down, wrong params, auth expired)
  2. Retrieval second — stale data is the second most common cause
  3. Model reasoning third — the model itself is usually not the problem

Start from the outside (tools) and work inward (model). Don’t blame the model first.
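The outside-in ordering can be encoded as a triage checklist that stops at the first failing layer. The layer names mirror the list above; the pass/fail predicates are illustrative:

```python
def triage(checks):
    """Run (layer, passed) checks in order; return the first failing layer."""
    for layer, passed in checks:
        if not passed:
            return layer
    return None  # everything passed — look elsewhere


incident = [
    ("tools", False),        # e.g. tool API returned 401 — auth expired
    ("retrieval", True),
    ("model reasoning", True),
]
print(triage(incident))  # tools — stop here before blaming the model
```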

Key terms

Question

What is Foundry tracing?


Answer

A built-in observability feature that records every step of an agent's execution — model calls, tool invocations, retrieval queries, and responses. Essential for debugging agent failures and understanding behaviour.


Question

What is an agent loop (infinite loop)?


Answer

When an agent gets stuck in a cycle of tool calls without making progress. Common causes: circular dependencies between tools, missing termination conditions, or conflicting instructions. Prevented by setting iteration limits.


Question

What is P95 latency?


Answer

The 95th percentile response time — 95% of requests complete faster than this value. Used as an SLA metric because it captures the 'typical worst case' that real users experience, excluding extreme outliers.


Knowledge check

Atlas Financial's compliance agent suddenly starts returning 'I cannot assist with that request' for legitimate compliance queries. No code changes were deployed. What should they investigate first?


NeuralMed notices their patient agent's average tokens per conversation has doubled over the past week, increasing costs. Usage patterns haven't changed. What's the most likely cause?



Guided

I learn, I simplify, I share.

A Guide to Cloud · YouTube · Feedback

© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.