
AI-103 Study Guide

Domain 1: Plan and Manage an Azure AI Solution

  • Choosing the Right AI Model Free
  • Foundry Services: Your AI Toolkit Free
  • Retrieval, Indexing & Agent Memory
  • Designing AI Infrastructure
  • Deploying Models & CI/CD
  • Quotas, Scaling & Cost
  • Monitoring & Security
  • Responsible AI: Filters, Auditing & Governance

Domain 2: Implement Generative AI and Agentic Solutions

  • Connecting Your App to Foundry Free
  • Building RAG Applications
  • Workflows & Reasoning Pipelines
  • Evaluating AI Models & Apps
  • Agent Fundamentals: Roles, Goals & Tools Free
  • Building Agents with Retrieval & Memory
  • Agent Tools & Knowledge Integration
  • Multi-Agent Orchestration & Safeguards
  • Agent Monitoring & Error Analysis
  • Prompt Engineering & Model Tuning
  • Observability & Production Operations

Domain 3: Implement Computer Vision Solutions

  • Image & Video Generation
  • Multimodal Visual Understanding
  • Responsible AI for Visual Content

Domain 4: Implement Text Analysis Solutions

  • Text Analysis with Language Models
  • Speech, Translation & Voice Agents

Domain 5: Implement Information Extraction Solutions

  • Ingestion, Indexing & Grounding Pipelines
  • Extracting Content with Content Understanding
  • Exam Prep: Putting It All Together

Domain 2: Implement Generative AI and Agentic Solutions (~12 min read)

Observability & Production Operations

Production AI systems need eyes everywhere. Learn how to set up tracing, token analytics, safety-signal monitoring, and latency tracking, and how to orchestrate multiple models and hybrid engines.

Observability for AI systems

☕ Simple explanation

Observability is like having CCTV, speed cameras, and a dashboard for your AI system — you can see everything that’s happening, catch problems early, and know exactly where things went wrong.

Without observability, your AI is a black box. Users complain about slow responses, but you don’t know why. Costs spike, but you don’t know what’s causing it. Quality drops, but you can’t trace it back to a specific change.

AI observability in Foundry provides four pillars of visibility:

  • Tracing — end-to-end request tracking through model calls, tool invocations, and retrieval
  • Token analytics — input/output token counts, cost per request, cost trends
  • Safety signals — content filter triggers, blocked requests, safety event rates
  • Latency breakdowns — time spent in each pipeline stage (retrieval, model inference, tool execution)

The four observability pillars

| Pillar | What It Shows | Key Metrics |
|---|---|---|
| Tracing | Full request journey through the system | Trace ID, span hierarchy, error annotations |
| Token analytics | Token consumption and cost | Tokens per request, cost per conversation, daily/weekly trends |
| Safety signals | Content moderation activity | Filter trigger rate, blocked request %, categories triggered |
| Latency breakdown | Time spent in each stage | Model inference time, tool call latency, retrieval time, total E2E |
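The token-analytics pillar comes down to simple arithmetic over per-request usage records. A minimal sketch in Python, where the per-1K-token rates and the `RATES` table are placeholders for illustration, not real Azure pricing:

```python
# Hypothetical per-1K-token rates ($); look up real pricing before relying on this.
RATES = {
    "gpt-4o": {"input": 0.0025, "output": 0.01},
    "phi-4":  {"input": 0.0001, "output": 0.0004},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of a single request in dollars: tokens / 1000 * rate, per direction."""
    r = RATES[model]
    return (input_tokens / 1000) * r["input"] + (output_tokens / 1000) * r["output"]

def daily_cost(requests: list[dict]) -> float:
    """Aggregate cost across a batch of request records."""
    return sum(request_cost(q["model"], q["in"], q["out"]) for q in requests)
```

Tracking this per request (rather than only per invoice) is what lets you spot which feature or tenant is driving a cost spike.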

Implementing tracing

Tracing records every step of a request:

| Trace Component | What It Captures | Example |
|---|---|---|
| Root span | The entire request lifecycle | User sends “What’s our refund policy?” |
| Retrieval span | Time and results of search queries | Azure AI Search returns 5 documents in 120ms |
| Model span | LLM inference time, token counts | GPT-4o processes 2,400 input tokens, generates 350 output tokens in 1.2s |
| Tool span | External function execution | verify_customer(id) returns in 80ms |
| Error annotation | Any failures along the way | Tool timeout, safety filter block, rate limit hit |
💡 Exam tip: Where latency hides

The exam may ask about optimising latency. Common bottlenecks:

  • Retrieval — complex search queries, large indexes, cross-region search
  • Model inference — large prompts, verbose system prompts, high max_tokens
  • Tool calls — slow external APIs, sequential calls that could be parallel
  • Network — cross-region hops between services

Tracing breaks down exactly where time is spent, so you fix the right bottleneck.
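One of the bottlenecks above, sequential tool calls that could run in parallel, is easy to demonstrate. A sketch using Python's standard `concurrent.futures`, with `time.sleep` standing in for slow external APIs (the tool names are made up):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def call_tool(name: str) -> str:
    time.sleep(0.1)  # stand-in for a ~100ms external API call
    return f"{name}: ok"

tools = ["verify_customer", "check_inventory", "get_shipping_quote"]

# Sequential: total latency is the SUM of the calls (~0.3s here).
start = time.perf_counter()
sequential = [call_tool(t) for t in tools]
seq_elapsed = time.perf_counter() - start

# Parallel: independent calls overlap, so total is roughly the SLOWEST call (~0.1s).
start = time.perf_counter()
with ThreadPoolExecutor() as pool:
    parallel = list(pool.map(call_tool, tools))
par_elapsed = time.perf_counter() - start
```

The results are identical either way; only the wall-clock time changes, which is exactly the kind of difference a latency breakdown in a trace will surface.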

Orchestrating multiple models

Production systems often use more than one model. Orchestration patterns:

Multi-model orchestration patterns
| Pattern | How It Works | Why It Helps | Use Case |
|---|---|---|---|
| Model Router | Route each request to the best model | Automatic cost-performance optimisation | Variable-complexity workloads |
| Cascade | Try a cheap model first, escalate if needed | SLM handles simple queries, LLM handles complex ones | Cost optimisation with a quality guarantee |
| Ensemble | Run multiple models, combine results | Multiple opinions improve accuracy | High-stakes decisions needing consensus |
| Hybrid LLM + Rules | LLM handles reasoning, rules engine handles logic | Combines AI flexibility with deterministic rules | Compliance: rules for hard constraints, LLM for nuance |
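The cascade pattern reduces to one conditional: take the cheap model's answer if its confidence clears a threshold, otherwise escalate. A minimal sketch with stub models standing in for real deployments (the models, confidence heuristic, and threshold are all illustrative):

```python
def small_model(prompt: str) -> tuple[str, float]:
    """Stand-in for a cheap SLM that also reports a confidence score."""
    if len(prompt.split()) <= 8:  # crude proxy for "simple query"
        return f"slm-answer: {prompt}", 0.9
    return "unsure", 0.3

def large_model(prompt: str) -> str:
    """Stand-in for an expensive, more capable LLM."""
    return f"llm-answer: {prompt}"

def cascade(prompt: str, threshold: float = 0.7) -> tuple[str, str]:
    """Try the cheap model first; escalate only when confidence is low."""
    answer, confidence = small_model(prompt)
    if confidence >= threshold:
        return "slm", answer          # cheap path: most traffic stays here
    return "llm", large_model(prompt) # expensive path: only hard queries
```

The design question is the escalation signal: real systems use the model's own confidence, a classifier, or output validation rather than prompt length.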

Hybrid LLM + rules engines

| Component | Handles | Example |
|---|---|---|
| Rules engine | Deterministic business logic | “Loans over $1M require senior approval” (no AI needed) |
| LLM | Nuanced reasoning and judgment | “Evaluate whether the applicant’s explanation for the credit gap is reasonable” |
| Integration | Rules engine pre-filters, LLM processes what’s left | Rules check hard requirements first, LLM evaluates soft factors |
ℹ️ Real-world example: Atlas Financial's hybrid system

Atlas Financial’s loan processing uses hybrid orchestration:

Rules engine (deterministic):

  • Credit score below 500 → auto-reject (no LLM needed)
  • Loan amount exceeds policy limit → auto-reject
  • Missing required documents → return to applicant
  • All hard checks pass → forward to LLM analysis

LLM (reasoning):

  • Evaluate employment stability explanation
  • Assess credit gap reasoning
  • Compare to similar approved applications
  • Generate risk assessment narrative

Why hybrid? The rules engine handles 40% of applications instantly (clear pass or fail). The LLM only processes the 60% that need judgment — saving tokens and cost while maintaining deterministic compliance for clear-cut cases.

Key terms

Question

What is a trace in AI observability?


Answer

An end-to-end record of a request's journey through the AI system — every model call, tool invocation, retrieval query, and response. Each step is a 'span' within the trace. Used for debugging, performance analysis, and auditing.


Question

What is a model cascade?


Answer

An orchestration pattern where a cheap/fast model handles requests first, escalating to a more capable model only when needed. Example: Phi-4 handles simple queries, GPT-4o handles complex ones. Optimises cost while maintaining quality.


Question

What is a hybrid LLM + rules engine?


Answer

An architecture that combines deterministic business rules with AI reasoning. Rules handle clear-cut logic (hard constraints), while the LLM handles nuanced judgment. Common in compliance, finance, and healthcare.


Knowledge check

Question

Kai notices that the logistics chatbot's average response time has increased from 2 seconds to 8 seconds, but the model inference time hasn't changed. What should he investigate using tracing?

Question

Atlas Financial processes 100,000 loan applications monthly. 40% are clear approvals or rejections based on simple criteria (credit score, income ratio). 60% need complex analysis. Which orchestration pattern minimises cost?



© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.