Observability & Production Operations
Production AI systems need eyes everywhere. Learn how to set up tracing, token analytics, safety-signal monitoring, and latency tracking, and how to orchestrate multiple models and hybrid engines.
Observability for AI systems
Observability is like having CCTV, speed cameras, and a dashboard for your AI system — you can see everything that’s happening, catch problems early, and know exactly where things went wrong.
Without observability, your AI is a black box. Users complain about slow responses, but you don’t know why. Costs spike, but you don’t know what’s driving them. Quality drops, but you can’t trace it back to a specific change.
The four observability pillars
| Pillar | What It Shows | Key Metrics |
|---|---|---|
| Tracing | Full request journey through the system | Trace ID, span hierarchy, error annotations |
| Token analytics | Token consumption and cost | Tokens per request, cost per conversation, daily/weekly trends |
| Safety signals | Content moderation activity | Filter trigger rate, blocked request %, categories triggered |
| Latency breakdown | Time spent in each stage | Model inference time, tool call latency, retrieval time, total E2E |
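The token-analytics pillar is straightforward to sketch in code. The aggregator below is a minimal illustration, not a production implementation; the per-1K-token prices are hypothetical placeholders, so substitute your model's actual rates.

```python
from collections import defaultdict

# Hypothetical per-1K-token prices -- look up your deployment's real rates.
PRICES = {"gpt-4o": {"input": 0.0025, "output": 0.01}}

class TokenTracker:
    """Aggregates token usage and estimated cost per model."""
    def __init__(self):
        self.usage = defaultdict(lambda: {"input": 0, "output": 0})

    def record(self, model, input_tokens, output_tokens):
        self.usage[model]["input"] += input_tokens
        self.usage[model]["output"] += output_tokens

    def cost(self, model):
        p, u = PRICES[model], self.usage[model]
        return (u["input"] / 1000) * p["input"] + (u["output"] / 1000) * p["output"]

tracker = TokenTracker()
# One request: 2,400 input tokens, 350 output tokens (the example from the tracing table).
tracker.record("gpt-4o", 2400, 350)
print(round(tracker.cost("gpt-4o"), 4))  # 0.0095
```

Aggregating per model (or per conversation ID) is what lets you surface the daily/weekly cost trends listed above.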
Implementing tracing
Tracing records every step of a request:
| Trace Component | What It Captures | Example |
|---|---|---|
| Root span | The entire request lifecycle | User sends “What’s our refund policy?” |
| Retrieval span | Time and results of search queries | Azure AI Search returns 5 documents in 120ms |
| Model span | LLM inference time, token counts | GPT-4o processes 2,400 input tokens, generates 350 output tokens in 1.2s |
| Tool span | External function execution | verify_customer(id) returns in 80ms |
| Error annotation | Any failures along the way | Tool timeout, safety filter block, rate limit hit |
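The span components above can be sketched with a hand-rolled tracer. This is a minimal illustration of the idea (nested spans, durations, error annotations); in production you would use an instrumentation library such as OpenTelemetry rather than rolling your own.

```python
import time
import uuid
from contextlib import contextmanager

class Tracer:
    """Minimal tracer: records named spans with parent/child nesting and durations."""
    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []
        self._stack = []

    @contextmanager
    def span(self, name):
        # Parent is whatever span is currently open (None for the root span).
        parent = self._stack[-1] if self._stack else None
        record = {"name": name, "parent": parent, "error": None}
        self._stack.append(name)
        start = time.perf_counter()
        try:
            yield record
        except Exception as e:
            record["error"] = str(e)  # error annotation on the failing span
            raise
        finally:
            record["ms"] = (time.perf_counter() - start) * 1000
            self._stack.pop()
            self.spans.append(record)

tracer = Tracer()
with tracer.span("request"):        # root span: the whole request lifecycle
    with tracer.span("retrieval"):  # search query against the index
        time.sleep(0.01)
    with tracer.span("model"):      # LLM inference
        time.sleep(0.02)

for s in tracer.spans:
    print(f'{s["parent"] or "-"} -> {s["name"]}: {s["ms"]:.0f}ms')
```

Because every span carries its parent and duration, you can reconstruct the full request journey and see which stage dominated the total time.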
Exam tip: Where latency hides
The exam may ask about optimising latency. Common bottlenecks:
- Retrieval — complex search queries, large indexes, cross-region search
- Model inference — large prompts, verbose system prompts, high max_tokens
- Tool calls — slow external APIs, sequential calls that could be parallel
- Network — cross-region hops between services
Tracing shows exactly where time is spent, so you fix the right bottleneck instead of guessing.
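The "sequential calls that could be parallel" bottleneck is worth seeing concretely. The sketch below simulates two independent tool calls with `asyncio.sleep` stand-ins (the tool names are illustrative); running them concurrently roughly halves the tool-call latency.

```python
import asyncio
import time

async def call_tool(name, delay):
    # Stand-in for a slow external API call.
    await asyncio.sleep(delay)
    return name

async def sequential():
    # Each call waits for the previous one: latencies add up.
    return [await call_tool("verify_customer", 0.1),
            await call_tool("check_inventory", 0.1)]

async def parallel():
    # Independent calls run concurrently: latency is the slowest call, not the sum.
    return await asyncio.gather(call_tool("verify_customer", 0.1),
                                call_tool("check_inventory", 0.1))

start = time.perf_counter()
asyncio.run(sequential())
seq_ms = (time.perf_counter() - start) * 1000

start = time.perf_counter()
asyncio.run(parallel())
par_ms = (time.perf_counter() - start) * 1000

print(f"sequential ~{seq_ms:.0f}ms, parallel ~{par_ms:.0f}ms")
```

This only works when the calls are truly independent; if one tool's output feeds another, they must stay sequential.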
Orchestrating multiple models
Production systems often use more than one model. Orchestration patterns:
| Pattern | How It Works | Benefit | Use Case |
|---|---|---|---|
| Model Router | Route to best model per request | Automatic cost-performance optimisation | Variable complexity workloads |
| Cascade | Try cheap model first, escalate if needed | SLM handles simple, LLM handles complex | Cost optimisation with quality guarantee |
| Ensemble | Run multiple models, combine results | Multiple opinions improve accuracy | High-stakes decisions needing consensus |
| Hybrid LLM + Rules | LLM handles reasoning, rules engine handles logic | Combine AI flexibility with deterministic rules | Compliance: rules for hard constraints, LLM for nuance |
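The cascade pattern can be sketched in a few lines. The two model functions here are stubs for illustration (real ones would call an inference API), and the confidence threshold is an assumed tuning knob.

```python
def cascade(prompt, small_model, large_model, confidence_threshold=0.8):
    """Try the cheap model first; escalate to the large model if confidence is low."""
    answer, confidence = small_model(prompt)
    if confidence >= confidence_threshold:
        return answer, "slm"
    answer, _ = large_model(prompt)
    return answer, "llm"

# Stub models: the SLM is "confident" only on short prompts (a toy proxy for complexity).
slm = lambda p: ("short answer", 0.9 if len(p) < 40 else 0.3)
llm = lambda p: ("detailed answer", 0.95)

print(cascade("What is our refund window?", slm, llm))
print(cascade("Compare refund policies across all regions and summarise edge cases", slm, llm))
```

The cost win comes from the fact that most traffic is simple: the expensive model only sees requests the cheap one couldn't handle confidently.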
Hybrid LLM + rules engines
| Component | Handles | Example |
|---|---|---|
| Rules engine | Deterministic business logic | “Loans over $1M require senior approval” — no AI needed |
| LLM | Nuanced reasoning and judgment | “Evaluate whether the applicant’s explanation for the credit gap is reasonable” |
| Integration | Rules engine pre-filters, LLM processes what’s left | Rules check hard requirements first, LLM evaluates soft factors |
Real-world example: Atlas Financial's hybrid system
Atlas Financial’s loan processing uses hybrid orchestration:
Rules engine (deterministic):
- Credit score below 500 → auto-reject (no LLM needed)
- Loan amount exceeds policy limit → auto-reject
- Missing required documents → return to applicant
- All hard checks pass → forward to LLM analysis
LLM (reasoning):
- Evaluate employment stability explanation
- Assess credit gap reasoning
- Compare to similar approved applications
- Generate risk assessment narrative
Why hybrid? The rules engine handles 40% of applications instantly (clear pass or fail). The LLM only processes the 60% that need judgment — saving tokens and cost while maintaining deterministic compliance for clear-cut cases.
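The pre-filter flow above can be sketched as follows. The thresholds mirror the Atlas Financial example, and the LLM assessment is a stub for illustration; a real system would call a model endpoint at that step.

```python
def rules_engine(app):
    """Deterministic hard checks -- no LLM involved."""
    if app["credit_score"] < 500:
        return "auto-reject"          # clear fail: credit score below policy floor
    if app["amount"] > 1_000_000:
        return "auto-reject"          # clear fail: exceeds policy limit
    if app["missing_docs"]:
        return "return-to-applicant"  # incomplete: send back, no analysis needed
    return None                       # all hard checks pass -> needs judgment

def process(app, llm_assess):
    verdict = rules_engine(app)
    if verdict is not None:
        return verdict       # instant, deterministic, zero tokens
    return llm_assess(app)   # only ambiguous cases reach the model

# Stubbed LLM assessment for illustration.
assess = lambda app: "llm-review: risk assessment narrative generated"

print(process({"credit_score": 450, "amount": 50_000, "missing_docs": False}, assess))
print(process({"credit_score": 700, "amount": 50_000, "missing_docs": False}, assess))
```

Note the ordering: the deterministic checks run first precisely so that clear-cut cases never consume tokens, which is where the cost saving in the example comes from.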
Key terms
Knowledge check
Kai notices that the logistics chatbot's average response time has increased from 2 seconds to 8 seconds, but the model inference time hasn't changed. What should he investigate using tracing?
Atlas Financial processes 100,000 loan applications monthly. 40% are clear approvals or rejections based on simple criteria (credit score, income ratio). 60% need complex analysis. Which orchestration pattern minimises cost?
🎬 Video coming soon