Observability & Production Operations
Production AI systems need eyes everywhere. Learn how to set up tracing, token analytics, safety-signal monitoring, and latency tracking, and how to orchestrate multiple models and hybrid engines.
Observability for AI systems
Observability is like having CCTV, speed cameras, and a dashboard for your AI system — you can see everything that’s happening, catch problems early, and know exactly where things went wrong.
Without observability, your AI is a black box. Users complain about slow responses, but you don’t know why. Costs spike, but you don’t know what’s driving them. Quality drops, but you can’t trace it back to a specific change.
The four observability pillars
| Pillar | What It Shows | Key Metrics |
|---|---|---|
| Tracing | Full request journey through the system | Trace ID, span hierarchy, error annotations |
| Token analytics | Token consumption and cost | Tokens per request, cost per conversation, daily/weekly trends |
| Safety signals | Content moderation activity | Filter trigger rate, blocked request %, categories triggered |
| Latency breakdown | Time spent in each stage | Model inference time, tool call latency, retrieval time, total E2E |
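The token-analytics pillar is straightforward to sketch in code. The aggregator below is a minimal illustration, not a production implementation; the per-1K-token prices are hypothetical placeholders, so substitute your model's actual rates.

```python
from collections import defaultdict

# Hypothetical per-1K-token prices -- look up your deployment's real rates.
PRICES = {"gpt-4o": {"input": 0.0025, "output": 0.01}}

class TokenTracker:
    """Aggregates token usage and estimated cost per model."""
    def __init__(self):
        self.usage = defaultdict(lambda: {"input": 0, "output": 0})

    def record(self, model, input_tokens, output_tokens):
        self.usage[model]["input"] += input_tokens
        self.usage[model]["output"] += output_tokens

    def cost(self, model):
        p, u = PRICES[model], self.usage[model]
        return (u["input"] / 1000) * p["input"] + (u["output"] / 1000) * p["output"]

tracker = TokenTracker()
# One request: 2,400 input tokens, 350 output tokens (the example from the tracing table).
tracker.record("gpt-4o", 2400, 350)
print(round(tracker.cost("gpt-4o"), 4))  # 0.0095
```

Aggregating per model (or per conversation ID) is what lets you surface the daily/weekly cost trends listed above.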
Implementing tracing
Tracing records every step of a request:
| Trace Component | What It Captures | Example |
|---|---|---|
| Root span | The entire request lifecycle | User sends “What’s our refund policy?” |
| Retrieval span | Time and results of search queries | Azure AI Search returns 5 documents in 120ms |
| Model span | LLM inference time, token counts | GPT-4o processes 2,400 input tokens, generates 350 output tokens in 1.2s |
| Tool span | External function execution | verify_customer(id) returns in 80ms |
| Error annotation | Any failures along the way | Tool timeout, safety filter block, rate limit hit |
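The span components above can be sketched with a hand-rolled tracer. This is a minimal illustration of the idea (nested spans, durations, error annotations); in production you would use an instrumentation library such as OpenTelemetry rather than rolling your own.

```python
import time
import uuid
from contextlib import contextmanager

class Tracer:
    """Minimal tracer: records named spans with parent/child nesting and durations."""
    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []
        self._stack = []

    @contextmanager
    def span(self, name):
        # Parent is whatever span is currently open (None for the root span).
        parent = self._stack[-1] if self._stack else None
        record = {"name": name, "parent": parent, "error": None}
        self._stack.append(name)
        start = time.perf_counter()
        try:
            yield record
        except Exception as e:
            record["error"] = str(e)  # error annotation on the failing span
            raise
        finally:
            record["ms"] = (time.perf_counter() - start) * 1000
            self._stack.pop()
            self.spans.append(record)

tracer = Tracer()
with tracer.span("request"):        # root span: the whole request lifecycle
    with tracer.span("retrieval"):  # search query against the index
        time.sleep(0.01)
    with tracer.span("model"):      # LLM inference
        time.sleep(0.02)

for s in tracer.spans:
    print(f'{s["parent"] or "-"} -> {s["name"]}: {s["ms"]:.0f}ms')
```

Because every span carries its parent and duration, you can reconstruct the full request journey and see which stage dominated the total time.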
Exam tip: Where latency hides
The exam may ask about optimising latency. Common bottlenecks:
- Retrieval — complex search queries, large indexes, cross-region search
- Model inference — large prompts, verbose system prompts, high max_tokens
- Tool calls — slow external APIs, sequential calls that could be parallel
- Network — cross-region hops between services
Tracing shows exactly where time is spent, so you fix the right bottleneck instead of guessing.
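The "sequential calls that could be parallel" bottleneck is worth seeing concretely. The sketch below simulates two independent tool calls with `asyncio.sleep` stand-ins (the tool names are illustrative); running them concurrently roughly halves the tool-call latency.

```python
import asyncio
import time

async def call_tool(name, delay):
    # Stand-in for a slow external API call.
    await asyncio.sleep(delay)
    return name

async def sequential():
    # Each call waits for the previous one: latencies add up.
    return [await call_tool("verify_customer", 0.1),
            await call_tool("check_inventory", 0.1)]

async def parallel():
    # Independent calls run concurrently: latency is the slowest call, not the sum.
    return await asyncio.gather(call_tool("verify_customer", 0.1),
                                call_tool("check_inventory", 0.1))

start = time.perf_counter()
asyncio.run(sequential())
seq_ms = (time.perf_counter() - start) * 1000

start = time.perf_counter()
asyncio.run(parallel())
par_ms = (time.perf_counter() - start) * 1000

print(f"sequential ~{seq_ms:.0f}ms, parallel ~{par_ms:.0f}ms")
```

This only works when the calls are truly independent; if one tool's output feeds another, they must stay sequential.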
Orchestrating multiple models
Production systems often use more than one model. Orchestration patterns:
| Pattern | How It Works | Benefit | Use Case |
|---|---|---|---|
| Model Router | Route to best model per request | Automatic cost-performance optimisation | Variable complexity workloads |
| Cascade | Try cheap model first, escalate if needed | SLM handles simple, LLM handles complex | Cost optimisation with quality guarantee |
| Ensemble | Run multiple models, combine results | Multiple opinions improve accuracy | High-stakes decisions needing consensus |
| Hybrid LLM + Rules | LLM handles reasoning, rules engine handles logic | Combine AI flexibility with deterministic rules | Compliance: rules for hard constraints, LLM for nuance |
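The cascade pattern can be sketched in a few lines. The two model functions here are stubs for illustration (real ones would call an inference API), and the confidence threshold is an assumed tuning knob.

```python
def cascade(prompt, small_model, large_model, confidence_threshold=0.8):
    """Try the cheap model first; escalate to the large model if confidence is low."""
    answer, confidence = small_model(prompt)
    if confidence >= confidence_threshold:
        return answer, "slm"
    answer, _ = large_model(prompt)
    return answer, "llm"

# Stub models: the SLM is "confident" only on short prompts (a toy proxy for complexity).
slm = lambda p: ("short answer", 0.9 if len(p) < 40 else 0.3)
llm = lambda p: ("detailed answer", 0.95)

print(cascade("What is our refund window?", slm, llm))
print(cascade("Compare refund policies across all regions and summarise edge cases", slm, llm))
```

The cost win comes from the fact that most traffic is simple: the expensive model only sees requests the cheap one couldn't handle confidently.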
Hybrid LLM + rules engines
| Component | Handles | Example |
|---|---|---|
| Rules engine | Deterministic business logic | “Loans over $1M require senior approval” — no AI needed |
| LLM | Nuanced reasoning and judgment | “Evaluate whether the applicant’s explanation for the credit gap is reasonable” |
| Integration | Rules engine pre-filters, LLM processes what’s left | Rules check hard requirements first, LLM evaluates soft factors |
Real-world example: Atlas Financial's hybrid system
Atlas Financial’s loan processing uses hybrid orchestration:
Rules engine (deterministic):
- Credit score below 500 → auto-reject (no LLM needed)
- Loan amount exceeds policy limit → auto-reject
- Missing required documents → return to applicant
- All hard checks pass → forward to LLM analysis
LLM (reasoning):
- Evaluate employment stability explanation
- Assess credit gap reasoning
- Compare to similar approved applications
- Generate risk assessment narrative
Why hybrid? The rules engine handles 40% of applications instantly (clear pass or fail). The LLM only processes the 60% that need judgment — saving tokens and cost while maintaining deterministic compliance for clear-cut cases.
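The pre-filter flow above can be sketched as follows. The thresholds mirror the Atlas Financial example, and the LLM assessment is a stub for illustration; a real system would call a model endpoint at that step.

```python
def rules_engine(app):
    """Deterministic hard checks -- no LLM involved."""
    if app["credit_score"] < 500:
        return "auto-reject"          # clear fail: credit score below policy floor
    if app["amount"] > 1_000_000:
        return "auto-reject"          # clear fail: exceeds policy limit
    if app["missing_docs"]:
        return "return-to-applicant"  # incomplete: send back, no analysis needed
    return None                       # all hard checks pass -> needs judgment

def process(app, llm_assess):
    verdict = rules_engine(app)
    if verdict is not None:
        return verdict       # instant, deterministic, zero tokens
    return llm_assess(app)   # only ambiguous cases reach the model

# Stubbed LLM assessment for illustration.
assess = lambda app: "llm-review: risk assessment narrative generated"

print(process({"credit_score": 450, "amount": 50_000, "missing_docs": False}, assess))
print(process({"credit_score": 700, "amount": 50_000, "missing_docs": False}, assess))
```

Note the ordering: the deterministic checks run first precisely so that clear-cut cases never consume tokens, which is where the cost saving in the example comes from.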
Key terms
Knowledge check
Kai notices that the logistics chatbot's average response time has increased from 2 seconds to 8 seconds, but the model inference time hasn't changed. What should he investigate using tracing?
Atlas Financial processes 100,000 loan applications monthly. 40% are clear approvals or rejections based on simple criteria (credit score, income ratio). 60% need complex analysis. Which orchestration pattern minimises cost?
🎬 Video coming soon