Monitoring GenAI in Production
Your GenAI app is live. Now keep it healthy. Learn to monitor latency, throughput, and error rates, and set up continuous monitoring in Azure AI Foundry for production reliability.
Why monitor GenAI differently?
Monitoring is like security cameras in a restaurant.
You don't wait for the health inspector to tell you there's a problem; you watch in real time. Are orders taking too long? Is the kitchen backed up? Did someone slip in the hallway?
GenAI monitoring works the same way. Your app is live and customers are using it. You need to see: How fast are responses? Are errors spiking? Is the model suddenly slower? Did costs jump overnight?
Without monitoring, you only find out about problems when users complain, and by then hundreds of bad experiences have already happened.
Key operational metrics
These are the metrics you watch in real-time on your production dashboard:
| Feature | What It Measures | Healthy Range | Alert When |
|---|---|---|---|
| Latency (P50) | Median response time | Under 2 seconds | Exceeds 3 seconds |
| Latency (P95) | 95th percentile response time | Under 5 seconds | Exceeds 8 seconds |
| Latency (P99) | 99th percentile (worst-case users) | Under 10 seconds | Exceeds 15 seconds |
| Throughput (RPS) | Requests per second handled | Matches expected traffic | Drops below baseline by 20% |
| Error Rate | Percentage of failed requests | Under 0.1% | Exceeds 1% |
| Availability | Uptime percentage | 99.9%+ | Falls below 99.5% |
Why percentiles matter more than averages
Average latency hides problems. If 95% of requests take 1 second but 5% take 30 seconds, your average is just 2.45 seconds, which looks fine. But 5% of your users are having a terrible experience.
Percentile breakdown:
- P50 (median): half your users are faster than this
- P95: 95% of users are faster; the remaining 5% are slower
- P99: only 1 in 100 users experiences worse latency than this
The exam focuses on P95 as the primary latency metric for production systems.
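The arithmetic above is easy to check yourself. A minimal sketch in Python; the latency values are invented for illustration, and the nearest-rank `percentile` helper is one of several common definitions:

```python
# Toy latency sample: 95 fast requests (1s) and 5 slow ones (30s)
latencies = sorted([1.0] * 95 + [30.0] * 5)

def percentile(sorted_values, p):
    # Nearest-rank percentile over a pre-sorted list
    k = min(len(sorted_values) - 1, int(len(sorted_values) * p / 100))
    return sorted_values[k]

average = sum(latencies) / len(latencies)
p50, p95, p99 = (percentile(latencies, p) for p in (50, 95, 99))

print(f"avg={average:.2f}s p50={p50}s p95={p95}s p99={p99}s")
# avg=2.45s p50=1.0s p95=30.0s p99=30.0s
```

The average sits at a healthy-looking 2.45s while P95 and P99 expose the 30-second tail, which is exactly why dashboards track percentiles.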
Exam tip: P95 over average, degradation rate over absolute
Two key exam patterns for monitoring:
- Use P95 latency, not average: average masks tail latency issues. If asked "which metric best represents user experience," choose P95.
- Alert on degradation rate, not absolute thresholds: if your normal P95 is 3s and it jumps to 6s, that's a 100% degradation even though 6s might seem "acceptable" in isolation. Relative change catches problems faster than fixed thresholds.
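The degradation-rate pattern can be sketched as a simple check; the `latency_degraded` name and the 50% default threshold are illustrative choices, not Foundry defaults:

```python
def latency_degraded(current_p95, baseline_p95, max_increase=0.5):
    """Relative alert: fire when P95 exceeds the baseline
    by more than max_increase (50% by default)."""
    return current_p95 > baseline_p95 * (1 + max_increase)

# Normal P95 is 3s; a jump to 6s is a 100% degradation
print(latency_degraded(6.0, baseline_p95=3.0))  # True: alert fires
print(latency_degraded(6.0, baseline_p95=5.5))  # False: same 6s, but close to baseline
```

The same 6-second reading fires in one case and not the other, which is the point: the alert tracks change relative to normal behaviour, not a fixed number.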
Continuous monitoring in Azure AI Foundry
Azure AI Foundry provides a monitoring dashboard (preview) that tracks both operational and quality metrics from production traffic. For comprehensive observability in production, combine the Foundry dashboard with Application Insights.
Setting up monitoring
- Enable data collection: configure your deployed endpoint to log requests and responses
- Configure sampling: for high-traffic apps, sample a percentage of requests for quality evaluation (running quality evaluators on every request is too expensive)
- Set up scheduled evaluations: run quality metrics on sampled data at regular intervals (hourly, daily)
- Configure alerts: define thresholds for both operational and quality metric degradation
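The sampling step often comes down to a per-request coin flip. A minimal sketch, assuming a `should_evaluate` helper you would call in your request handler:

```python
import random

def should_evaluate(sampling_rate: float) -> bool:
    # Randomly select this request for quality evaluation
    return random.random() < sampling_rate

# At a 10% rate, roughly 1 in 10 requests gets a quality score
sampled = sum(should_evaluate(0.10) for _ in range(10_000))
print(f"{sampled} of 10000 requests sampled")
```

Requests that pass the check are queued for the scheduled quality evaluation; the rest only contribute operational telemetry, which keeps LLM-judge costs proportional to the sampling rate.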
What gets monitored
| Layer | Metrics | Source |
|---|---|---|
| Infrastructure | CPU/GPU utilisation, memory, network | Azure Monitor |
| Endpoint | Latency, throughput, error rate, availability | Application Insights |
| Model quality | Groundedness, relevance, coherence (sampled) | Azure AI Foundry evaluation |
| Safety | Content safety flags, jailbreak attempts | Azure AI Content Safety |
Scenario: Kai monitors NeuralSpark's support bot during peak hours
Kai Nakamura deployed NeuralSpark's customer support bot last week. It's now handling 500 requests per hour during business hours. Priya (CTO) wants visibility into performance.
Kai sets up monitoring:
- Operational dashboard: P50, P95, P99 latency, error rate, throughput, refreshing every minute
- Quality sampling: 10% of requests get quality evaluation (groundedness + relevance), run hourly
- Alerts configured:
  - P95 latency exceeds 5s → Slack notification to on-call
  - Error rate exceeds 1% → PagerDuty alert
  - Groundedness drops below 3.5 → email to ML team
On Wednesday at 2pm, the P95 latency alert fires: latency jumped from 3s to 8s. Kai investigates and finds the Azure OpenAI endpoint is throttled because another team's batch job consumed the shared quota. He implements per-application rate limits to prevent recurrence.
Application Insights integration
Application Insights is the primary telemetry tool for GenAI tracing in Azure. It captures:
- Request telemetry: every API call with timing, status code, and size
- Dependency telemetry: calls to Azure OpenAI, AI Search, and other downstream services
- Custom events: token counts, model version, prompt template used
- Exceptions: errors with full stack traces
```python
from opentelemetry import trace
from azure.monitor.opentelemetry import configure_azure_monitor

# Enable Application Insights tracing
configure_azure_monitor(
    connection_string="InstrumentationKey=your-key-here"
)

tracer = trace.get_tracer(__name__)

# Trace a GenAI request (call_model and prompt stand in for your own model-invocation code)
with tracer.start_as_current_span("genai-request") as span:
    span.set_attribute("genai.model", "gpt-4o")
    span.set_attribute("genai.prompt_tokens", 150)
    span.set_attribute("genai.completion_tokens", 320)
    span.set_attribute("genai.total_tokens", 470)
    response = call_model(prompt)
    span.set_attribute("genai.finish_reason", "stop")
```
What's happening:
- configure_azure_monitor() registers Application Insights as the OpenTelemetry exporter, so all traces flow to your dashboard
- trace.get_tracer() creates a tracer for your application
- Each GenAI call is wrapped in a span that records the model name, token counts, and finish reason
- The result is per-request visibility: which requests are slow, which use the most tokens, and which error
Setting up dashboards and alerts
Key dashboard panels
A well-designed GenAI monitoring dashboard includes:
| Panel | Visualisation | Time Range |
|---|---|---|
| Latency distribution | Histogram with P50/P95/P99 lines | Last 1 hour |
| Error rate | Line chart with threshold line | Last 24 hours |
| Throughput | Stacked area (by endpoint) | Last 24 hours |
| Token consumption | Bar chart (input vs output tokens) | Last 7 days |
| Quality scores | Trend line (groundedness, relevance) | Last 7 days |
| Safety flags | Count of flagged responses | Last 24 hours |
Alert best practices
- Latency: Alert on P95 degradation (not absolute). If P95 increases by 50% compared to the 7-day baseline, fire an alert
- Error rate: Alert when error rate exceeds 2x the normal baseline
- Quality: Alert when sampled groundedness drops below the threshold you set during evaluation
- Token cost: Alert when daily token spend exceeds 120% of the 7-day average
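The token-cost rule above can be expressed as a small check; the `spend_alert` name and the sample figures are illustrative:

```python
def spend_alert(daily_spend, last_7_days, threshold=1.2):
    """Fire when today's spend exceeds 120% of the 7-day average."""
    baseline = sum(last_7_days) / len(last_7_days)
    return daily_spend > baseline * threshold

week = [100, 110, 95, 105, 90, 100, 100]  # daily token spend, averages to 100

print(spend_alert(130, week))  # True: 130 > 1.2 * 100
print(spend_alert(115, week))  # False: within the 120% band
```

The same baseline-relative shape works for the latency and error-rate rules; only the metric and the multiplier change.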
Exam tip: Sampling for quality monitoring
Running quality evaluators (groundedness, relevance) on every production request is prohibitively expensive: each evaluation requires an LLM judge call. Production monitoring uses sampling:
- High-traffic apps: evaluate 1-5% of requests
- Medium-traffic apps: evaluate 10-20% of requests
- Low-traffic or high-risk apps: evaluate 100% of requests
The exam may ask about the trade-off between evaluation coverage and cost. The answer: sample, and increase sampling rate for critical or high-risk applications.
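These tiers can be captured in a small helper; the traffic cut-offs (100 and 2,000 requests per hour) and the exact rates are illustrative assumptions, not exam-defined values:

```python
def sampling_rate(requests_per_hour: int, high_risk: bool = False) -> float:
    """Pick a quality-evaluation sampling rate from the traffic tier."""
    if high_risk or requests_per_hour < 100:
        return 1.0   # low-traffic or high-risk: evaluate every request
    if requests_per_hour < 2_000:
        return 0.15  # medium traffic: 10-20% of requests
    return 0.03      # high traffic: 1-5% of requests

print(sampling_rate(10_000))                 # 0.03
print(sampling_rate(500))                    # 0.15
print(sampling_rate(10_000, high_risk=True)) # 1.0
```

Note how the high-risk flag overrides the traffic tier: for critical applications, coverage wins over cost.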
Knowledge check
Kai's monitoring dashboard shows: P50 latency = 1.2s, P95 = 4.8s, P99 = 25s, Average = 2.1s. The SLA requires 95% of requests under 5 seconds. Is the system meeting its SLA?
Dr. Fatima wants to monitor the quality of Meridian's production chatbot but the system handles 10,000 requests per hour. Running quality evaluators on every request would double her Azure AI costs. What should she do?
🎬 Video coming soon
Next up: Cost Tracking, Logging & Debugging, because GenAI costs scale with every token.