Monitoring GenAI in Production
Your GenAI app is live. Now keep it healthy. Learn to monitor latency, throughput, and error rates, and set up continuous monitoring in Azure AI Foundry for production reliability.
Why monitor GenAI differently?
Monitoring is like security cameras in a restaurant.
You don't wait for the health inspector to tell you there's a problem; you watch in real time. Are orders taking too long? Is the kitchen backed up? Did someone slip in the hallway?
GenAI monitoring works the same way. Your app is live and customers are using it. You need to see: How fast are responses? Are errors spiking? Is the model suddenly slower? Did costs jump overnight?
Without monitoring, you only find out about problems when users complain, and by then hundreds of bad experiences have already happened.
Key operational metrics
These are the metrics you watch in real-time on your production dashboard:
| Feature | What It Measures | Healthy Range | Alert When |
|---|---|---|---|
| Latency (P50) | Median response time | Under 2 seconds | Exceeds 3 seconds |
| Latency (P95) | 95th percentile response time | Under 5 seconds | Exceeds 8 seconds |
| Latency (P99) | 99th percentile (worst-case users) | Under 10 seconds | Exceeds 15 seconds |
| Throughput (RPS) | Requests per second handled | Matches expected traffic | Drops below baseline by 20% |
| Error Rate | Percentage of failed requests | Under 0.1% | Exceeds 1% |
| Availability | Uptime percentage | 99.9%+ | Falls below 99.5% |
Why percentiles matter more than averages
Average latency hides problems. If 95% of requests take 1 second but 5% take 30 seconds, your average is just 2.45 seconds, which looks fine. But 5% of your users are having a terrible experience.
Percentile breakdown:
- P50 (median): half your users are faster than this
- P95: 95% of users are faster; the remaining 5% are slower
- P99: only 1 in 100 users experiences worse latency than this
The exam focuses on P95 as the primary latency metric for production systems.
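The arithmetic above is easy to check yourself. A minimal sketch in Python; the latency values are invented for illustration, and the nearest-rank `percentile` helper is one of several common definitions:

```python
# Toy latency sample: 95 fast requests (1s) and 5 slow ones (30s)
latencies = sorted([1.0] * 95 + [30.0] * 5)

def percentile(sorted_values, p):
    # Nearest-rank percentile over a pre-sorted list
    k = min(len(sorted_values) - 1, int(len(sorted_values) * p / 100))
    return sorted_values[k]

average = sum(latencies) / len(latencies)
p50, p95, p99 = (percentile(latencies, p) for p in (50, 95, 99))

print(f"avg={average:.2f}s p50={p50}s p95={p95}s p99={p99}s")
# avg=2.45s p50=1.0s p95=30.0s p99=30.0s
```

The average sits at a healthy-looking 2.45s while P95 and P99 expose the 30-second tail, which is exactly why dashboards track percentiles.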
Exam tip: P95 over average, degradation rate over absolute
Two key exam patterns for monitoring:
- Use P95 latency, not average: average masks tail latency issues. If asked "which metric best represents user experience," choose P95.
- Alert on degradation rate, not absolute thresholds: if your normal P95 is 3s and it jumps to 6s, that's a 100% degradation even though 6s might seem "acceptable" in isolation. Relative change catches problems faster than fixed thresholds.
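The degradation-rate pattern can be sketched as a simple check; the `latency_degraded` name and the 50% default threshold are illustrative choices, not Foundry defaults:

```python
def latency_degraded(current_p95, baseline_p95, max_increase=0.5):
    """Relative alert: fire when P95 exceeds the baseline
    by more than max_increase (50% by default)."""
    return current_p95 > baseline_p95 * (1 + max_increase)

# Normal P95 is 3s; a jump to 6s is a 100% degradation
print(latency_degraded(6.0, baseline_p95=3.0))  # True: alert fires
print(latency_degraded(6.0, baseline_p95=5.5))  # False: same 6s, but close to baseline
```

The same 6-second reading fires in one case and not the other, which is the point: the alert tracks change relative to normal behaviour, not a fixed number.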
Continuous monitoring in Azure AI Foundry
Azure AI Foundry provides a monitoring dashboard (preview) that tracks both operational and quality metrics from production traffic. For comprehensive observability in production, combine the Foundry dashboard with Application Insights.
Setting up monitoring
- Enable data collection: configure your deployed endpoint to log requests and responses
- Configure sampling: for high-traffic apps, sample a percentage of requests for quality evaluation (running quality evaluators on every request is too expensive)
- Set up scheduled evaluations: run quality metrics on sampled data at regular intervals (hourly, daily)
- Configure alerts: define thresholds for both operational and quality metric degradation
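The sampling step often comes down to a per-request coin flip. A minimal sketch, assuming a `should_evaluate` helper you would call in your request handler:

```python
import random

def should_evaluate(sampling_rate: float) -> bool:
    # Randomly select this request for quality evaluation
    return random.random() < sampling_rate

# At a 10% rate, roughly 1 in 10 requests gets a quality score
sampled = sum(should_evaluate(0.10) for _ in range(10_000))
print(f"{sampled} of 10000 requests sampled")
```

Requests that pass the check are queued for the scheduled quality evaluation; the rest only contribute operational telemetry, which keeps LLM-judge costs proportional to the sampling rate.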
What gets monitored
| Layer | Metrics | Source |
|---|---|---|
| Infrastructure | CPU/GPU utilisation, memory, network | Azure Monitor |
| Endpoint | Latency, throughput, error rate, availability | Application Insights |
| Model quality | Groundedness, relevance, coherence (sampled) | Azure AI Foundry evaluation |
| Safety | Content safety flags, jailbreak attempts | Azure AI Content Safety |
Scenario: Kai monitors NeuralSpark's support bot during peak hours
Kai Nakamura deployed NeuralSpark's customer support bot last week. It's now handling 500 requests per hour during business hours. Priya (CTO) wants visibility into performance.
Kai sets up monitoring:
- Operational dashboard: P50, P95, P99 latency, error rate, throughput, refreshing every minute
- Quality sampling: 10% of requests get quality evaluation (groundedness + relevance), run hourly
- Alerts configured:
  - P95 latency exceeds 5s → Slack notification to on-call
  - Error rate exceeds 1% → PagerDuty alert
  - Groundedness drops below 3.5 → email to ML team
On Wednesday at 2pm, the P95 latency alert fires: latency jumped from 3s to 8s. Kai investigates and finds the Azure OpenAI endpoint is throttled because another team's batch job consumed the shared quota. He implements per-application rate limits to prevent recurrence.
Application Insights integration
Application Insights is the primary telemetry tool for GenAI tracing in Azure. It captures:
- Request telemetry: every API call with timing, status code, and size
- Dependency telemetry: calls to Azure OpenAI, AI Search, and other downstream services
- Custom events: token counts, model version, prompt template used
- Exceptions: errors with full stack traces
```python
from opentelemetry import trace
from azure.monitor.opentelemetry import configure_azure_monitor

# Enable Application Insights tracing
configure_azure_monitor(
    connection_string="InstrumentationKey=your-key-here"
)

tracer = trace.get_tracer(__name__)

# Trace a GenAI request (call_model and prompt stand in for your own model-invocation code)
with tracer.start_as_current_span("genai-request") as span:
    span.set_attribute("genai.model", "gpt-4o")
    span.set_attribute("genai.prompt_tokens", 150)
    span.set_attribute("genai.completion_tokens", 320)
    span.set_attribute("genai.total_tokens", 470)
    response = call_model(prompt)
    span.set_attribute("genai.finish_reason", "stop")
```
What's happening:
- configure_azure_monitor() registers Application Insights as the OpenTelemetry exporter, so all traces flow to your dashboard
- trace.get_tracer() creates a tracer for your application
- Each GenAI call is wrapped in a span that records the model name, token counts, and finish reason
- The result is per-request visibility: which requests are slow, which use the most tokens, and which error
Setting up dashboards and alerts
Key dashboard panels
A well-designed GenAI monitoring dashboard includes:
| Panel | Visualisation | Time Range |
|---|---|---|
| Latency distribution | Histogram with P50/P95/P99 lines | Last 1 hour |
| Error rate | Line chart with threshold line | Last 24 hours |
| Throughput | Stacked area (by endpoint) | Last 24 hours |
| Token consumption | Bar chart (input vs output tokens) | Last 7 days |
| Quality scores | Trend line (groundedness, relevance) | Last 7 days |
| Safety flags | Count of flagged responses | Last 24 hours |
Alert best practices
- Latency: Alert on P95 degradation (not absolute). If P95 increases by 50% compared to the 7-day baseline, fire an alert
- Error rate: Alert when error rate exceeds 2x the normal baseline
- Quality: Alert when sampled groundedness drops below the threshold you set during evaluation
- Token cost: Alert when daily token spend exceeds 120% of the 7-day average
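The token-cost rule above can be expressed as a small check; the `spend_alert` name and the sample figures are illustrative:

```python
def spend_alert(daily_spend, last_7_days, threshold=1.2):
    """Fire when today's spend exceeds 120% of the 7-day average."""
    baseline = sum(last_7_days) / len(last_7_days)
    return daily_spend > baseline * threshold

week = [100, 110, 95, 105, 90, 100, 100]  # daily token spend, averages to 100

print(spend_alert(130, week))  # True: 130 > 1.2 * 100
print(spend_alert(115, week))  # False: within the 120% band
```

The same baseline-relative shape works for the latency and error-rate rules; only the metric and the multiplier change.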
Exam tip: Sampling for quality monitoring
Running quality evaluators (groundedness, relevance) on every production request is prohibitively expensive: each evaluation requires an LLM judge call. Production monitoring uses sampling:
- High-traffic apps: evaluate 1-5% of requests
- Medium-traffic apps: evaluate 10-20% of requests
- Low-traffic or high-risk apps: evaluate 100% of requests
The exam may ask about the trade-off between evaluation coverage and cost. The answer: sample, and increase sampling rate for critical or high-risk applications.
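These tiers can be captured in a small helper; the traffic cut-offs (100 and 2,000 requests per hour) and the exact rates are illustrative assumptions, not exam-defined values:

```python
def sampling_rate(requests_per_hour: int, high_risk: bool = False) -> float:
    """Pick a quality-evaluation sampling rate from the traffic tier."""
    if high_risk or requests_per_hour < 100:
        return 1.0   # low-traffic or high-risk: evaluate every request
    if requests_per_hour < 2_000:
        return 0.15  # medium traffic: 10-20% of requests
    return 0.03      # high traffic: 1-5% of requests

print(sampling_rate(10_000))                 # 0.03
print(sampling_rate(500))                    # 0.15
print(sampling_rate(10_000, high_risk=True)) # 1.0
```

Note how the high-risk flag overrides the traffic tier: for critical applications, coverage wins over cost.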
Knowledge check
Kai's monitoring dashboard shows: P50 latency = 1.2s, P95 = 4.8s, P99 = 25s, Average = 2.1s. The SLA requires 95% of requests under 5 seconds. Is the system meeting its SLA?
Dr. Fatima wants to monitor the quality of Meridian's production chatbot but the system handles 10,000 requests per hour. Running quality evaluators on every request would double her Azure AI costs. What should she do?
🎬 Video coming soon
Next up: Cost Tracking, Logging & Debugging, because GenAI costs scale with every token.