
AI-300 Study Guide

Domain 1: Design and Implement an MLOps Infrastructure

  • ML Workspace: Your AI Control Room Free
  • Data, Environments & Components
  • Compute Targets: Choosing the Right Engine
  • Infrastructure as Code: Provisioning at Scale
  • Git & CI/CD for ML Projects

Domain 2: Implement Machine Learning Model Lifecycle and Operations

  • MLflow: Track Every Experiment Free
  • AutoML & Hyperparameter Tuning
  • Training Pipelines: Automate Everything
  • Distributed Training: Scale to Big Data
  • Model Registration & Versioning
  • Model Approval & Responsible AI Gates
  • Deploying Models: Endpoints in Production
  • Drift, Monitoring & Retraining

Domain 3: Design and Implement a GenAIOps Infrastructure

  • Foundry: Hubs, Projects & Platform Setup Free
  • Network Security & IaC for Foundry
  • Deploying Foundation Models
  • Model Versioning & Production Strategies
  • PromptOps: Design, Compare, Version & Ship

Domain 4: Implement Generative AI Quality Assurance and Observability

  • Evaluation: Datasets, Metrics & Quality Gates Free
  • Safety Evaluations & Custom Metrics
  • Monitoring GenAI in Production
  • Cost Tracking, Logging & Debugging

Domain 5: Optimize Generative AI Systems and Model Performance

  • RAG Optimization: Better Retrieval, Better Answers Free
  • Embeddings & Hybrid Search
  • Fine-Tuning: Methods, Data & Production

Domain 4: Implement Generative AI Quality Assurance and Observability (~13 min read)

Monitoring GenAI in Production

Your GenAI app is live. Now keep it healthy. Learn to monitor latency, throughput, response times, and set up continuous monitoring in Foundry for production reliability.

Why monitor GenAI differently?

☕ Simple explanation

Monitoring is like security cameras in a restaurant.

You don't wait for the health inspector to tell you there's a problem — you watch in real time. Are orders taking too long? Is the kitchen backed up? Did someone slip in the hallway?

GenAI monitoring works the same way. Your app is live and customers are using it. You need to see: How fast are responses? Are errors spiking? Is the model suddenly slower? Did costs jump overnight?

Without monitoring, you only find out about problems when users complain — and by then, hundreds of bad experiences have already happened.

GenAI production monitoring differs from traditional application monitoring in several ways:

  • Non-determinism — the same input can produce different outputs, making it harder to define "correct behaviour"
  • Quality degradation — models can drift in quality without throwing errors
  • Cost sensitivity — every request costs tokens, making runaway usage a financial risk
  • Latency variability — response times depend on prompt length, model load, and generation length

Azure AI Foundry provides continuous monitoring that combines operational metrics (latency, throughput, errors) with quality metrics (groundedness, relevance) sampled from production traffic.

Key operational metrics

These are the metrics you watch in real time on your production dashboard:

Key operational metrics for GenAI production systems
| Metric | What It Measures | Healthy Range | Alert When |
|---|---|---|---|
| Latency (P50) | Median response time | Under 2 seconds | Exceeds 3 seconds |
| Latency (P95) | 95th percentile response time | Under 5 seconds | Exceeds 8 seconds |
| Latency (P99) | 99th percentile — worst-case users | Under 10 seconds | Exceeds 15 seconds |
| Throughput (RPS) | Requests per second handled | Matches expected traffic | Drops below baseline by 20% |
| Error Rate | Percentage of failed requests | Under 0.1% | Exceeds 1% |
| Availability | Uptime percentage | 99.9%+ | Falls below 99.5% |

Why percentiles matter more than averages

Average latency hides problems. If 95% of requests take 1 second but 5% take 30 seconds, your average is about 2.5 seconds — looks fine! But 5% of your users are having a terrible experience.

Percentile breakdown:

  • P50 (median) — half your users are faster than this
  • P95 — 95% of users are faster; the remaining 5% are slower
  • P99 — only 1 in 100 users experiences worse latency than this
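
The percentile breakdown above can be made concrete with a small nearest-rank calculation. The `percentile` helper below is an illustrative sketch, not a library function; real dashboards compute this for you:

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    k = math.ceil(p / 100 * len(ordered)) - 1
    return ordered[max(0, k)]

# 94 fast requests (~1s) and 6 slow ones (~30s): the tail hides in the average
latencies = [1.0] * 94 + [30.0] * 6

print(sum(latencies) / len(latencies))  # 2.74, the average looks fine
print(percentile(latencies, 50))        # 1.0, the median user is happy
print(percentile(latencies, 95))        # 30.0, the tail is exposed
```

The same data that produces a comfortable-looking average surfaces a 30-second P95 the moment you look at percentiles.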

The exam focuses on P95 as the primary latency metric for production systems.

💡 Exam tip: P95 over average, degradation rate over absolute

Two key exam patterns for monitoring:

  1. Use P95 latency, not average — average masks tail latency issues. If asked "which metric best represents user experience," choose P95.

  2. Alert on degradation rate, not absolute thresholds — if your normal P95 is 3s and it jumps to 6s, that's a 100% degradation even though 6s might seem "acceptable" in isolation. Relative change catches problems faster than fixed thresholds.
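
Pattern 2 can be sketched as a simple relative-change check. The `p95_degraded` helper and the 50% default below are illustrative, not a Foundry API:

```python
def p95_degraded(current_p95: float, baseline_p95: float, max_increase: float = 0.5) -> bool:
    """Alert when current P95 exceeds the baseline by more than max_increase (50% by default)."""
    return (current_p95 - baseline_p95) / baseline_p95 > max_increase

# Baseline P95 of 3s jumping to 6s is a 100% degradation, so the alert fires,
# even though a fixed 8s ceiling would have stayed silent at 6s.
print(p95_degraded(6.0, 3.0))  # True
print(p95_degraded(3.5, 3.0))  # False, within normal variation
```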

Continuous monitoring in Azure AI Foundry

Azure AI Foundry provides a monitoring dashboard (currently in preview) that tracks both operational and quality metrics from production traffic. For comprehensive observability in production, combine the Foundry dashboard with Application Insights.

Setting up monitoring

  1. Enable data collection — configure your deployed endpoint to log requests and responses
  2. Configure sampling — for high-traffic apps, sample a percentage of requests for quality evaluation (running quality evaluators on every request is too expensive)
  3. Set up scheduled evaluations — run quality metrics on sampled data at regular intervals (hourly, daily)
  4. Configure alerts — define thresholds for both operational and quality metric degradation

What gets monitored

| Layer | Metrics | Source |
|---|---|---|
| Infrastructure | CPU/GPU utilisation, memory, network | Azure Monitor |
| Endpoint | Latency, throughput, error rate, availability | Application Insights |
| Model quality | Groundedness, relevance, coherence (sampled) | Azure AI Foundry evaluation |
| Safety | Content safety flags, jailbreak attempts | Azure AI Content Safety |

Scenario: Kai monitors NeuralSpark's support bot during peak hours

Kai Nakamura deployed NeuralSpark's customer support bot last week. It's now handling 500 requests per hour during business hours. Priya (CTO) wants visibility into performance.

Kai sets up monitoring:

  • Operational dashboard: P50, P95, P99 latency, error rate, throughput — refreshes every minute
  • Quality sampling: 10% of requests get quality evaluation (groundedness + relevance) — runs hourly
  • Alerts configured:
    • P95 latency exceeds 5s → Slack notification to on-call
    • Error rate exceeds 1% → PagerDuty alert
    • Groundedness drops below 3.5 → email to ML team

On Wednesday at 2pm, the P95 latency alert fires — latency jumped from 3s to 8s. Kai investigates and finds the Azure OpenAI endpoint is throttled because another team's batch job consumed the shared quota. He implements per-application rate limits to prevent recurrence.
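
A common way to implement per-application rate limits like Kai's fix is a token bucket. The class below is a minimal in-process sketch; in production this is usually enforced at a gateway layer (for example, Azure API Management rate-limit policies) rather than in application code:

```python
import time

class TokenBucket:
    """Minimal per-application rate limiter: allows short bursts, enforces a steady rate."""
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec      # refill rate, in requests per second
        self.capacity = burst         # maximum burst size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill tokens based on elapsed time, capped at the burst capacity
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# One bucket per application keeps a batch job from starving the support bot
buckets = {
    "support-bot": TokenBucket(rate_per_sec=10, burst=20),
    "batch-job": TokenBucket(rate_per_sec=2, burst=2),
}
```

Each application checks its own bucket before calling the shared endpoint, so one consumer exhausting its allowance no longer throttles everyone else.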

Application Insights integration

Application Insights is the primary telemetry tool for GenAI tracing in Azure. It captures:

  • Request telemetry — every API call with timing, status code, and size
  • Dependency telemetry — calls to Azure OpenAI, AI Search, and other downstream services
  • Custom events — token counts, model version, prompt template used
  • Exceptions — errors with full stack traces
```python
from opentelemetry import trace
from azure.monitor.opentelemetry import configure_azure_monitor

# Enable Application Insights tracing
configure_azure_monitor(
    connection_string="InstrumentationKey=your-key-here"
)

tracer = trace.get_tracer(__name__)

# Trace a GenAI request
with tracer.start_as_current_span("genai-request") as span:
    span.set_attribute("genai.model", "gpt-4o")
    span.set_attribute("genai.prompt_tokens", 150)
    span.set_attribute("genai.completion_tokens", 320)
    span.set_attribute("genai.total_tokens", 470)

    response = call_model(prompt)  # call_model and prompt are your application's code

    span.set_attribute("genai.finish_reason", "stop")
```

What's happening:

  • Lines 5-7: Configure Application Insights as the OpenTelemetry exporter — all traces flow to your dashboard
  • Line 9: Create a tracer for your application
  • Lines 12-20: Wrap each GenAI call in a span — recording model name, token counts, and finish reason
  • This gives you per-request visibility: which requests are slow, which use the most tokens, which error

Setting up dashboards and alerts

Key dashboard panels

A well-designed GenAI monitoring dashboard includes:

| Panel | Visualisation | Time Range |
|---|---|---|
| Latency distribution | Histogram with P50/P95/P99 lines | Last 1 hour |
| Error rate | Line chart with threshold line | Last 24 hours |
| Throughput | Stacked area (by endpoint) | Last 24 hours |
| Token consumption | Bar chart (input vs output tokens) | Last 7 days |
| Quality scores | Trend line (groundedness, relevance) | Last 7 days |
| Safety flags | Count of flagged responses | Last 24 hours |

Alert best practices

  • Latency: Alert on P95 degradation (not absolute). If P95 increases by 50% compared to the 7-day baseline, fire an alert
  • Error rate: Alert when error rate exceeds 2x the normal baseline
  • Quality: Alert when sampled groundedness drops below the threshold you set during evaluation
  • Token cost: Alert when daily token spend exceeds 120% of the 7-day average
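
Two of these rules can be sketched as plain threshold checks. The helper functions below are hypothetical; the 2x and 120% factors come from the bullets above:

```python
def error_rate_alert(current_rate: float, baseline_rate: float) -> bool:
    """Fire when the error rate exceeds 2x the normal baseline."""
    return current_rate > 2 * baseline_rate

def token_spend_alert(today_spend: float, last_7_days: list) -> bool:
    """Fire when daily token spend exceeds 120% of the 7-day average."""
    avg = sum(last_7_days) / len(last_7_days)
    return today_spend > 1.2 * avg

print(error_rate_alert(0.012, 0.005))         # True: 1.2% is more than 2x a 0.5% baseline
print(token_spend_alert(130.0, [100.0] * 7))  # True: 130 exceeds 120% of the 100 average
```

Both checks compare against a baseline rather than a fixed ceiling, which is the same principle as the P95 degradation rule.
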

💡 Exam tip: Sampling for quality monitoring

Running quality evaluators (groundedness, relevance) on every production request is prohibitively expensive — each evaluation requires an LLM judge call. Production monitoring uses sampling:

  • High-traffic apps: evaluate 1-5% of requests
  • Medium-traffic apps: evaluate 10-20% of requests
  • Low-traffic or high-risk apps: evaluate 100% of requests

The exam may ask about the trade-off between evaluation coverage and cost. The answer: sample, and increase sampling rate for critical or high-risk applications.
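
Request-level sampling is often just a random draw against a per-tier rate. A sketch using the tiers and rates from the text (the tier names and the `should_run_quality_eval` helper are illustrative):

```python
import random

# Illustrative sampling rates per risk tier, mirroring the bands above
SAMPLE_RATES = {
    "high_traffic": 0.05,    # evaluate 1-5% of requests
    "medium_traffic": 0.15,  # evaluate 10-20% of requests
    "high_risk": 1.0,        # evaluate every request
}

def should_run_quality_eval(tier: str) -> bool:
    """Decide per request whether to spend an LLM-judge call on quality evaluation."""
    return random.random() < SAMPLE_RATES[tier]
```

Each request makes an independent draw, so over time the evaluated sample tracks the configured rate without any coordination between requests.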

Key terms flashcards

Question

Why use P95 latency instead of average latency?


Answer

Average latency masks tail latency issues. P95 shows the experience of your slowest 5% of users — the ones most likely to churn. If your P95 is bad, 1 in 20 users is suffering.


Question

What are the four monitoring layers for GenAI?


Answer

Infrastructure (CPU/GPU/memory), Endpoint (latency/throughput/errors), Model Quality (groundedness/relevance sampled from production), and Safety (content flags, jailbreak attempts).


Question

Why sample for quality monitoring instead of evaluating every request?


Answer

Quality evaluation requires an LLM judge call for each request — this doubles cost and latency. Sampling (1-20% of requests) provides quality trends without the expense. Increase sampling for high-risk applications.


Question

What does Application Insights capture for GenAI?


Answer

Request telemetry (timing, status), dependency calls (Azure OpenAI, AI Search), custom events (token counts, model version), and exceptions. Uses OpenTelemetry for distributed tracing across multi-step workflows.


Knowledge check

Knowledge Check

Kai's monitoring dashboard shows: P50 latency = 1.2s, P95 = 4.8s, P99 = 25s, Average = 2.1s. The SLA requires 95% of requests under 5 seconds. Is the system meeting its SLA?

Knowledge Check

Dr. Fatima wants to monitor the quality of Meridian's production chatbot but the system handles 10,000 requests per hour. Running quality evaluators on every request would double her Azure AI costs. What should she do?



Next up: Cost Tracking, Logging & Debugging — because GenAI costs scale with every token.



© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.