
AZ-400 Study Guide

Domain 1: Design and Implement Processes and Communications

  • Work Item Tracking: Boards, GitHub & Flow
  • DevOps Metrics: Dashboards That Drive Decisions
  • Collaboration: Wikis, Teams & Release Notes

Domain 2: Design and Implement a Source Control Strategy

  • Branching Strategies: Trunk-Based, Feature & Release
  • Pull Requests: Policies, Protections & Merge Rules
  • Repository Management: LFS, Permissions & Recovery

Domain 3: Design and Implement Build and Release Pipelines

  • Package Management: Feeds, Versioning & Upstream
  • Testing Strategy: Quality Gates & Release Gates
  • Test Implementation: Code Coverage & Pipeline Tests
  • Azure Pipelines: YAML from Scratch
  • GitHub Actions: Workflows from Scratch
  • Pipeline Agents: Self-Hosted, Hybrid & VM Templates
  • Multi-Stage Pipelines: Templates, Variables & Approvals
  • Deployment Strategies: Blue-Green, Canary & Ring
  • Safe Rollouts: Slots, Dependencies & Hotfix Paths
  • Deployment Implementations: Containers, Scripts & Databases
  • Infrastructure as Code: ARM vs Bicep vs Terraform
  • IaC in Practice: Desired State & Deployment Environments
  • Pipeline Maintenance: Health, Migration & Retention

Domain 4: Develop a Security and Compliance Plan

  • Pipeline Identity: Service Principals, Managed IDs & OIDC
  • Authorization & Access: GitHub Roles & Azure DevOps Security
  • Secrets & Secure Pipelines: Key Vault & Workload Federation
  • Security Scanning: GHAS, Defender & Dependabot

Domain 5: Implement an Instrumentation Strategy

  • Monitoring for DevOps: Azure Monitor & App Insights
  • Metrics & KQL: Analysing Telemetry & Traces

Domain 5: Implement an Instrumentation Strategy ⏱ ~12 min read

Metrics & KQL: Analysing Telemetry & Traces

Analyse infrastructure metrics, application performance, and distributed traces. Write basic KQL queries to interrogate logs in Azure Monitor and Application Insights.

Why metrics analysis is a DevOps superpower

β˜• Simple explanation

Think of a doctor’s check-up.

The doctor measures your blood pressure, heart rate, temperature, and blood oxygen. Each number on its own tells a small story, but together they paint a complete picture of your health. High blood pressure PLUS high heart rate PLUS fever means something very different from high blood pressure alone.

Metrics analysis in DevOps is the same diagnostic process for your applications. CPU usage, memory consumption, response time, and error rate are your application’s vital signs. Individually they hint at problems. Together β€” and with the right query language (KQL) β€” they tell you exactly what is wrong and where.

This final module brings together everything from the instrumentation domain. You will learn to read infrastructure metrics, interpret application performance telemetry, trace distributed transactions through microservices, and write KQL queries to investigate issues. The AZ-400 exam expects you to interpret metrics and write basic KQL β€” not expert-level queries, but enough to filter, aggregate, and render data from Log Analytics and Application Insights.

  • Infrastructure metrics β€” CPU, memory, disk, and network: what each tells you
  • Application metrics β€” response time, request rate, failure rate, dependency performance
  • Distributed tracing β€” end-to-end transaction tracking in Application Insights
  • KQL fundamentals β€” the query operators you need for the exam

Infrastructure performance indicators

Understanding what each metric means β€” and what it signals when it is abnormal β€” is critical for DevOps engineers. The exam tests your ability to interpret these indicators, not just collect them.

CPU

| Observation | What It Means | Action |
|---|---|---|
| Sustained above 80% | Compute-bound workload, possible scaling need | Scale up (bigger VM) or scale out (more instances) |
| Spikes correlating with deployments | New code may have a performance regression | Profile the application, compare with pre-deployment baseline |
| Low CPU with slow responses | Bottleneck is elsewhere β€” disk, network, or external dependency | Investigate dependencies, check I/O wait |

Memory

| Observation | What It Means | Action |
|---|---|---|
| Steady growth over time | Memory leak β€” objects allocated but not released | Profile the application, check for unclosed connections or large caches |
| Sudden spike | Large request, burst of traffic, or loading a big dataset | Check if this correlates with traffic patterns |
| OOM (Out of Memory) kills | Process exceeded container or VM memory limit | Increase memory limit, fix the leak, or optimise memory usage |

Disk

| Observation | What It Means | Action |
|---|---|---|
| High IOPS with slow response | Disk throughput bottleneck | Upgrade to Premium SSD or Ultra Disk, or add caching |
| Disk queue length increasing | More I/O requests than the disk can handle | Scale storage tier or optimise I/O patterns (batching, async) |
| Disk space decreasing | Logs, temp files, or data growth filling the disk | Implement log rotation, add monitoring alerts at 80% |

Network

| Observation | What It Means | Action |
|---|---|---|
| High latency between services | Network congestion, distance between resources, or DNS issues | Co-locate resources, use private endpoints, check NSG rules |
| Packet loss | Network infrastructure issues or misconfigured NSGs | Check network health, review NSG flow logs |
| Bandwidth saturation | Data transfer exceeding the VM or network tier limits | Scale network tier, compress data, or optimise transfer patterns |

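
In Azure Monitor Logs, these infrastructure counters typically land in the Perf table (Log Analytics agent) or InsightsMetrics (VM insights). A minimal sketch of trending CPU per machine, assuming the classic Perf schema:

Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| summarize avgCpu = avg(CounterValue) by Computer, bin(TimeGenerated, 5m)
| render timechart

A line sitting near 80% or above here is the β€œscale up or scale out” signal described in the CPU table.
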
Question

A VM has low CPU utilisation but application response times are very slow. What should you investigate?


Answer

The bottleneck is NOT compute. Investigate:

  1. External dependencies β€” are database queries or API calls slow? (Check dependency metrics in App Insights)
  2. Disk I/O β€” is the app waiting on disk reads/writes? (Check IOPS and disk queue length)
  3. Network β€” is there high latency between services? (Check network metrics)
  4. Thread starvation β€” is the app blocked waiting on locks or connections? (Check thread pool metrics)


Application performance metrics

Application Insights collects four primary performance metrics. Together, they form your application health baseline.

| Metric | What It Measures | Healthy Range |
|---|---|---|
| Server response time | How long your app takes to respond to requests (P50, P95, P99) | Depends on SLA β€” typically under 500ms for APIs |
| Server request rate | Number of requests per second | Baseline varies β€” watch for unexpected drops (outage) or spikes (attack) |
| Failed request rate | Percentage of requests returning 4xx/5xx status codes | Under 1% for healthy apps |
| Dependency call duration | How long outgoing calls to databases, APIs, caches take | Under your SLA minus processing time |
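
The failed request rate can also be computed directly in logs β€” a sketch using the success column of the requests table:

requests
| where timestamp > ago(1h)
| summarize total = count(), failed = countif(success == false)
| extend failedRatePercent = round(100.0 * failed / total, 2)

Anything much above 1% here warrants investigation, per the healthy-range guidance above.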

The Four Golden Signals (Google SRE)

The exam may reference Google’s Site Reliability Engineering framework. The four golden signals align closely with Application Insights metrics:

Four Golden Signals vs Application Insights

| Golden Signal | Definition | App Insights Metric |
|---|---|---|
| Latency | Time to serve a request (distinguish successful vs failed) | Server response time (split by success/failure) |
| Traffic | Volume of requests the system handles | Server request rate |
| Errors | Rate of failed requests | Failed request rate and exception count |
| Saturation | How full the system is (resource utilisation) | CPU, memory, disk metrics via VM/Container Insights |

Question

What are the Four Golden Signals from Google SRE, and how do they map to Azure monitoring?


Answer

  1. Latency β†’ Application Insights server response time.
  2. Traffic β†’ Application Insights request rate.
  3. Errors β†’ Application Insights failed requests and exceptions.
  4. Saturation β†’ Azure Monitor metrics for CPU, memory, disk, and network utilisation.

Monitor all four to detect issues before they become outages.


Distributed tracing in Application Insights

In microservices architectures, a single user action may span multiple services. Distributed tracing tracks a request as it flows through the entire chain.

How it works

  1. The first service generates an operation ID β€” a unique identifier for the end-to-end transaction
  2. Each subsequent service call propagates the operation ID in HTTP headers (traceparent in W3C Trace Context, or Request-Id in legacy format)
  3. Application Insights correlates all telemetry (requests, dependencies, exceptions) with the same operation ID
  4. The Transaction diagnostics view shows the complete chain with timing for each hop
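
In log queries, that correlation surfaces as the shared operation_Id column, so you can reassemble a single transaction across tables yourself (the operation ID below is a placeholder, not a real value):

union requests, dependencies, exceptions
| where operation_Id == "<your-operation-id>"
| project timestamp, itemType, name, duration
| order by timestamp asc

Each row is one hop of the transaction, in chronological order.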

Application Map

The Application Map in Application Insights visualises:

  • All components (your services, databases, external APIs)
  • Call volumes between components (arrow thickness represents traffic)
  • Error rates on each component and connection
  • Average response times

This is invaluable for identifying which service in a chain is the bottleneck.

End-to-end transaction details

When investigating a slow request:

  1. Open the Performance blade in Application Insights
  2. Drill into a slow operation (e.g., GET /api/orders)
  3. Click into a specific slow request to see the end-to-end transaction
  4. The timeline shows every dependency call, their duration, and whether they succeeded or failed
  5. You can immediately see: β€œThe request took 3.2 seconds because the SQL dependency took 2.8 seconds”
Scenario: Jordan traces a slow API response

☁️ Jordan Rivera’s team gets reports that the /api/media/transcode endpoint is slow. The response time P95 jumped from 800ms to 4.5 seconds after yesterday’s deployment.

Jordan’s investigation using Application Insights:

  1. Opens the Performance blade, filters to POST /api/media/transcode
  2. Sees P95 jumped from 800ms to 4.5s at 2pm yesterday β€” exactly when Chen (SRE) deployed version 2.4.1
  3. Drills into a 4.5s transaction in the end-to-end transaction view
  4. Timeline shows:
    • POST /api/media/transcode β†’ 4.5s total
    • Dependency: SELECT on transcoding_jobs table β†’ 12ms (fine)
    • Dependency: POST to storage-api/upload β†’ 4.2s (THE BOTTLENECK)
    • Dependency: PUT to queue/transcode-request β†’ 45ms (fine)
  5. The storage-api call is the problem. Checks the storage-api’s Application Insights β€” a new retry policy in v2.4.1 is retrying on every 409 (conflict) response with exponential backoff.

Fix: Avery (dev lead) updates the retry policy to exclude 409 from retriable status codes. P95 drops back to 850ms.
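
A query along the lines Jordan might use to confirm the retry storm β€” assuming the failing calls are recorded as dependency telemetry with a 409 result code:

dependencies
| where timestamp > ago(1d)
| where target contains "storage-api" and resultCode == "409"
| summarize count() by bin(timestamp, 1h)
| render timechart

A spike beginning at the 2pm deployment would confirm the regression window.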

Question

What does the Application Map in Application Insights show you?


Answer

The Application Map visualises all components in your architecture (services, databases, external APIs) with arrows showing call relationships. Each component shows its error rate and average response time. Arrow thickness represents traffic volume. It helps you identify which component in a distributed system is the bottleneck, experiencing errors, or receiving unusual traffic.


KQL fundamentals for the AZ-400 exam

Kusto Query Language (KQL) is the query language for Azure Monitor Logs and Application Insights. The exam tests basic KQL β€” you do not need expert-level queries, but you must know the core operators.

Essential operators

| Operator | Purpose | Example |
|---|---|---|
| where | Filter rows | where resultCode == "500" |
| project | Select specific columns | project timestamp, name, duration |
| summarize | Aggregate data | summarize count() by resultCode |
| extend | Add calculated columns | extend durationSec = duration / 1000 |
| order by | Sort results | order by timestamp desc |
| take / limit | Limit row count | take 100 |
| render | Visualise as chart | render timechart |
| ago | Relative time reference | where timestamp > ago(24h) |
| between | Range filter | where duration between (1000 .. 5000) |
| contains / has | String matching | where name contains "api" |
| join | Combine tables | join kind=inner (dependencies) on operation_Id |

Common Application Insights tables

| Table | Content |
|---|---|
| requests | Incoming HTTP requests |
| dependencies | Outgoing calls (SQL, HTTP, etc.) |
| exceptions | Application exceptions |
| traces | Diagnostic log messages |
| customEvents | Custom business events |
| customMetrics | Custom numeric metrics |
| pageViews | Client-side page load telemetry |

Exam-ready KQL examples

Find the top 10 slowest requests in the last 24 hours:

requests
| where timestamp > ago(24h)
| order by duration desc
| take 10
| project timestamp, name, duration, resultCode

Count requests by status code in the last hour:

requests
| where timestamp > ago(1h)
| summarize count() by resultCode
| order by count_ desc

Average response time per API endpoint, charted over time:

requests
| where timestamp > ago(7d)
| summarize avg(duration) by name, bin(timestamp, 1h)
| render timechart

Find all exceptions related to a specific operation:

exceptions
| where timestamp > ago(24h)
| where operation_Name == "POST /api/orders"
| project timestamp, type, outerMessage, innermostMessage
| order by timestamp desc

Correlate a slow request with its dependencies (join):

requests
| where timestamp > ago(1h) and duration > 3000
| join kind=inner (
    dependencies
    | where timestamp > ago(1h)
  ) on operation_Id
| project requestName = name, requestDuration = duration, depTarget = target, depDuration = duration1
| order by depDuration desc
Question

Write a KQL query to find the count of failed requests (5xx) per hour over the last 7 days.


Answer

requests
| where timestamp > ago(7d)
| where toint(resultCode) >= 500
| summarize failedCount = count() by bin(timestamp, 1h)
| render timechart

Key operators: where to filter, toint() to convert the resultCode string to a number, summarize count() by bin() to bucket by time, and render timechart to visualise.


KQL String Operators: contains vs has

| Operator | Behaviour | Performance | Example Match |
|---|---|---|---|
| contains | Substring match β€” searches anywhere in the string | Slower (full text scan) | 'api/users' contains 'user' β†’ true |
| has | Term match β€” searches for whole terms (word boundaries) | Faster (uses term index) | 'api/users' has 'users' β†’ true; 'api/users' has 'user' β†’ false |
| == (equals) | Exact full-string match, case-sensitive | Fastest | 'api/users' == 'api/users' β†’ true |
| startswith | Prefix match | Fast | 'api/users' startswith 'api/' β†’ true |
| matches regex | Regular expression match | Slowest | 'api/users/123' matches regex 'users/[0-9]+' β†’ true |

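
You can try these operators without touching real telemetry by using print to fabricate a single row β€” a quick sketch:

print url = "api/users"
| extend c = url contains "user", hUser = url has "user", hUsers = url has "users"

The result shows c and hUsers as true but hUser as false, because has only matches whole terms.
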
Scenario: Amira teaches Farah KQL investigation

πŸ›οΈ Farah (junior consultant) needs to investigate why a government client’s portal had 50 errors between 2am and 4am. Dr. Amira walks her through the KQL workflow.

Step 1 β€” Scope the problem:

requests
| where timestamp between (datetime(2026-04-15 02:00) .. datetime(2026-04-15 04:00))
| where toint(resultCode) >= 500
| summarize count() by resultCode, name

Result: 48 of the 50 errors are 503 on GET /api/citizen-portal/status.

Step 2 β€” Check dependencies for that operation:

dependencies
| where timestamp between (datetime(2026-04-15 02:00) .. datetime(2026-04-15 04:00))
| where operation_Name == "GET /api/citizen-portal/status"
| summarize count() by target, success

Result: The SQL database dependency shows 48 failures.

Step 3 β€” Root cause:

dependencies
| where timestamp between (datetime(2026-04-15 02:00) .. datetime(2026-04-15 04:00))
| where target contains "citizendb" and success == false
| project timestamp, target, resultCode, data
| take 5

Result: All failures show "Login failed for user 'app-service-principal'". The database credentials expired at 2am.

Fix: Rotate the credential and move to managed identity to prevent recurrence.

πŸ’‘ Exam tip: KQL patterns to memorise

The exam may present KQL queries and ask what they return, or present a scenario and ask you to choose the correct query. Memorise these patterns:

  • Filter + count: where + summarize count() by β€” β€œhow many of X grouped by Y”
  • Time bucketing: bin(timestamp, 1h) β€” group by time intervals for charting
  • Top N: order by column desc | take N β€” find worst/best performers
  • Join: join kind=inner (table2) on operation_Id β€” correlate across tables
  • Time filter: ago(24h) or between (datetime(...) .. datetime(...)) β€” scope to time range
  • Visualise: render timechart, render barchart, render piechart

Remember: KQL is pipe-based β€” data flows left to right through operators separated by |.

Knowledge check

  • Jordan’s AKS-hosted API has a P95 response time of 4 seconds, but CPU and memory on the pods are low (under 30%). What should Jordan investigate first?
  • Which KQL query correctly returns the average response time per API endpoint over the last 24 hours, grouped into 1-hour buckets?
  • Amira is investigating a distributed transaction where a web API calls three microservices. In Application Insights, what uniquely ties all the telemetry from this single transaction together?


Congratulations! πŸŽ“ You have completed all 25 modules of the AZ-400 study guide. You now have a comprehensive understanding of designing and implementing Microsoft DevOps solutions β€” from work item tracking and branching strategies through CI/CD pipelines, security and compliance, to instrumentation and monitoring. Go crush that exam!


© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.