Metrics & KQL: Analysing Telemetry & Traces
Analyse infrastructure metrics, application performance, and distributed traces. Write basic KQL queries to interrogate logs in Azure Monitor and Application Insights.
Why metrics analysis is a DevOps superpower
Think of a doctor's check-up.
The doctor measures your blood pressure, heart rate, temperature, and blood oxygen. Each number on its own tells a small story, but together they paint a complete picture of your health. High blood pressure PLUS high heart rate PLUS fever means something very different from high blood pressure alone.
Metrics analysis in DevOps is the same diagnostic process for your applications. CPU usage, memory consumption, response time, and error rate are your application's vital signs. Individually they hint at problems. Together, and with the right query language (KQL), they tell you exactly what is wrong and where.
Infrastructure performance indicators
Understanding what each metric means, and what it signals when it is abnormal, is critical for DevOps engineers. The exam tests your ability to interpret these indicators, not just collect them.
CPU
| Observation | What It Means | Action |
|---|---|---|
| Sustained above 80% | Compute-bound workload, possible scaling need | Scale up (bigger VM) or scale out (more instances) |
| Spikes correlating with deployments | New code may have a performance regression | Profile the application, compare with pre-deployment baseline |
| Low CPU with slow responses | Bottleneck is elsewhere: disk, network, or external dependency | Investigate dependencies, check I/O wait |
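The CPU observations above can be confirmed from guest-level counters. A minimal sketch, assuming the VMs report to a Log Analytics workspace that populates the `Perf` table:

```kusto
// Average CPU per VM in 15-minute buckets over the last 24 hours
Perf
| where TimeGenerated > ago(24h)
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| summarize avgCpu = avg(CounterValue) by bin(TimeGenerated, 15m), Computer
| render timechart
```

A line sustained above 80% on this chart points to the scale-up/scale-out decision in the table.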
Memory
| Observation | What It Means | Action |
|---|---|---|
| Steady growth over time | Memory leak: objects allocated but not released | Profile the application, check for unclosed connections or large caches |
| Sudden spike | Large request, burst of traffic, or loading a big dataset | Check if this correlates with traffic patterns |
| OOM (Out of Memory) kills | Process exceeded container or VM memory limit | Increase memory limit, fix the leak, or optimise memory usage |
Disk
| Observation | What It Means | Action |
|---|---|---|
| High IOPS with slow response | Disk throughput bottleneck | Upgrade to Premium SSD or Ultra Disk, or add caching |
| Disk queue length increasing | More I/O requests than the disk can handle | Scale storage tier or optimise I/O patterns (batching, async) |
| Disk space decreasing | Logs, temp files, or data growth filling the disk | Implement log rotation, add monitoring alerts at 80% |
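The 80% alert suggested in the last row can be expressed as a log query. A sketch, assuming VM insights is enabled and writing to the `InsightsMetrics` table (the 20%-free threshold mirrors the 80%-full guidance):

```kusto
// VMs averaging less than 20% free disk space over the last hour
InsightsMetrics
| where TimeGenerated > ago(1h)
| where Namespace == "LogicalDisk" and Name == "FreeSpacePercentage"
| summarize avgFreePct = avg(Val) by Computer
| where avgFreePct < 20
```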
Network
| Observation | What It Means | Action |
|---|---|---|
| High latency between services | Network congestion, distance between resources, or DNS issues | Co-locate resources, use private endpoints, check NSG rules |
| Packet loss | Network infrastructure issues or misconfigured NSGs | Check network health, review NSG flow logs |
| Bandwidth saturation | Data transfer exceeding the VM or network tier limits | Scale network tier, compress data, or optimise transfer patterns |
Application performance metrics
Application Insights collects four primary performance metrics. Together, they form your application health baseline.
| Metric | What It Measures | Healthy Range |
|---|---|---|
| Server response time | How long your app takes to respond to requests (P50, P95, P99) | Depends on SLA; typically under 500ms for APIs |
| Server request rate | Number of requests per second | Baseline varies; watch for unexpected drops (outage) or spikes (attack) |
| Failed request rate | Percentage of requests returning 4xx/5xx status codes | Under 1% for healthy apps |
| Dependency call duration | How long outgoing calls to databases, APIs, caches take | Under your SLA minus processing time |
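The P50/P95/P99 values from the first row can be computed with KQL's `percentile()` aggregate over the `requests` table, for example:

```kusto
// Response-time percentiles per operation over the last 24 hours
requests
| where timestamp > ago(24h)
| summarize p50 = percentile(duration, 50),
            p95 = percentile(duration, 95),
            p99 = percentile(duration, 99)
  by name
| order by p95 desc
```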
The Four Golden Signals (Google SRE)
The exam may reference Google's Site Reliability Engineering framework. The four golden signals align closely with Application Insights metrics:
| Golden Signal | Definition | App Insights Metric |
|---|---|---|
| Latency | Time to serve a request (distinguish successful vs failed) | Server response time (split by success/failure) |
| Traffic | Volume of requests the system handles | Server request rate |
| Errors | Rate of failed requests | Failed request rate and exception count |
| Saturation | How full the system is (resource utilisation) | CPU, memory, disk metrics via VM/Container Insights |
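As a sketch, three of the four signals (latency, traffic, errors) can be read from the `requests` table in a single query; saturation comes from the platform metrics noted in the last row:

```kusto
// Latency, traffic, and errors in 5-minute buckets over the last hour
requests
| where timestamp > ago(1h)
| summarize traffic = count(),
            errors = countif(success == false),
            avgLatencyMs = avg(duration)
  by bin(timestamp, 5m)
| extend errorRatePct = 100.0 * errors / traffic
| render timechart
```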
Distributed tracing in Application Insights
In microservices architectures, a single user action may span multiple services. Distributed tracing tracks a request as it flows through the entire chain.
How it works
- The first service generates an operation ID, a unique identifier for the end-to-end transaction
- Each subsequent service call propagates the operation ID in HTTP headers (`traceparent` in W3C Trace Context, or `Request-Id` in the legacy format)
- Application Insights correlates all telemetry (requests, dependencies, exceptions) with the same operation ID
- The Transaction diagnostics view shows the complete chain with timing for each hop
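The same correlation can be done by hand in Logs: a `union` across the telemetry tables returns every item that carries one operation ID (the ID value below is a placeholder):

```kusto
// All telemetry for a single end-to-end transaction
union requests, dependencies, exceptions
| where operation_Id == "<operation-id>"   // placeholder: paste a real operation ID
| project timestamp, itemType, name, duration
| order by timestamp asc
```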
Application Map
The Application Map in Application Insights visualises:
- All components (your services, databases, external APIs)
- Call volumes between components (arrow thickness represents traffic)
- Error rates on each component and connection
- Average response times
This is invaluable for identifying which service in a chain is the bottleneck.
End-to-end transaction details
When investigating a slow request:
- Open the Performance blade in Application Insights
- Drill into a slow operation (e.g., `GET /api/orders`)
- Click into a specific slow request to see the end-to-end transaction
- The timeline shows every dependency call, their duration, and whether they succeeded or failed
- You can immediately see: "The request took 3.2 seconds because the SQL dependency took 2.8 seconds"
Scenario: Jordan traces a slow API response
Jordan Rivera's team gets reports that the /api/media/transcode endpoint is slow. The response time P95 jumped from 800ms to 4.5 seconds after yesterday's deployment.
Jordan's investigation using Application Insights:
- Opens the Performance blade, filters to `POST /api/media/transcode`
- Sees P95 jumped from 800ms to 4.5s at 2pm yesterday, exactly when Chen (SRE) deployed version 2.4.1
- Drills into a 4.5s transaction in the end-to-end transaction view
- Timeline shows:
  - `POST /api/media/transcode` → 4.5s total
  - Dependency: `SELECT` on `transcoding_jobs` table → 12ms (fine)
  - Dependency: `POST` to `storage-api/upload` → 4.2s (THE BOTTLENECK)
  - Dependency: `PUT` to `queue/transcode-request` → 45ms (fine)
- The storage-api call is the problem. Jordan checks the storage-api's Application Insights: a new retry policy in v2.4.1 is retrying on every 409 (Conflict) response with exponential backoff.
Fix: Avery (dev lead) updates the retry policy to exclude 409 from retriable status codes. P95 drops back to 850ms.
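A query along these lines (the `storage-api` target name comes from this scenario) would have surfaced the burst of 409 responses behind the retries:

```kusto
// Dependency result codes for the storage-api target around the deployment
dependencies
| where timestamp > ago(48h)
| where target contains "storage-api"
| summarize count() by resultCode, bin(timestamp, 1h)
| render timechart
```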
KQL fundamentals for the AZ-400 exam
Kusto Query Language (KQL) is the query language for Azure Monitor Logs and Application Insights. The exam tests basic KQL; you do not need expert-level queries, but you must know the core operators.
Essential operators
| Operator | Purpose | Example |
|---|---|---|
| where | Filter rows | where resultCode == "500" |
| project | Select specific columns | project timestamp, name, duration |
| summarize | Aggregate data | summarize count() by resultCode |
| extend | Add calculated columns | extend durationSec = duration / 1000 |
| order by | Sort results | order by timestamp desc |
| take / limit | Limit row count | take 100 |
| render | Visualise as chart | render timechart |
| ago | Relative time reference | where timestamp > ago(24h) |
| between | Range filter | where duration between (1000 .. 5000) |
| contains / has | String matching | where name contains "api" |
| join | Combine tables | join kind=inner (dependencies) on operation_Id |
Common Application Insights tables
| Table | Content |
|---|---|
| requests | Incoming HTTP requests |
| dependencies | Outgoing calls (SQL, HTTP, etc.) |
| exceptions | Application exceptions |
| traces | Diagnostic log messages |
| customEvents | Custom business events |
| customMetrics | Custom numeric metrics |
| pageViews | Client-side page load telemetry |
Exam-ready KQL examples
Find the top 10 slowest requests in the last 24 hours:
requests
| where timestamp > ago(24h)
| order by duration desc
| take 10
| project timestamp, name, duration, resultCode
Count requests by status code in the last hour:
requests
| where timestamp > ago(1h)
| summarize count() by resultCode
| order by count_ desc
Average response time per API endpoint, charted over time:
requests
| where timestamp > ago(7d)
| summarize avg(duration) by name, bin(timestamp, 1h)
| render timechart
Find all exceptions related to a specific operation:
exceptions
| where timestamp > ago(24h)
| where operation_Name == "POST /api/orders"
| project timestamp, type, outerMessage, innermostMessage
| order by timestamp desc
Correlate a slow request with its dependencies (join):
requests
| where timestamp > ago(1h) and duration > 3000
| join kind=inner (
dependencies
| where timestamp > ago(1h)
) on operation_Id
| project requestName = name, requestDuration = duration, depTarget = target, depDuration = duration1
| order by depDuration desc
Choosing between the string-matching operators affects both correctness and query performance:
| Operator | Behaviour | Performance | Example Match |
|---|---|---|---|
| contains | Substring match: searches anywhere in the string | Slower (full-text scan) | 'api/users' contains 'user' → true |
| has | Term match: searches for whole terms (word boundaries) | Faster (uses term index) | 'api/users' has 'users' → true; 'api/users' has 'user' → false |
| == (equals) | Exact full-string match, case-sensitive | Fastest | 'api/users' == 'api/users' → true |
| startswith | Prefix match | Fast | 'api/users' startswith 'api/' → true |
| matches regex | Regular expression match | Slowest | 'api/users/123' matches regex 'users/[0-9]+' → true |
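The `contains`/`has` distinction can be checked interactively with a `print` statement:

```kusto
print url = 'api/users'
| extend containsUser = url contains 'user',  // true: substring match
         hasUser      = url has 'user',       // false: 'user' is not a whole term
         hasUsers     = url has 'users'       // true: whole-term match
```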
Scenario: Amira teaches Farah KQL investigation
Farah (junior consultant) needs to investigate why a government client's portal had 50 errors between 2am and 4am. Dr. Amira walks her through the KQL workflow.
Step 1 β Scope the problem:
requests
| where timestamp between (datetime(2026-04-15 02:00) .. datetime(2026-04-15 04:00))
| where toint(resultCode) >= 500
| summarize count() by resultCode, name

Result: 48 of the 50 errors are 503 on GET /api/citizen-portal/status.
Step 2 β Check dependencies for that operation:
dependencies
| where timestamp between (datetime(2026-04-15 02:00) .. datetime(2026-04-15 04:00))
| where operation_Name == "GET /api/citizen-portal/status"
| summarize count() by target, success

Result: The SQL database dependency shows 48 failures.
Step 3 β Root cause:
dependencies
| where timestamp between (datetime(2026-04-15 02:00) .. datetime(2026-04-15 04:00))
| where target contains "citizendb" and success == false
| project timestamp, target, resultCode, data
| take 5

Result: All failures show "Login failed for user 'app-service-principal'". The database credentials expired at 2am.
Fix: Rotate the credential and move to managed identity to prevent recurrence.
Exam tip: KQL patterns to memorise
The exam may present KQL queries and ask what they return, or present a scenario and ask you to choose the correct query. Memorise these patterns:
- Filter + count: `where` + `summarize count() by` → "how many of X grouped by Y"
- Time bucketing: `bin(timestamp, 1h)` → group by time intervals for charting
- Top N: `order by column desc | take N` → find worst/best performers
- Join: `join kind=inner (table2) on operation_Id` → correlate across tables
- Time filter: `ago(24h)` or `between (datetime(...) .. datetime(...))` → scope to a time range
- Visualise: `render timechart`, `render barchart`, `render piechart`
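Several of these patterns chain naturally. For example, finding the endpoints with the most server errors per hour combines the time filter, filter + count, time bucketing, and top-N patterns:

```kusto
requests
| where timestamp > ago(24h)                              // time filter
| where resultCode startswith "5"                         // filter to server errors
| summarize errors = count() by name, bin(timestamp, 1h)  // count + time bucketing
| order by errors desc                                    // worst first
| take 5                                                  // top N
```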
Remember: KQL is pipe-based; data flows left to right through operators separated by `|`.
Knowledge check
Jordan's AKS-hosted API has a P95 response time of 4 seconds, but CPU and memory on the pods are low (under 30%). What should Jordan investigate first?
Which KQL query correctly returns the average response time per API endpoint over the last 24 hours, grouped into 1-hour buckets?
Amira is investigating a distributed transaction where a web API calls three microservices. In Application Insights, what uniquely ties all the telemetry from this single transaction together?
Congratulations! You have completed all 25 modules of the AZ-400 study guide. You now have a comprehensive understanding of designing and implementing Microsoft DevOps solutions, from work item tracking and branching strategies through CI/CD pipelines, security and compliance, to instrumentation and monitoring. Go crush that exam!