Metrics & KQL: Analysing Telemetry & Traces
Analyse infrastructure metrics, application performance, and distributed traces. Write basic KQL queries to interrogate logs in Azure Monitor and Application Insights.
Why metrics analysis is a DevOps superpower
Think of a doctor's check-up.
The doctor measures your blood pressure, heart rate, temperature, and blood oxygen. Each number on its own tells a small story, but together they paint a complete picture of your health. High blood pressure PLUS high heart rate PLUS fever means something very different from high blood pressure alone.
Metrics analysis in DevOps is the same diagnostic process for your applications. CPU usage, memory consumption, response time, and error rate are your application's vital signs. Individually they hint at problems. Together, and with the right query language (KQL), they tell you exactly what is wrong and where.
Infrastructure performance indicators
Understanding what each metric means, and what it signals when it is abnormal, is critical for DevOps engineers. The exam tests your ability to interpret these indicators, not just collect them.
CPU
| Observation | What It Means | Action |
|---|---|---|
| Sustained above 80% | Compute-bound workload, possible scaling need | Scale up (bigger VM) or scale out (more instances) |
| Spikes correlating with deployments | New code may have a performance regression | Profile the application, compare with pre-deployment baseline |
| Low CPU with slow responses | Bottleneck is elsewhere: disk, network, or external dependency | Investigate dependencies, check I/O wait |
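The CPU observations above can be confirmed from guest-level counters. A minimal sketch, assuming the VMs report to a Log Analytics workspace that populates the `Perf` table:

```kusto
// Average CPU per VM in 15-minute buckets over the last 24 hours
Perf
| where TimeGenerated > ago(24h)
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| summarize avgCpu = avg(CounterValue) by bin(TimeGenerated, 15m), Computer
| render timechart
```

A line sustained above 80% on this chart points to the scale-up/scale-out decision in the table.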
Memory
| Observation | What It Means | Action |
|---|---|---|
| Steady growth over time | Memory leak: objects allocated but not released | Profile the application, check for unclosed connections or large caches |
| Sudden spike | Large request, burst of traffic, or loading a big dataset | Check if this correlates with traffic patterns |
| OOM (Out of Memory) kills | Process exceeded container or VM memory limit | Increase memory limit, fix the leak, or optimise memory usage |
Disk
| Observation | What It Means | Action |
|---|---|---|
| High IOPS with slow response | Disk throughput bottleneck | Upgrade to Premium SSD or Ultra Disk, or add caching |
| Disk queue length increasing | More I/O requests than the disk can handle | Scale storage tier or optimise I/O patterns (batching, async) |
| Disk space decreasing | Logs, temp files, or data growth filling the disk | Implement log rotation, add monitoring alerts at 80% |
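The 80% alert suggested in the last row can be expressed as a log query. A sketch, assuming VM insights is enabled and writing to the `InsightsMetrics` table (the 20%-free threshold mirrors the 80%-full guidance):

```kusto
// VMs averaging less than 20% free disk space over the last hour
InsightsMetrics
| where TimeGenerated > ago(1h)
| where Namespace == "LogicalDisk" and Name == "FreeSpacePercentage"
| summarize avgFreePct = avg(Val) by Computer
| where avgFreePct < 20
```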
Network
| Observation | What It Means | Action |
|---|---|---|
| High latency between services | Network congestion, distance between resources, or DNS issues | Co-locate resources, use private endpoints, check NSG rules |
| Packet loss | Network infrastructure issues or misconfigured NSGs | Check network health, review NSG flow logs |
| Bandwidth saturation | Data transfer exceeding the VM or network tier limits | Scale network tier, compress data, or optimise transfer patterns |
Application performance metrics
Application Insights collects four primary performance metrics. Together, they form your application health baseline.
| Metric | What It Measures | Healthy Range |
|---|---|---|
| Server response time | How long your app takes to respond to requests (P50, P95, P99) | Depends on SLA; typically under 500ms for APIs |
| Server request rate | Number of requests per second | Baseline varies; watch for unexpected drops (outage) or spikes (attack) |
| Failed request rate | Percentage of requests returning 4xx/5xx status codes | Under 1% for healthy apps |
| Dependency call duration | How long outgoing calls to databases, APIs, caches take | Under your SLA minus processing time |
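The P50/P95/P99 values from the first row can be computed with KQL's `percentile()` aggregate over the `requests` table, for example:

```kusto
// Response-time percentiles per operation over the last 24 hours
requests
| where timestamp > ago(24h)
| summarize p50 = percentile(duration, 50),
            p95 = percentile(duration, 95),
            p99 = percentile(duration, 99)
  by name
| order by p95 desc
```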
The Four Golden Signals (Google SRE)
The exam may reference Google's Site Reliability Engineering framework. The four golden signals align closely with Application Insights metrics:
| Golden Signal | Definition | App Insights Metric |
|---|---|---|
| Latency | Time to serve a request (distinguish successful vs failed) | Server response time (split by success/failure) |
| Traffic | Volume of requests the system handles | Server request rate |
| Errors | Rate of failed requests | Failed request rate and exception count |
| Saturation | How full the system is (resource utilisation) | CPU, memory, disk metrics via VM/Container Insights |
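As a sketch, three of the four signals (latency, traffic, errors) can be read from the `requests` table in a single query; saturation comes from the platform metrics noted in the last row:

```kusto
// Latency, traffic, and errors in 5-minute buckets over the last hour
requests
| where timestamp > ago(1h)
| summarize traffic = count(),
            errors = countif(success == false),
            avgLatencyMs = avg(duration)
  by bin(timestamp, 5m)
| extend errorRatePct = 100.0 * errors / traffic
| render timechart
```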
Distributed tracing in Application Insights
In microservices architectures, a single user action may span multiple services. Distributed tracing tracks a request as it flows through the entire chain.
How it works
- The first service generates an operation ID, a unique identifier for the end-to-end transaction
- Each subsequent service call propagates the operation ID in HTTP headers (`traceparent` in W3C Trace Context, or `Request-Id` in the legacy format)
- Application Insights correlates all telemetry (requests, dependencies, exceptions) with the same operation ID
- The Transaction diagnostics view shows the complete chain with timing for each hop
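The same correlation can be done by hand in Logs: a `union` across the telemetry tables returns every item that carries one operation ID (the ID value below is a placeholder):

```kusto
// All telemetry for a single end-to-end transaction
union requests, dependencies, exceptions
| where operation_Id == "<operation-id>"   // placeholder: paste a real operation ID
| project timestamp, itemType, name, duration
| order by timestamp asc
```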
Application Map
The Application Map in Application Insights visualises:
- All components (your services, databases, external APIs)
- Call volumes between components (arrow thickness represents traffic)
- Error rates on each component and connection
- Average response times
This is invaluable for identifying which service in a chain is the bottleneck.
End-to-end transaction details
When investigating a slow request:
- Open the Performance blade in Application Insights
- Drill into a slow operation (e.g., `GET /api/orders`)
- Click into a specific slow request to see the end-to-end transaction
- The timeline shows every dependency call, their duration, and whether they succeeded or failed
- You can immediately see: "The request took 3.2 seconds because the SQL dependency took 2.8 seconds"
Scenario: Jordan traces a slow API response
Jordan Rivera's team gets reports that the /api/media/transcode endpoint is slow. The response time P95 jumped from 800ms to 4.5 seconds after yesterday's deployment.
Jordan's investigation using Application Insights:
- Opens the Performance blade, filters to `POST /api/media/transcode`
- Sees P95 jumped from 800ms to 4.5s at 2pm yesterday, exactly when Chen (SRE) deployed version 2.4.1
- Drills into a 4.5s transaction in the end-to-end transaction view
- Timeline shows:
  - `POST /api/media/transcode` → 4.5s total
  - Dependency: `SELECT` on `transcoding_jobs` table → 12ms (fine)
  - Dependency: `POST` to `storage-api/upload` → 4.2s (THE BOTTLENECK)
  - Dependency: `PUT` to `queue/transcode-request` → 45ms (fine)
- The storage-api call is the problem. Jordan checks the storage-api's Application Insights: a new retry policy in v2.4.1 is retrying on every 409 (Conflict) response with exponential backoff.
Fix: Avery (dev lead) updates the retry policy to exclude 409 from retriable status codes. P95 drops back to 850ms.
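A query along these lines (the `storage-api` target name comes from this scenario) would have surfaced the burst of 409 responses behind the retries:

```kusto
// Dependency result codes for the storage-api target around the deployment
dependencies
| where timestamp > ago(48h)
| where target contains "storage-api"
| summarize count() by resultCode, bin(timestamp, 1h)
| render timechart
```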
KQL fundamentals for the AZ-400 exam
Kusto Query Language (KQL) is the query language for Azure Monitor Logs and Application Insights. The exam tests basic KQL; you do not need expert-level queries, but you must know the core operators.
Essential operators
| Operator | Purpose | Example |
|---|---|---|
| where | Filter rows | where resultCode == "500" |
| project | Select specific columns | project timestamp, name, duration |
| summarize | Aggregate data | summarize count() by resultCode |
| extend | Add calculated columns | extend durationSec = duration / 1000 |
| order by | Sort results | order by timestamp desc |
| take / limit | Limit row count | take 100 |
| render | Visualise as chart | render timechart |
| ago | Relative time reference | where timestamp > ago(24h) |
| between | Range filter | where duration between (1000 .. 5000) |
| contains / has | String matching | where name contains "api" |
| join | Combine tables | join kind=inner (dependencies) on operation_Id |
Common Application Insights tables
| Table | Content |
|---|---|
| requests | Incoming HTTP requests |
| dependencies | Outgoing calls (SQL, HTTP, etc.) |
| exceptions | Application exceptions |
| traces | Diagnostic log messages |
| customEvents | Custom business events |
| customMetrics | Custom numeric metrics |
| pageViews | Client-side page load telemetry |
Exam-ready KQL examples
Find the top 10 slowest requests in the last 24 hours:
requests
| where timestamp > ago(24h)
| order by duration desc
| take 10
| project timestamp, name, duration, resultCode
Count requests by status code in the last hour:
requests
| where timestamp > ago(1h)
| summarize count() by resultCode
| order by count_ desc
Average response time per API endpoint, charted over time:
requests
| where timestamp > ago(7d)
| summarize avg(duration) by name, bin(timestamp, 1h)
| render timechart
Find all exceptions related to a specific operation:
exceptions
| where timestamp > ago(24h)
| where operation_Name == "POST /api/orders"
| project timestamp, type, outerMessage, innermostMessage
| order by timestamp desc
Correlate a slow request with its dependencies (join):
requests
| where timestamp > ago(1h) and duration > 3000
| join kind=inner (
dependencies
| where timestamp > ago(1h)
) on operation_Id
| project requestName = name, requestDuration = duration, depTarget = target, depDuration = duration1
| order by depDuration desc
Choosing between the string-matching operators affects both correctness and query performance:
| Operator | Behaviour | Performance | Example Match |
|---|---|---|---|
| contains | Substring match: searches anywhere in the string | Slower (full-text scan) | 'api/users' contains 'user' → true |
| has | Term match: searches for whole terms (word boundaries) | Faster (uses term index) | 'api/users' has 'users' → true; 'api/users' has 'user' → false |
| == (equals) | Exact full-string match, case-sensitive | Fastest | 'api/users' == 'api/users' → true |
| startswith | Prefix match | Fast | 'api/users' startswith 'api/' → true |
| matches regex | Regular expression match | Slowest | 'api/users/123' matches regex 'users/[0-9]+' → true |
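The `contains`/`has` distinction can be checked interactively with a `print` statement:

```kusto
print url = 'api/users'
| extend containsUser = url contains 'user',  // true: substring match
         hasUser      = url has 'user',       // false: 'user' is not a whole term
         hasUsers     = url has 'users'       // true: whole-term match
```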
Scenario: Amira teaches Farah KQL investigation
Farah (junior consultant) needs to investigate why a government client's portal had 50 errors between 2am and 4am. Dr. Amira walks her through the KQL workflow.
Step 1 β Scope the problem:
requests
| where timestamp between (datetime(2026-04-15 02:00) .. datetime(2026-04-15 04:00))
| where toint(resultCode) >= 500
| summarize count() by resultCode, name

Result: 48 of the 50 errors are 503 on GET /api/citizen-portal/status.
Step 2 β Check dependencies for that operation:
dependencies
| where timestamp between (datetime(2026-04-15 02:00) .. datetime(2026-04-15 04:00))
| where operation_Name == "GET /api/citizen-portal/status"
| summarize count() by target, success

Result: The SQL database dependency shows 48 failures.
Step 3 β Root cause:
dependencies
| where timestamp between (datetime(2026-04-15 02:00) .. datetime(2026-04-15 04:00))
| where target contains "citizendb" and success == false
| project timestamp, target, resultCode, data
| take 5

Result: All failures show "Login failed for user 'app-service-principal'". The database credentials expired at 2am.
Fix: Rotate the credential and move to managed identity to prevent recurrence.
Exam tip: KQL patterns to memorise
The exam may present KQL queries and ask what they return, or present a scenario and ask you to choose the correct query. Memorise these patterns:
- Filter + count: `where` + `summarize count() by` → "how many of X grouped by Y"
- Time bucketing: `bin(timestamp, 1h)` → group by time intervals for charting
- Top N: `order by column desc | take N` → find worst/best performers
- Join: `join kind=inner (table2) on operation_Id` → correlate across tables
- Time filter: `ago(24h)` or `between (datetime(...) .. datetime(...))` → scope to a time range
- Visualise: `render timechart`, `render barchart`, `render piechart`
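Several of these patterns chain naturally. For example, finding the endpoints with the most server errors per hour combines the time filter, filter + count, time bucketing, and top-N patterns:

```kusto
requests
| where timestamp > ago(24h)                              // time filter
| where resultCode startswith "5"                         // filter to server errors
| summarize errors = count() by name, bin(timestamp, 1h)  // count + time bucketing
| order by errors desc                                    // worst first
| take 5                                                  // top N
```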
Remember: KQL is pipe-based; data flows left to right through operators separated by `|`.
Knowledge check
Jordan's AKS-hosted API has a P95 response time of 4 seconds, but CPU and memory on the pods are low (under 30%). What should Jordan investigate first?
Which KQL query correctly returns the average response time per API endpoint over the last 24 hours, grouped into 1-hour buckets?
Amira is investigating a distributed transaction where a web API calls three microservices. In Application Insights, what uniquely ties all the telemetry from this single transaction together?
Congratulations! You have completed all 25 modules of the AZ-400 study guide. You now have a comprehensive understanding of designing and implementing Microsoft DevOps solutions, from work item tracking and branching strategies through CI/CD pipelines, security and compliance, to instrumentation and monitoring. Go crush that exam!