Monitoring: Metrics, Logs, and Alerts
Monitor Cosmos DB health with key metrics (NormalizedRUConsumption, TotalRequests, ServerSideLatency), diagnostic logs, Azure Monitor workbooks, and alert rules for proactive issue detection.
Why monitoring matters
Think of monitoring as the dashboard gauges in your car. You need to know when the engine is overheating (throttling), when you're running low on fuel (RU/s budget), and when something unusual happens (error spikes). Without gauges, you're driving blind.
Marcus's monitoring mission
Marcus at FinSecure runs Cosmos DB in three production environments with SOC 2 compliance requirements. He needs to:
- Detect 429 throttling within minutes
- Track query performance trends
- Maintain audit logs for compliance
- Get alerted when latency exceeds SLA thresholds
Key metrics
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| NormalizedRUConsumption | % of provisioned RU/s consumed (0-100%) | Alert at >70% sustained |
| TotalRequests | Number of requests (including 429s) | Alert on sudden spikes |
| TotalRequestUnits | Total RU consumed | Track for cost trends |
| ServerSideLatency | Backend processing time (P50, P99) | Alert at >10ms P99 |
| AvailableStorage | Remaining storage per partition | Alert at >80% used |
| MetadataRequests | Control plane operations | Unusual spikes may indicate config issues |
Exam tip: NormalizedRUConsumption is the key metric
NormalizedRUConsumption is the most important metric for detecting throughput issues. It reports provisioned RU/s utilisation as a percentage, taking the highest value across all partition key ranges. When it hits 100%, requests to the saturated partition key range get throttled (429 errors).
Key details:
- It's measured per physical partition: a hot partition can show 100% while the account average is 30%
- Sustained >70% means you should consider increasing throughput or enabling autoscale
- The exam often presents this metric in scenarios asking "what should you monitor for throttling?"
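To make the hot-partition point concrete, here is a minimal Python sketch using made-up utilisation figures for four hypothetical partition key ranges. It assumes the reported value tracks the busiest partition rather than the mean; the function name and sample numbers are illustrative only, not part of any Azure SDK.

```python
from statistics import mean

def normalized_ru_consumption(partition_utilisation):
    """Return the utilisation (%) of the busiest partition, capped at 100."""
    return min(100.0, max(partition_utilisation.values()))

# Hypothetical per-partition RU/s utilisation (%), not real telemetry.
sample = {"range-0": 100.0, "range-1": 8.0, "range-2": 7.0, "range-3": 5.0}

average = mean(sample.values())
reported = normalized_ru_consumption(sample)
# Partitions at or above the 70% "consider scaling" guideline.
hot = [name for name, pct in sample.items() if pct >= 70.0]

print(f"average={average:.1f}% reported={reported:.1f}% hot={hot}")
```

With these numbers the average is only 30% while the reported value is 100%, which is why an average-based view can hide a throttled partition.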
Diagnostic logging
Enable diagnostic logs to capture detailed operation data:
# Resource-specific mode sends logs to the CDB* tables (such as CDBDataPlaneRequests) used in the KQL queries in this section
az monitor diagnostic-settings create \
  --name "cosmos-diagnostics" \
  --resource "/subscriptions/.../providers/Microsoft.DocumentDB/databaseAccounts/finsecure-cosmos" \
  --workspace "/subscriptions/.../workspaces/finsecure-logs" \
  --export-to-resource-specific true \
  --logs '[
    {"category": "QueryRuntimeStatistics", "enabled": true},
    {"category": "DataPlaneRequests", "enabled": true},
    {"category": "PartitionKeyStatistics", "enabled": true},
    {"category": "PartitionKeyRUConsumption", "enabled": true},
    {"category": "ControlPlaneRequests", "enabled": true}
  ]'
| Log Category | What It Captures |
|---|---|
| DataPlaneRequests | Every read, write, query with RU cost and latency |
| QueryRuntimeStatistics | Query execution details, index utilisation, scan counts |
| PartitionKeyStatistics | Top partition keys by storage and RU consumption |
| PartitionKeyRUConsumption | RU consumption broken down by partition key |
| ControlPlaneRequests | Account-level operations (create, delete, scale) |
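The `--logs` argument is plain JSON, so it can be generated rather than typed by hand. The following Python sketch assumes you keep the category names from the table above in a list; the helper name `build_logs_setting` is hypothetical and not part of the Azure CLI or SDK.

```python
import json

# Category names taken from the table above.
CATEGORIES = [
    "DataPlaneRequests",
    "QueryRuntimeStatistics",
    "PartitionKeyStatistics",
    "PartitionKeyRUConsumption",
    "ControlPlaneRequests",
]

def build_logs_setting(categories, enabled=True):
    """Build the JSON string for `--logs`, one entry per category."""
    return json.dumps([{"category": c, "enabled": enabled} for c in categories])

logs_json = build_logs_setting(CATEGORIES)
print(logs_json)
```

Generating the payload from one list keeps the enabled categories in sync across environments, which may help when the same settings are applied to several accounts.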
KQL query examples
Query your diagnostic logs with Kusto Query Language (KQL) in Log Analytics:
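// Find the most expensive queries in the last hour
// QueryText is logged in CDBQueryRuntimeStatistics, so join it to the request charges on ActivityId
CDBQueryRuntimeStatistics
| where TimeGenerated > ago(1h)
| join kind=inner (
    CDBDataPlaneRequests
    | where TimeGenerated > ago(1h)
    | where OperationName == "Query"
  ) on ActivityId
| summarize AvgRU = avg(RequestCharge), MaxRU = max(RequestCharge),
    Count = count() by QueryText
| order by AvgRU desc
| take 10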
// Detect 429 throttling events
CDBDataPlaneRequests
| where TimeGenerated > ago(24h)
| where StatusCode == 429
| summarize ThrottledCount = count() by bin(TimeGenerated, 5m), CollectionName
| order by TimeGenerated desc
// Identify hot partitions
CDBPartitionKeyRUConsumption
| where TimeGenerated > ago(1h)
| summarize TotalRU = sum(RequestCharge) by PartitionKey
| order by TotalRU desc
| take 10
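The throttling query relies on `bin(TimeGenerated, 5m)` to group events into five-minute buckets. The Python sketch below reproduces that grouping on a handful of hypothetical rows, purely to show what the aggregation does; the row values are invented and the dictionaries only mimic the column names used in the query.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def bin_time(ts, width=timedelta(minutes=5)):
    """Floor a timestamp to the start of its bucket, similar to KQL bin()."""
    return EPOCH + ((ts - EPOCH) // width) * width

def throttled_counts(records):
    """Count 429 responses per (five-minute bucket, collection)."""
    counts = Counter()
    for rec in records:
        if rec["StatusCode"] == 429:
            counts[(bin_time(rec["TimeGenerated"]), rec["CollectionName"])] += 1
    return counts

# Hypothetical rows shaped like the columns used in the query above.
t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    {"TimeGenerated": t0 + timedelta(minutes=1), "StatusCode": 429, "CollectionName": "orders"},
    {"TimeGenerated": t0 + timedelta(minutes=4), "StatusCode": 429, "CollectionName": "orders"},
    {"TimeGenerated": t0 + timedelta(minutes=6), "StatusCode": 429, "CollectionName": "orders"},
    {"TimeGenerated": t0 + timedelta(minutes=7), "StatusCode": 200, "CollectionName": "orders"},
]

counts = throttled_counts(rows)
# Newest bucket first, matching "order by TimeGenerated desc".
for (bucket, collection), n in sorted(counts.items(), reverse=True):
    print(bucket.isoformat(), collection, n)
```

Two of the 429s fall in the 12:00 bucket and one in the 12:05 bucket; the successful request is ignored.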
Azure Monitor workbooks
Azure provides built-in Cosmos DB insights, a set of pre-built dashboards in the Azure portal:
- Overview: Throughput, requests, storage, availability at a glance
- Throughput: NormalizedRUConsumption with partition-level breakdown
- Requests: Status code distribution, latency percentiles
- Storage: Per-partition storage usage, index size
- Failures: 429 rates, timeout rates, error categorisation
Diana's tip: Diana, Marcus's security auditor, uses the ControlPlaneRequests log category to track who made configuration changes, which is required for SOC 2 audit trails.
Alert rules
# Alert when NormalizedRUConsumption exceeds 80% for 5 minutes
# <resource-group> is a placeholder for the resource group that will hold the alert rule
az monitor metrics alert create \
  --name "cosmos-throttle-warning" \
  --resource-group "<resource-group>" \
  --scopes "/subscriptions/.../databaseAccounts/finsecure-cosmos" \
  --condition "avg NormalizedRUConsumption > 80" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --action "/subscriptions/.../actionGroups/ops-team"
| Alert | Condition | Severity |
|---|---|---|
| Throttling warning | NormalizedRUConsumption > 80% for 5min | Warning |
| Throttling critical | NormalizedRUConsumption > 95% for 5min | Critical |
| Latency spike | ServerSideLatency P99 > 20ms for 10min | Warning |
| Storage warning | AvailableStorage < 20% | Warning |
| Error rate | HTTP 5xx count > 10 in 5min | Critical |
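The two throttling rows differ only in threshold, so the logic can be sketched in a few lines of Python. This is a simplified illustration, not how Azure Monitor evaluates rules internally: it assumes one reading per minute, averages the last five, and checks the most severe rule first. The function name and sample readings are hypothetical.

```python
from statistics import mean

# (severity, threshold %) pairs from the table above, most severe first.
RULES = [("Critical", 95.0), ("Warning", 80.0)]

def evaluate(samples, window=5):
    """Return the severity triggered by the last `window` samples, or None."""
    if len(samples) < window:
        return None  # not enough data to fill the window yet
    avg = mean(samples[-window:])
    for severity, threshold in RULES:
        if avg > threshold:
            return severity
    return None

# Hypothetical one-minute NormalizedRUConsumption readings (%).
print(evaluate([60, 85, 90, 88, 84, 86]))  # last five average 86.6
print(evaluate([96, 97, 99, 100, 98]))     # average 98
print(evaluate([90, 40, 50, 45, 55]))      # average 56
```

Averaging over the window means a single one-minute spike does not fire the alert, which matches the "for 5min" wording in the table.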
Monitoring Cosmos DB: DP-420 Module 22
Knowledge Check
Marcus notices NormalizedRUConsumption at 95% sustained for his orders container. What should he do first?
Diana needs an audit trail of who changed the Cosmos DB account configuration for SOC 2 compliance. Which diagnostic log category should she enable?
Marcus wants to find the top 10 most expensive queries in the last hour. Which tool and approach should he use?
Next up: Backup and Restore, covering the choice between periodic and continuous backup, configuring policies, and point-in-time restore.