Monitoring: Metrics, Logs, and Alerts

Why monitoring matters

Simple explanation

Think of monitoring as the dashboard gauges in your car. You need to know when the engine is overheating (throttling), when you’re running low on fuel (RU/s budget), and when something unusual happens (error spikes). Without gauges, you’re driving blind.

Marcus’s monitoring mission

⚙️ Marcus at FinSecure runs Cosmos DB in three production environments with SOC 2 compliance requirements. He needs to:

Detect 429 throttling within minutes
Track query performance trends
Maintain audit logs for compliance
Get alerted when latency exceeds SLA thresholds

Key metrics

Metric	What It Tells You	Alert Threshold
NormalizedRUConsumption	% of provisioned RU/s consumed (0-100%)	Alert at >70% sustained
TotalRequests	Number of requests (including 429s)	Alert on sudden spikes
TotalRequestUnits	Total RU consumed	Track for cost trends
ServerSideLatency	Backend processing time (P50, P99)	Alert at >10ms P99
AvailableStorage	Remaining storage per partition	Alert at >80% used
MetadataRequests	Control plane operations	Unusual spikes may indicate config issues

Exam tip: NormalizedRUConsumption is the key metric

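NormalizedRUConsumption is the most important metric for detecting throughput issues. It reports, as a percentage, how much of the provisioned RU/s is being consumed, taking the highest value across all partition key ranges. When it reaches 100%, at least one partition key range has used its full budget, and further requests to that range within the same second are throttled (429 errors).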

Key details:

It’s per-physical-partition — a hot partition can show 100% while the account average is 30%
Sustained >70% means you should consider increasing throughput or enabling autoscale
The exam often presents this metric in scenarios asking “what should you monitor for throttling?”
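
To make the hot-partition point concrete, here is a minimal Python sketch of the arithmetic. It assumes provisioned throughput is split evenly across physical partitions and that the metric reflects the busiest one; the function name and the sample numbers are illustrative and do not come from Azure.

```python
def normalized_ru_consumption(provisioned_ru, consumed_per_partition):
    """Return (hottest partition %, account-wide %) for one second of traffic.

    Assumption: each physical partition gets an equal share of provisioned RU/s.
    """
    per_partition_budget = provisioned_ru / len(consumed_per_partition)
    # Utilisation of each partition against its own share, capped at 100%.
    utilisation = [
        min(100.0, 100.0 * consumed / per_partition_budget)
        for consumed in consumed_per_partition
    ]
    # Utilisation of the whole account against the total provisioned RU/s.
    account_wide = 100.0 * sum(consumed_per_partition) / provisioned_ru
    return max(utilisation), account_wide


# Hypothetical container: 10,000 RU/s over 5 physical partitions (2,000 each).
hot, average = normalized_ru_consumption(10_000, [2000, 250, 250, 250, 250])
print(f"hottest partition: {hot:.0f}%, account-wide: {average:.0f}%")
```

With these numbers the busiest partition is at 100% while the account as a whole uses only 30% of its budget, which is why raising throughput alone may not fix a skewed partition key.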

Diagnostic logging

Enable diagnostic logs to capture detailed operation data:

# Resource-specific mode writes to the CDB* tables (for example CDBDataPlaneRequests) used in the KQL examples
az monitor diagnostic-settings create \
  --name "cosmos-diagnostics" \
  --resource "/subscriptions/.../providers/Microsoft.DocumentDB/databaseAccounts/finsecure-cosmos" \
  --workspace "/subscriptions/.../workspaces/finsecure-logs" \
  --export-to-resource-specific true \
  --logs '[
    {"category": "QueryRuntimeStatistics", "enabled": true},
    {"category": "DataPlaneRequests", "enabled": true},
    {"category": "PartitionKeyStatistics", "enabled": true},
    {"category": "PartitionKeyRUConsumption", "enabled": true},
    {"category": "ControlPlaneRequests", "enabled": true}
  ]'

Log Category	What It Captures
DataPlaneRequests	Every read, write, query with RU cost and latency
QueryRuntimeStatistics	Query execution details, index utilisation, scan counts
PartitionKeyStatistics	Estimated storage size of the largest logical partition keys (storage skew)
PartitionKeyRUConsumption	RU consumption broken down by partition key
ControlPlaneRequests	Account-level operations (create, delete, scale)

KQL query examples

Query your diagnostic logs with Kusto Query Language (KQL) in Log Analytics:

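// Find the most expensive queries in the last hour
// QueryText is logged in CDBQueryRuntimeStatistics; RequestCharge is in CDBDataPlaneRequests
CDBQueryRuntimeStatistics
| where TimeGenerated > ago(1h)
| join kind=inner (
    CDBDataPlaneRequests
    | where TimeGenerated > ago(1h)
    | where OperationName == "Query"
  ) on ActivityId
| summarize AvgRU = avg(RequestCharge), MaxRU = max(RequestCharge),
            Count = count() by tostring(QueryText)
| order by AvgRU desc
| take 10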

// Detect 429 throttling events
CDBDataPlaneRequests
| where TimeGenerated > ago(24h)
| where StatusCode == 429
| summarize ThrottledCount = count() by bin(TimeGenerated, 5m), CollectionName
| order by TimeGenerated desc

// Identify hot partitions
CDBPartitionKeyRUConsumption
| where TimeGenerated > ago(1h)
| summarize TotalRU = sum(RequestCharge) by PartitionKey
| order by TotalRU desc
| take 10
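
A ranked list of totals is easier to judge when each key is also shown as a share of all RU consumed. The following is a rough Python sketch of that calculation on exported rows; the function name, the `(partition_key, request_charge)` row shape, and the tenant names are assumptions made for illustration.

```python
from collections import defaultdict


def top_partition_keys(rows, limit=10):
    """Rank partition keys by total RU and attach each key's share of the total.

    rows: iterable of (partition_key, request_charge) pairs (hypothetical export).
    """
    totals = defaultdict(float)
    for partition_key, request_charge in rows:
        totals[partition_key] += request_charge
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)[:limit]
    return [(key, ru, 100.0 * ru / grand_total) for key, ru in ranked]


rows = [("tenant-a", 600.0), ("tenant-b", 150.0), ("tenant-a", 200.0), ("tenant-c", 50.0)]
for key, ru, share in top_partition_keys(rows):
    print(f"{key}: {ru:.0f} RU ({share:.0f}% of total)")
```

In this made-up sample, one key accounts for 80% of the RU consumed, which is the kind of skew that points to a hot partition.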

Azure Monitor workbooks

Azure provides built-in Cosmos DB insights — pre-built dashboards in the Azure portal:

Overview: Throughput, requests, storage, availability at a glance
Throughput: NormalizedRUConsumption with partition-level breakdown
Requests: Status code distribution, latency percentiles
Storage: Per-partition storage usage, index size
Failures: 429 rates, timeout rates, error categorisation

Diana’s tip: 🔍 Diana, Marcus’s security auditor, uses the ControlPlaneRequests log category to track who made configuration changes — required for SOC 2 audit trails.

Alert rules

# Alert when average NormalizedRUConsumption exceeds 80% over a 5-minute window
# "<resource-group>" is a placeholder; the command requires the alert rule's resource group
az monitor metrics alert create \
  --name "cosmos-throttle-warning" \
  --resource-group "<resource-group>" \
  --scopes "/subscriptions/.../databaseAccounts/finsecure-cosmos" \
  --condition "avg NormalizedRUConsumption > 80" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --action "/subscriptions/.../actionGroups/ops-team"

Alert	Condition	Severity
Throttling warning	NormalizedRUConsumption > 80% for 5min	Warning
Throttling critical	NormalizedRUConsumption > 95% for 5min	Critical
Latency spike	ServerSideLatency P99 > 20ms for 10min	Warning
Storage warning	AvailableStorage < 20%	Warning
Error rate	HTTP 5xx count > 10 in 5min	Critical
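
The two throttling rows differ only in their threshold. As a minimal sketch of how a windowed average maps to a severity, assuming one NormalizedRUConsumption sample per minute and an average taken over the window (the function and constant names are illustrative, not part of Azure Monitor):

```python
from statistics import mean

WARNING_PCT = 80   # throttling warning threshold from the table
CRITICAL_PCT = 95  # throttling critical threshold from the table


def classify_window(samples):
    """Classify a window of per-minute NormalizedRUConsumption samples (in %)."""
    average = mean(samples)
    if average > CRITICAL_PCT:
        return "Critical"
    if average > WARNING_PCT:
        return "Warning"
    return "OK"


print(classify_window([70, 75, 85, 90, 95]))    # window average is 83
print(classify_window([96, 97, 98, 99, 100]))   # window average is 98
print(classify_window([50, 60, 100, 40, 30]))   # window average is 56
```

The third window contains a single 100% sample but does not fire, because averaging over the window filters out brief spikes. That is the trade-off of alerting on a sustained average rather than a maximum.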

Flashcards

Question

What is NormalizedRUConsumption and why is it the most important metric?

Answer

It shows the percentage (0-100%) of provisioned RU/s consumed, reported as the highest value across partition key ranges. At 100%, requests to the saturated range are throttled (429 errors). Because it is measured per physical partition, a hot partition can be at 100% while the overall average is much lower. Alert at >70% sustained.

Question

Which diagnostic log category captures query execution details?

Answer

QueryRuntimeStatistics — it logs query execution details including index utilisation, scan counts, and query text. DataPlaneRequests captures every operation's RU cost and latency. Both are needed for comprehensive query performance monitoring.

Question

How do you identify hot partitions in Cosmos DB?

Answer

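Use the PartitionKeyRUConsumption diagnostic log and query it with KQL to find the partition keys consuming the most RU/s. The PartitionKeyStatistics category complements it by showing which logical partition keys hold the most storage. In Azure Monitor, NormalizedRUConsumption can be broken down per partition key range to confirm which physical partition is hot.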

Knowledge Check

Marcus notices NormalizedRUConsumption at 95% sustained for his orders container. What should he do first?

Knowledge Check

Diana needs an audit trail of who changed the Cosmos DB account configuration for SOC 2 compliance. Which diagnostic log category should she enable?

Knowledge Check

Marcus wants to find the top 10 most expensive queries in the last hour. Which tool and approach should he use?

Next up: Backup and Restore — choosing between periodic and continuous backup, configuring policies, and point-in-time restore.