Performance Baselines and Monitoring Tools
Prepare operational performance baselines, identify metric sources, interpret performance data, and configure activity monitoring for Azure SQL.
Building a performance baseline
A performance baseline is like knowing your resting heart rate.
If your normal heart rate is 70 BPM, then 120 BPM means something is wrong. But if you donβt know your normal, you canβt tell. A database baseline captures βnormalβ β CPU usage, query times, I/O rates, wait stats β so you can spot problems when things change.
Creating a baseline
Tomas builds his first baseline for CloudFirstβs client database:
What to capture
| Metric Category | Key Metrics | Why It Matters |
|---|---|---|
| CPU | % utilisation, vCore usage | Detect compute bottlenecks |
| Memory | Buffer cache hit ratio, page life expectancy | Memory pressure indicators |
| I/O | Read/write IOPS, latency, throughput (MB/s) | Storage performance |
| Queries | Duration, logical reads, CPU time per query | Find expensive queries |
| Waits | Top wait types, wait time per category | Identify what queries wait on |
| Connections | Active sessions, blocked sessions | Connection pool health |
| tempdb | Usage, contention (PAGELATCH waits) | tempdb pressure |
When to capture
- Capture during normal business hours (not maintenance windows)
- Include both peak and off-peak periods
- Run for at least one full business cycle (1-2 weeks minimum)
- Re-baseline after major changes (migrations, schema changes, scaling)
Baseline DMV queries
-- CPU usage over time (Azure SQL DB)
SELECT end_time, avg_cpu_percent, avg_data_io_percent, avg_log_write_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;
-- Wait statistics baseline
SELECT wait_type, wait_time_ms, signal_wait_time_ms, waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('SLEEP_TASK','BROKER_TO_FLUSH','CLR_AUTO_EVENT')
ORDER BY wait_time_ms DESC;
-- Top queries by CPU
SELECT TOP 20
qs.total_worker_time / qs.execution_count AS avg_cpu_time,
qs.execution_count,
SUBSTRING(st.text, (qs.statement_start_offset/2)+1, 100) AS query_text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER BY avg_cpu_time DESC;
Performance metric sources
| Source | Platform | Data Type | Retention |
|---|---|---|---|
| Azure Monitor metrics | SQL DB, MI | CPU, DTU, storage, connections | 93 days |
| sys.dm_db_resource_stats | SQL DB | CPU, I/O, memory (15-sec intervals) | 1 hour |
| sys.server_resource_stats | MI | CPU, I/O, storage (5-min intervals) | 14 days |
| sys.dm_os_wait_stats | All platforms | Wait type aggregates since last reset | Until reset |
| sys.dm_exec_query_stats | All platforms | Cached query plan statistics | While plan is cached |
| Query Store | All platforms | Historical query performance | Configurable (30 days default) |
| Log Analytics | SQL DB, MI | Diagnostic logs, audit events | Configurable (up to 2 years) |
| Extended Events | All platforms | Granular event tracing | Session lifetime |
| Database watcher | SQL DB, MI | Monitoring dashboard and metrics | Configurable |
Interpreting key metrics
Sam teaches Tomas the critical thresholds:
| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| CPU % | Under 60% | 60-80% | Over 80% sustained |
| Buffer cache hit ratio | Over 99% | 97-99% | Under 97% |
| Page life expectancy | Over 300 seconds | 100-300 | Under 100 |
| Avg disk latency | Under 10ms | 10-20ms | Over 20ms |
| Batch requests/sec | Baseline + 20% | Baseline + 50% | Baseline + 100% |
Common wait types and what they mean
| Wait Type | Indicates | Action |
|---|---|---|
| CXPACKET / CXCONSUMER | Parallelism waits β uneven parallel query distribution | Tune MAXDOP, check for skewed parallelism |
| PAGEIOLATCH_SH | Waiting for data pages from disk | Add memory, improve I/O, optimise queries |
| LCK_M_X / LCK_M_S | Lock contention β blocking | Investigate blocking chains, optimise transactions |
| WRITELOG | Transaction log write latency | Move log to faster storage, reduce transaction frequency |
| SOS_SCHEDULER_YIELD | CPU pressure β queries yielding for other queries | Scale up CPU, optimise expensive queries |
| RESOURCE_SEMAPHORE | Memory grant waits β queries canβt get enough memory | Add memory, fix large sorts/hashes, check memory grants |
Exam tip: Azure SQL DB specific metrics
Azure SQL Database uses DTU percentage as a composite metric:
- DTU % = max(CPU %, Data I/O %, Log I/O %)
- If DTU % is consistently above 80%, consider scaling up
For vCore databases, check avg_cpu_percent, avg_data_io_percent, and avg_log_write_percent separately in sys.dm_db_resource_stats.
The exam may show a graph and ask βwhat is the bottleneck?β β look for the highest percentage metric.
Configuring monitoring
Azure Monitor alerts
- Navigate to the SQL database in Azure Portal β Alerts
- Create alert rules for key metrics (CPU > 80%, DTU > 90%, deadlocks > 0)
- Configure action groups (email, SMS, webhook, Logic App)
Diagnostic settings
- Database β Diagnostic settings β Add diagnostic setting
- Send to: Log Analytics workspace, Storage account, or Event Hub
- Select categories: SQLInsights, QueryStoreRuntimeStatistics, Errors, etc.
Azure SQL Analytics (Log Analytics)
- Pre-built solution for monitoring multiple Azure SQL databases
- Dashboards for DTU usage, query performance, timeouts, errors
- KQL queries for custom analysis
Tomas notices that avg_cpu_percent is steady at 30%, but avg_data_io_percent spikes to 95% during business hours. What is the bottleneck?
Kenji needs to review query performance from the past 48 hours on Azure SQL Database. sys.dm_db_resource_stats only retains 1 hour. What should he use?
π¬ Video coming soon
Next up: Database Watcher and Extended Events β deep-dive monitoring with two powerful diagnostic tools.