Performance Baselines and Monitoring Tools

Building a performance baseline

Simple explanation

A performance baseline is like knowing your resting heart rate.

If your normal heart rate is 70 BPM, then 120 BPM means something is wrong. But if you don’t know your normal, you can’t tell. A database baseline captures “normal” — CPU usage, query times, I/O rates, wait stats — so you can spot problems when things change.

Creating a baseline

Tomas builds his first baseline for CloudFirst’s client database:

What to capture

Metric Category	Key Metrics	Why It Matters
CPU	% utilisation, vCore usage	Detect compute bottlenecks
Memory	Buffer cache hit ratio, page life expectancy	Memory pressure indicators
I/O	Read/write IOPS, latency, throughput (MB/s)	Storage performance
Queries	Duration, logical reads, CPU time per query	Find expensive queries
Waits	Top wait types, wait time per category	Identify what queries wait on
Connections	Active sessions, blocked sessions	Connection pool health
tempdb	Usage, contention (PAGELATCH waits)	tempdb pressure

When to capture

Capture during normal business hours (not maintenance windows)
Include both peak and off-peak periods
Run for at least one full business cycle (1-2 weeks minimum)
Re-baseline after major changes (migrations, schema changes, scaling)

Baseline DMV queries

-- CPU usage over time (Azure SQL DB)
SELECT end_time, avg_cpu_percent, avg_data_io_percent, avg_log_write_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;

-- Wait statistics baseline
SELECT wait_type, wait_time_ms, signal_wait_time_ms, waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('SLEEP_TASK','BROKER_TO_FLUSH','CLR_AUTO_EVENT')
ORDER BY wait_time_ms DESC;

-- Top queries by CPU
SELECT TOP 20
  qs.total_worker_time / qs.execution_count AS avg_cpu_time,
  qs.execution_count,
  SUBSTRING(st.text, (qs.statement_start_offset/2)+1, 100) AS query_text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER BY avg_cpu_time DESC;

Performance metric sources

Where to Find Performance Data
Source	Platform	Data Type	Retention
Azure Monitor metrics	SQL DB, MI	CPU, DTU, storage, connections	93 days
sys.dm_db_resource_stats	SQL DB	CPU, I/O, memory (15-sec intervals)	1 hour
sys.server_resource_stats	MI	CPU, I/O, storage (5-min intervals)	14 days
sys.dm_os_wait_stats	All platforms	Wait type aggregates since last reset	Until reset
sys.dm_exec_query_stats	All platforms	Cached query plan statistics	While plan is cached
Query Store	All platforms	Historical query performance	Configurable (30 days default)
Log Analytics	SQL DB, MI	Diagnostic logs, audit events	Configurable (up to 2 years)
Extended Events	All platforms	Granular event tracing	Session lifetime
Database watcher	SQL DB, MI	Monitoring dashboard and metrics	Configurable

Interpreting key metrics

Sam teaches Tomas the critical thresholds:

Metric	Healthy	Warning	Critical
CPU %	Under 60%	60-80%	Over 80% sustained
Buffer cache hit ratio	Over 99%	97-99%	Under 97%
Page life expectancy	Over 300 seconds	100-300	Under 100
Avg disk latency	Under 10ms	10-20ms	Over 20ms
Batch requests/sec	Baseline + 20%	Baseline + 50%	Baseline + 100%

Common wait types and what they mean

Wait Type	Indicates	Action
CXPACKET / CXCONSUMER	Parallelism waits — uneven parallel query distribution	Tune MAXDOP, check for skewed parallelism
PAGEIOLATCH_SH	Waiting for data pages from disk	Add memory, improve I/O, optimise queries
LCK_M_X / LCK_M_S	Lock contention — blocking	Investigate blocking chains, optimise transactions
WRITELOG	Transaction log write latency	Move log to faster storage, reduce transaction frequency
SOS_SCHEDULER_YIELD	CPU pressure — queries yielding for other queries	Scale up CPU, optimise expensive queries
RESOURCE_SEMAPHORE	Memory grant waits — queries can’t get enough memory	Add memory, fix large sorts/hashes, check memory grants

Exam tip: Azure SQL DB specific metrics

Azure SQL Database uses DTU percentage as a composite metric:

DTU % = max(CPU %, Data I/O %, Log I/O %)
If DTU % is consistently above 80%, consider scaling up

For vCore databases, check avg_cpu_percent, avg_data_io_percent, and avg_log_write_percent separately in sys.dm_db_resource_stats.

The exam may show a graph and ask “what is the bottleneck?” — look for the highest percentage metric.

Configuring monitoring

Azure Monitor alerts

Navigate to the SQL database in Azure Portal → Alerts
Create alert rules for key metrics (CPU > 80%, DTU > 90%, deadlocks > 0)
Configure action groups (email, SMS, webhook, Logic App)

Diagnostic settings

Database → Diagnostic settings → Add diagnostic setting
Send to: Log Analytics workspace, Storage account, or Event Hub
Select categories: SQLInsights, QueryStoreRuntimeStatistics, Errors, etc.

Azure SQL Analytics (Log Analytics)

Pre-built solution for monitoring multiple Azure SQL databases
Dashboards for DTU usage, query performance, timeouts, errors
KQL queries for custom analysis

Question

What is sys.dm_db_resource_stats and how long does it retain data?

Click or press Enter to reveal answer

Answer

A DMV specific to Azure SQL Database that shows resource utilisation (CPU, I/O, memory) at 15-second intervals. It retains data for 1 hour only — use it for recent monitoring, not historical analysis.

Click to flip back

Question

What does the PAGEIOLATCH_SH wait type indicate?

Click or press Enter to reveal answer

Answer

Queries are waiting for data pages to be read from disk into memory. Indicates I/O pressure — consider adding memory (more buffer cache), moving to faster storage, or optimising queries to read fewer pages.

Click to flip back

Question

What are the three components of DTU percentage?

Click or press Enter to reveal answer

Answer

DTU % = max of (CPU %, Data I/O %, Log I/O %). It's a composite metric — the highest of the three determines the DTU usage. Consistently above 80% suggests scaling up.

Click to flip back

Knowledge Check

Tomas notices that avg_cpu_percent is steady at 30%, but avg_data_io_percent spikes to 95% during business hours. What is the bottleneck?

Knowledge Check

Kenji needs to review query performance from the past 48 hours on Azure SQL Database. sys.dm_db_resource_stats only retains 1 hour. What should he use?

Next up: Database Watcher and Extended Events — deep-dive monitoring with two powerful diagnostic tools.