HA/DR Strategy: RPO, RTO, and Architecture

Planning for disaster

Simple explanation

HA/DR is insurance for your database.

High Availability (HA) keeps things running during small failures — a server restarts, a disk fails. Like having a spare tyre in the boot.

Disaster Recovery (DR) keeps things running during big failures — an entire data centre goes down. Like having a second car at a different location.

RPO (Recovery Point Objective) = how much data can you afford to lose? “We can lose up to 5 minutes of transactions.”

RTO (Recovery Time Objective) = how fast must you recover? “We need to be back online within 1 hour.”

RPO and RTO explained

Metric	Question It Answers	Measured In	Example
RPO	How much data loss is acceptable?	Time (seconds, minutes, hours)	RPO = 5 min → lose at most 5 minutes of transactions
RTO	How long can the system be down?	Time (minutes, hours)	RTO = 1 hour → must be back online within 60 minutes
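The two objectives can be checked with simple arithmetic. The sketch below is illustrative only: it assumes a periodic-backup setup where the worst-case data loss is roughly the interval between backups, and the function names are hypothetical helpers, not part of any Azure or SQL Server tooling.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """Assumption: a failure just before the next backup loses
    everything written since the previous backup."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True if the worst-case data loss stays within the RPO."""
    return worst_case_data_loss(backup_interval) <= rpo


def meets_rto(expected_downtime: timedelta, rto: timedelta) -> bool:
    """True if the expected recovery time stays within the RTO."""
    return expected_downtime <= rto
```

For example, `meets_rpo(timedelta(minutes=15), timedelta(minutes=5))` returns `False`: backups every 15 minutes cannot satisfy an RPO of 5 minutes, so either the backup frequency or the chosen solution has to change.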

RPO/RTO by solution

HA/DR Solutions: RPO and RTO
Solution	RPO	RTO	Platform	Automatic Failover?
Built-in HA (local redundancy)	0 (synchronous)	< 30 sec	SQL DB, MI	Yes
Zone-redundant HA	0 (synchronous)	< 30 sec	SQL DB, MI	Yes
Active geo-replication	< 5 sec	< 30 sec (manual failover)	SQL DB only	No (manual)
Failover groups	< 5 sec	< 1 hour (automatic)	SQL DB, MI	Yes
Always On AG (sync)	0	< 1 min	SQL VMs, MI	Yes (with listener)
Always On AG (async)	Minutes	Minutes to hours	SQL VMs	Manual
Log shipping	Minutes to hours	Minutes to hours	SQL VMs	Manual
Backup/restore	Hours (depends on backup frequency)	Hours	All	Manual
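One way to use the table is as a lookup: state the required RPO and RTO, then keep only the solutions whose figures fit. The sketch below is a rough illustration, not a sizing tool. It encodes only the rows that give numeric values, treats each "<" figure as a ceiling in seconds, and uses hypothetical names (`Solution`, `candidates`). Platform availability is not modelled and still has to be checked against the Platform column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Solution:
    name: str
    rpo_seconds: float        # ceiling taken from the RPO column
    rto_seconds: float        # ceiling taken from the RTO column
    automatic_failover: bool


# Rows with numeric values only; the "Minutes"/"Hours" rows are omitted
# because the table gives no concrete bound for them.
SOLUTIONS = [
    Solution("Built-in HA (local redundancy)", 0, 30, True),
    Solution("Zone-redundant HA", 0, 30, True),
    Solution("Active geo-replication", 5, 30, False),
    Solution("Failover groups", 5, 3600, True),
    Solution("Always On AG (sync)", 0, 60, True),
]


def candidates(max_rpo_seconds, max_rto_seconds, require_automatic=False):
    """Return the names of solutions whose ceilings fit both targets."""
    return [
        s.name
        for s in SOLUTIONS
        if s.rpo_seconds <= max_rpo_seconds
        and s.rto_seconds <= max_rto_seconds
        and (s.automatic_failover or not require_automatic)
    ]
```

With these assumptions, `candidates(0, 60)` keeps the two built-in HA rows and Always On AG (sync), while `candidates(5, 30, require_automatic=True)` drops active geo-replication because its failover is manual.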

Azure-specific HA/DR solutions

Built-in high availability

Every Azure SQL Database and Azure SQL Managed Instance (MI) deployment includes built-in HA, with no configuration needed:

Tier	HA Architecture	Replicas
General Purpose	Remote storage with compute failover	1 primary (failover to standby)
Business Critical	Local SSD with Always On AG	1 primary + up to 3 secondary replicas (one can serve read-only queries)
Hyperscale	Page server architecture	0-4 HA replicas (named replicas for read scale-out are configured separately)

Zone-redundant deployments

Spread replicas across availability zones in the same region
Protects against data centre (zone) failures
Available for SQL DB (Premium, Business Critical, Hyperscale) and MI (Business Critical)

Hybrid HA/DR

Kenji’s hybrid strategy for NorthStar:

Scenario	Solution
On-prem SQL Server + Azure SQL VM	Distributed AG spanning on-prem and Azure VM
On-prem SQL Server + Azure SQL MI	MI link for near real-time replication
On-prem backup to cloud	Backup to Azure Blob Storage via BACKUP TO URL
Gradual migration with DR	Log shipping to Azure VM during migration

Managed Instance link

The MI link creates a near real-time replication connection between on-prem SQL Server (or Azure VM) and Azure SQL Managed Instance:

Uses distributed availability group technology
One-way replication: on-prem → MI (readable secondary)
Can be used for DR (failover to MI if on-prem fails)
Can be used for migration (cutover to MI when ready)
SQL Server 2016+ supported as source

Testing HA/DR

A plan you’ve never tested is a plan that won’t work. Kenji’s testing procedures:

Test	How	Frequency
Planned failover	Initiate failover group failover to secondary region	Quarterly
Backup restore	Restore a recent backup to a test server and validate	Monthly
Point-in-time restore	Restore to a specific time, verify data integrity	Quarterly
DR drill	Simulate primary region failure, verify applications connect to secondary	Annually
Runbook validation	Walk through DR runbook steps with the team	Semi-annually

Testing checklist:

Define success criteria before testing
Notify stakeholders of the test window
Verify application connectivity after failover
Measure actual RTO (was it within target?)
Verify data integrity (was RPO met?)
Document results and update the DR runbook
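The "measure actual RTO" and "was RPO met?" steps come down to comparing timestamps recorded during the drill. The sketch below assumes the team notes three moments by hand or from logs. The class and field names are hypothetical and are not taken from Kenji's runbook.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrillResult:
    failure_at: datetime                # when the primary was taken down
    restored_at: datetime               # when the application could connect again
    last_recovered_commit_at: datetime  # newest transaction present after failover

    @property
    def actual_rto(self) -> timedelta:
        """Downtime observed during the drill."""
        return self.restored_at - self.failure_at

    @property
    def actual_rpo(self) -> timedelta:
        """Window of transactions lost, never negative."""
        return max(self.failure_at - self.last_recovered_commit_at, timedelta(0))


def evaluate(result: DrillResult, rpo_target: timedelta, rto_target: timedelta) -> dict:
    """Summarise the drill against its targets for the runbook record."""
    return {
        "actual_rto": result.actual_rto,
        "rto_met": result.actual_rto <= rto_target,
        "actual_rpo": result.actual_rpo,
        "rpo_met": result.actual_rpo <= rpo_target,
    }
```

A drill with 40 seconds of downtime and 3 seconds of missing transactions would pass targets of RTO = 1 hour and RPO = 5 seconds, and the returned dictionary can be pasted into the documented results.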

Question

What is the difference between RPO and RTO?

Answer

RPO = how much data loss is acceptable (measured in time). RTO = how long the system can be down before it must be back online (measured in time). Lower values for either mean higher cost and complexity.

Question

What HA architecture does Azure SQL Database Business Critical tier use?

Answer

Always On Availability Group with local SSD storage. 1 primary + up to 3 secondary replicas, one of which can serve read-only queries. Synchronous replication within the cluster. RPO = 0.

Question

What is the MI link used for?

Answer

Near real-time replication from on-prem SQL Server (or Azure VM) to Azure SQL Managed Instance. Uses distributed AG technology. One-way replication for DR or migration purposes.

Knowledge Check

NorthStar's ERP system requires RPO of 0 (zero data loss) and RTO under 1 minute. The database runs on Azure SQL Managed Instance. Which HA solution meets these requirements?

Knowledge Check

Kenji needs DR for on-premises SQL Server 2019 to Azure, with the ability to fail over to Azure if the data centre goes down. What should he implement?
