HA/DR Strategy: RPO, RTO, and Architecture
Plan high availability and disaster recovery strategies based on RPO/RTO requirements. Evaluate solutions for hybrid and Azure-only deployments.
Planning for disaster
HA/DR is insurance for your database.
High Availability (HA) keeps things running during small failures: a server restarts, a disk fails. Like having a spare tyre in the boot.
Disaster Recovery (DR) keeps things running during big failures: an entire data centre goes down. Like having a second car at a different location.
RPO (Recovery Point Objective) = how much data can you afford to lose? "We can lose up to 5 minutes of transactions."
RTO (Recovery Time Objective) = how fast must you recover? "We need to be back online within 1 hour."
RPO and RTO explained
| Metric | Question It Answers | Measured In | Example |
|---|---|---|---|
| RPO | How much data loss is acceptable? | Time (seconds, minutes, hours) | RPO = 5 min: lose at most 5 minutes of transactions |
| RTO | How long can the system be down? | Time (minutes, hours) | RTO = 1 hour: must be back online within 60 minutes |
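To make the two definitions concrete, here is a minimal Python sketch that derives the achieved data loss and downtime from three timestamps and compares them with the targets. The function name `evaluate_recovery` and the sample timestamps are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta


def evaluate_recovery(last_recoverable, failure, restored, rpo_target, rto_target):
    """Compare achieved data loss and downtime with the RPO and RTO targets."""
    data_loss = failure - last_recoverable  # window of lost transactions (RPO side)
    downtime = restored - failure           # time the system was unavailable (RTO side)
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo_target,
        "rto_met": downtime <= rto_target,
    }


# Hypothetical incident: last recoverable transaction at 09:57,
# failure at 10:00, service restored at 10:45.
result = evaluate_recovery(
    last_recoverable=datetime(2024, 1, 1, 9, 57),
    failure=datetime(2024, 1, 1, 10, 0),
    restored=datetime(2024, 1, 1, 10, 45),
    rpo_target=timedelta(minutes=5),
    rto_target=timedelta(hours=1),
)
print(result["rpo_met"], result["rto_met"])  # prints: True True
```

With an RPO of 5 minutes and an RTO of 1 hour, 3 minutes of lost data and 45 minutes of downtime both fall within target.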
RPO/RTO by solution
| Solution | RPO | RTO | Platform | Automatic Failover? |
|---|---|---|---|---|
| Built-in HA (local redundancy) | 0 (synchronous) | < 30 sec | SQL DB, MI | Yes |
| Zone-redundant HA | 0 (synchronous) | < 30 sec | SQL DB, MI | Yes |
| Active geo-replication | < 5 sec | < 30 sec (manual failover) | SQL DB only | No (manual) |
| Failover groups | < 5 sec | < 1 hour (automatic) | SQL DB, MI | Yes |
| Always On AG (sync) | 0 | < 1 min | SQL VMs, MI | Yes (with listener) |
| Always On AG (async) | Minutes | Minutes to hours | SQL VMs | Manual |
| Log shipping | Minutes to hours | Minutes to hours | SQL VMs | Manual |
| Backup/restore | Hours (depends on backup frequency) | Hours | All | Manual |
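As a rough illustration, the numeric rows of the table above can be encoded and filtered against a target RPO and RTO. In this Python sketch the `Solution` class, the `SOLUTIONS` list and the `candidates` function are hypothetical names. The figures are the upper bounds from the table, and rows stated only as "minutes" or "hours" are left out because they have no single number to compare.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Solution:
    name: str
    rpo_seconds: float      # upper bound on data loss, taken from the table
    rto_seconds: float      # upper bound on downtime, taken from the table
    platforms: frozenset
    automatic: bool


SOLUTIONS = [
    Solution("Built-in HA (local redundancy)", 0, 30, frozenset({"SQL DB", "MI"}), True),
    Solution("Zone-redundant HA", 0, 30, frozenset({"SQL DB", "MI"}), True),
    Solution("Active geo-replication", 5, 30, frozenset({"SQL DB"}), False),
    Solution("Failover groups", 5, 3600, frozenset({"SQL DB", "MI"}), True),
    Solution("Always On AG (sync)", 0, 60, frozenset({"SQL VMs", "MI"}), True),
]


def candidates(platform, max_rpo_seconds, max_rto_seconds, require_automatic=False):
    """Return names of solutions whose table bounds fit the given targets."""
    return [
        s.name
        for s in SOLUTIONS
        if platform in s.platforms
        and s.rpo_seconds <= max_rpo_seconds
        and s.rto_seconds <= max_rto_seconds
        and (s.automatic or not require_automatic)
    ]


# Azure SQL Database, RPO of 5 seconds, RTO of 1 minute, manual failover allowed.
print(candidates("SQL DB", max_rpo_seconds=5, max_rto_seconds=60))
```

The filter only compares numbers. It does not consider failure scope, so a local HA option can appear in the result even though it offers no protection against a regional outage.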
Azure-specific HA/DR solutions
Built-in high availability
Every Azure SQL Database and MI comes with HA, with no configuration needed:
| Tier | HA Architecture | Replicas |
|---|---|---|
| General Purpose | Remote storage with compute failover | 1 primary (failover to standby) |
| Business Critical | Local SSD with Always On AG | 1 primary + up to 3 secondary replicas (one usable for read scale-out) |
| Hyperscale | Page server architecture | 0-4 HA replicas, plus optional named replicas for read scale-out |
Zone-redundant deployments
- Spread replicas across availability zones in the same region
- Protects against data centre (zone) failures
- Available for SQL DB (Premium, Business Critical, Hyperscale) and MI (Business Critical)
Hybrid HA/DR
Kenji's hybrid strategy for NorthStar:
| Scenario | Solution |
|---|---|
| On-prem SQL Server + Azure SQL VM | Distributed AG spanning on-prem and Azure VM |
| On-prem SQL Server + Azure SQL MI | MI link for near real-time replication |
| On-prem backup to cloud | Backup to Azure Blob Storage via BACKUP TO URL |
| Gradual migration with DR | Log shipping to Azure VM during migration |
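For the backup-to-cloud row, the T-SQL statement has the general shape `BACKUP DATABASE ... TO URL = '...'`. The Python sketch below only builds that statement as a string; the helper name `backup_to_url_statement`, the storage account `northstarbackups`, the container `erp-backups` and the database `ERP` are all hypothetical. It assumes a SQL Server credential for the container already exists, which BACKUP TO URL needs in order to authenticate.

```python
def backup_to_url_statement(database, storage_account, container, file_name):
    """Build a T-SQL BACKUP DATABASE ... TO URL statement as a string."""
    url = f"https://{storage_account}.blob.core.windows.net/{container}/{file_name}"
    safe_db = database.replace("]", "]]")   # escape the bracketed identifier
    safe_url = url.replace("'", "''")       # escape the string literal
    return (
        f"BACKUP DATABASE [{safe_db}] "
        f"TO URL = '{safe_url}' "
        "WITH COMPRESSION, CHECKSUM;"
    )


# Hypothetical names, for illustration only.
statement = backup_to_url_statement("ERP", "northstarbackups", "erp-backups", "ERP_full.bak")
print(statement)
```

The generated statement would still be run on the on-prem instance, for example from a scheduled job. `COMPRESSION` and `CHECKSUM` are optional choices made here, not requirements.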
Managed Instance link
The MI link creates a near real-time replication connection between on-prem SQL Server (or Azure VM) and Azure SQL Managed Instance:
- Uses distributed availability group technology
- One-way replication: on-prem → MI (readable secondary)
- Can be used for DR (failover to MI if on-prem fails)
- Can be used for migration (cutover to MI when ready)
- SQL Server 2016+ supported as source
Testing HA/DR
A plan you've never tested is a plan that won't work. Kenji's testing procedures:
| Test | How | Frequency |
|---|---|---|
| Planned failover | Initiate failover group failover to secondary region | Quarterly |
| Backup restore | Restore a recent backup to a test server and validate | Monthly |
| Point-in-time restore | Restore to a specific time, verify data integrity | Quarterly |
| DR drill | Simulate primary region failure, verify applications connect to secondary | Annually |
| Runbook validation | Walk through DR runbook steps with the team | Semi-annually |
Testing checklist:
- Define success criteria before testing
- Notify stakeholders of the test window
- Verify application connectivity after failover
- Measure actual RTO (was it within target?)
- Verify data integrity (was RPO met?)
- Document results and update the DR runbook
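One way to "measure actual RTO" during a drill is to poll the application's connection path from the moment the failover is triggered and record how long it takes to respond again. The Python sketch below assumes a caller-supplied `probe` function that returns True once a connection succeeds. `measure_rto` and its parameters are hypothetical names, not part of any Azure tooling.

```python
import time


def measure_rto(probe, timeout_seconds, interval_seconds=1.0,
                clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it succeeds; return elapsed seconds, or None on timeout."""
    start = clock()
    while True:
        try:
            ok = bool(probe())
        except Exception:
            ok = False  # a failed connection attempt counts as "still down"
        elapsed = clock() - start
        if ok:
            return elapsed
        if elapsed >= timeout_seconds:
            return None
        sleep(interval_seconds)


# Trivial demonstration: a probe that succeeds immediately.
# In a real drill, probe would open a database connection and run a test query.
print(measure_rto(lambda: True, timeout_seconds=5))
```

The result is only as precise as the polling interval, and it measures recovery as seen by one client. Compare it with the RTO target and record it in the drill results.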
NorthStar's ERP system requires RPO of 0 (zero data loss) and RTO under 1 minute. The database runs on Azure SQL Managed Instance. Which HA solution meets these requirements?
Kenji needs DR for on-premises SQL Server 2019 to Azure, with the ability to fail over to Azure if the data centre goes down. What should he implement?
Next up: Backup and Restore: Strategy and Native Tools. Plan backup strategies and execute backups using native tools and T-SQL.