Azure Site Recovery & Disaster Recovery
Backups protect your data. Site Recovery protects your entire workload. It continuously replicates your VMs to a secondary region so that, when an entire Azure region goes down, it can orchestrate failover and keep your business running.
What is Azure Site Recovery?
Azure Site Recovery (ASR) is like having a complete duplicate of your office in another city. If a fire destroys the main office, everyone drives to the backup office and keeps working.
Backup protects your data (files, databases). Site Recovery protects your entire infrastructure: VMs, networking, applications. It continuously replicates your VMs to another Azure region. If the primary region goes down, you "fail over" to the secondary region. Your VMs come up there, and business continues.
The key difference: backup is about recovering data. Site Recovery is about recovering entire workloads, often within minutes.
Site Recovery vs Azure Backup
| Feature | Azure Backup | Azure Site Recovery |
|---|---|---|
| Purpose | Protect data (restore files, databases, VMs) | Protect workloads (replicate and failover entire environments) |
| Scope | Individual resources | Entire application stacks across regions |
| Recovery speed | Minutes to hours (depends on data size) | Minutes (VMs already replicated) |
| Protection against | Data corruption, accidental deletion, ransomware | Region-wide outages, site-level disasters |
| Data freshness | Point-in-time (last backup) | Near real-time (continuous replication) |
| Key metric | RPO: hours to days (backup frequency) | RPO: seconds to minutes (replication lag) |
| Vault type | Recovery Services or Backup vault | Recovery Services vault only |
RPO and RTO
Two critical metrics define your disaster recovery capabilities:
| Metric | Definition | Example |
|---|---|---|
| RPO (Recovery Point Objective) | Maximum acceptable data loss (measured in time) | RPO of 15 minutes means you can lose up to 15 minutes of data |
| RTO (Recovery Time Objective) | Maximum acceptable downtime before recovery | RTO of 1 hour means the workload must be running within 1 hour |
ASR typically achieves:
- RPO: Seconds to minutes (continuous replication)
- RTO: Minutes to an hour (depending on VM count and recovery plan complexity)
Exam tip: RPO vs RTO
RPO answers "how much data can we afford to lose?" and RTO answers "how long can we be down?" The exam often presents scenarios where you need to choose a solution based on these requirements. If a scenario needs RPO under 1 hour and RTO under 15 minutes, Site Recovery is the answer, not backup.
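The comparison behind this tip can be sketched in a few lines of Python. This is a minimal illustration, not an Azure API: `Capability`, `meets`, and `candidates` are hypothetical names, and the figures are assumptions (a daily backup schedule for Azure Backup, and the RPO/RTO from the Meridian example in this guide for Site Recovery).

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Capability:
    """RPO/RTO a solution is assumed to deliver (illustrative figures only)."""
    name: str
    rpo: timedelta  # worst-case data loss
    rto: timedelta  # worst-case downtime


# Assumed values: daily backups, and the Meridian example's ASR targets.
AZURE_BACKUP = Capability("Azure Backup", rpo=timedelta(hours=24), rto=timedelta(hours=4))
SITE_RECOVERY = Capability("Azure Site Recovery", rpo=timedelta(minutes=5), rto=timedelta(minutes=30))


def meets(cap: Capability, required_rpo: timedelta, required_rto: timedelta) -> bool:
    # A solution qualifies only if it satisfies BOTH requirements.
    return cap.rpo <= required_rpo and cap.rto <= required_rto


def candidates(required_rpo: timedelta, required_rto: timedelta) -> list:
    return [c.name for c in (AZURE_BACKUP, SITE_RECOVERY)
            if meets(c, required_rpo, required_rto)]


# Scenario: lose at most 5 minutes of data, be running again within 30 minutes.
strict = candidates(timedelta(minutes=5), timedelta(minutes=30))
# Scenario: a day of data loss and 8 hours of downtime are tolerable.
relaxed = candidates(timedelta(days=1), timedelta(hours=8))
```

Under these assumed figures, only Site Recovery satisfies the strict scenario, while both solutions satisfy the relaxed one.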
Replication architecture
When you enable Site Recovery for an Azure VM, here's what gets created:
Source region (where your VMs run):
- Original VMs, disks, and networking
- Cache storage account (stages replication data before sending to target)
Target region (where VMs fail over to):
- Recovery Services vault (manages replication and failover)
- Replica managed disks (mirrors of source disks)
- Target VNet, subnets, and NSGs (can be auto-created or you pre-create them)
- Availability set or zone configuration (matching source)
Replication flow: VM writes data to disk, the Azure Site Recovery extension captures changes, data is sent to the cache storage account, and then replicated to managed disks in the target region.
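To make the flow concrete, here is a toy Python model of the three stages (source disk, cache storage account, replica disk). It is only a sketch: `ReplicationPipeline` and its methods are hypothetical names, and real replication works on disk changes in the Azure platform, not Python lists.

```python
from collections import deque


class ReplicationPipeline:
    """Toy model: source disk -> cache storage account -> replica disk."""

    def __init__(self):
        self.source_disk = []   # writes made by the VM in the source region
        self.cache = deque()    # changes staged in the cache storage account
        self.replica_disk = []  # replica managed disk in the target region

    def write(self, block):
        # The VM writes to its disk; the captured change is staged in the cache.
        self.source_disk.append(block)
        self.cache.append(block)

    def replicate(self, max_blocks=None):
        # Send staged changes to the replica disk, oldest first.
        sent = 0
        while self.cache and (max_blocks is None or sent < max_blocks):
            self.replica_disk.append(self.cache.popleft())
            sent += 1
        return sent

    def lag(self):
        # Writes not yet on the replica: what an unplanned failover could lose.
        return len(self.source_disk) - len(self.replica_disk)


pipeline = ReplicationPipeline()
for block in ["a", "b", "c"]:
    pipeline.write(block)
pipeline.replicate(max_blocks=2)  # only two changes reach the target so far
```

In this model, `lag()` plays the role of RPO: the smaller the backlog in the cache, the less data is at risk if the source region fails.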
Real-world: Meridian Financial's DR setup
Meridian Financial runs their core banking application in Australia East (primary). Alex configures Site Recovery to replicate to Australia Southeast (secondary):
- 15 VMs across web, app, and database tiers, all replicated
- A recovery plan groups VMs into tiers: databases start first, then app servers, then web servers
- Custom scripts in the recovery plan update DNS records and reconfigure load balancers
- Test failover runs quarterly (no production impact)
- RPO: under 5 minutes. RTO: under 30 minutes.
Meridian's compliance team signs off because ASR meets their regulatory requirement of sub-1-hour recovery.
Failover types
| Failover Type | When Used | Production Impact |
|---|---|---|
| Test failover | DR drills and validation | None: creates VMs in an isolated network; production unaffected |
| Planned failover | Known event (e.g., planned maintenance in source region) | Minimal: replication ensures zero data loss |
| Unplanned failover | Actual disaster (region outage) | Some data loss possible (up to latest recovery point) |
Test failover
Test failover is critical because it validates your DR plan without affecting production:
- Select a recovery point
- Choose an isolated VNet (not your production network)
- Azure creates replica VMs in the target region
- Validate the application works correctly
- Clean up (delete the test VMs)
Exam tip: Test failover doesn't affect production
Test failover creates VMs in an isolated virtual network in the target region. It does NOT affect production VMs, replication, or any live workloads. After testing, you clean up the test VMs. The exam expects you to know that test failovers are non-disruptive and should be performed regularly.
Failback
After failing over to the secondary region, you eventually want to return to the primary region. This process is called failback:
- Re-protect: reverse replication from secondary back to primary
- Wait for replication to synchronise
- Planned failover: fail back to the primary region with zero data loss
- Re-protect again: resume normal replication from primary to secondary
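The failback sequence above behaves like a small state machine, which the following Python sketch illustrates. The state and action names (`protected`, `failed_over`, `reverse_protected`, `failed_back`, `failover`, `reprotect`) are hypothetical labels chosen for this example, not Azure terminology, and the wait for synchronisation is not modelled.

```python
# (current state, action) -> next state
TRANSITIONS = {
    ("protected", "failover"): "failed_over",            # primary -> secondary
    ("failed_over", "reprotect"): "reverse_protected",   # replicate secondary -> primary
    ("reverse_protected", "failover"): "failed_back",    # secondary -> primary
    ("failed_back", "reprotect"): "protected",           # replicate primary -> secondary
}


def step(state, action):
    """Apply one action, rejecting anything out of order."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"cannot {action} while {state}") from None


def run(state, actions):
    for action in actions:
        state = step(state, action)
    return state


# Starting just after a failover, the failback steps return to normal protection.
final_state = run("failed_over", ["reprotect", "failover", "reprotect"])
```

The sketch makes the ordering rule explicit: attempting a second failover straight after the first (without re-protecting) raises an error, which mirrors why re-protect is the first step of failback.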
Recovery plans
Recovery plans orchestrate multi-VM failover with ordering and automation:
Features:
- Group VMs into tiers (e.g., Group 1: databases, Group 2: app servers, Group 3: web servers)
- Groups fail over in order: Group 1 completes before Group 2 starts
- Add manual actions (pause for verification between groups)
- Add scripts (Azure Automation runbooks for DNS updates, load balancer configuration)
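The ordering behaviour of a recovery plan can be sketched as follows. This is a simplified model under stated assumptions, not the Site Recovery API: `Group` and `run_plan` are hypothetical names, VM names are invented, and manual actions and scripts are represented as plain Python callables that run after a group's VMs start.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Group:
    name: str
    vms: List[str]
    # Manual actions or scripts to run once this group's VMs are up.
    post_actions: List[Callable[[], str]] = field(default_factory=list)


def run_plan(groups: List[Group]) -> List[str]:
    """Process groups strictly in order and return a log of what happened."""
    log = []
    for group in groups:
        for vm in group.vms:
            log.append(f"start {vm}")
        for action in group.post_actions:
            log.append(action())
    return log


plan = [
    Group("Group 1: databases", ["db1"]),
    Group("Group 2: app servers", ["app1"],
          post_actions=[lambda: "manual: verify app tier"]),
    Group("Group 3: web servers", ["web1"],
          post_actions=[lambda: "script: update DNS"]),
]
log = run_plan(plan)
```

Because each group is fully processed before the next begins, the database VM is always started before the app server that depends on it, and the DNS update runs only after the web tier is up.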
Knowledge check
TechCorp Solutions needs to ensure their production web application can recover within 30 minutes if the entire Australia East region goes down. Data loss of up to 5 minutes is acceptable. Which solution should Alex implement?
Meridian Financial wants to validate their disaster recovery plan without affecting production. Which Site Recovery operation should they perform?
After a successful failover to the secondary region, Alex needs to return workloads to the primary region. What is the correct first step?