Disaster Recovery Implementation
Implement HANA multi-tier replication for HA plus DR, configure Azure Site Recovery with ordered recovery plans, set up backup-based DR with geo-redundant vaults, ANF cross-region replication, and DR testing procedures.
Putting the DR plan into action
Lars spreads the DR architecture diagram across the table. "We know the strategy: HSR async for HANA, ASR for the app tier, backup for non-prod. Now we need to build it. Dr. Schmidt wants a documented, tested DR solution, not just a plan on paper."
Mei rolls up her sleeves. "Let me walk you through each implementation. The trickiest part is HANA multi-tier replication: combining HA and DR into a single replication chain."
Think of it like a chain of messengers.
The primary HANA database (Node 1) sends every message to the local backup (Node 2) instantly and waits for confirmation (synchronous, for HA). Node 2 then forwards the message to a distant backup (Node 3) without waiting for confirmation (asynchronous, for DR). If Node 1 fails, Node 2 takes over locally. If the entire city floods, Node 3 in another city has almost all the messages. Three nodes, two replication links, two different speeds.
HANA multi-tier replication (3-node topology)
The most robust HANA DR architecture uses three nodes:
Node 1 (Primary, Region A, Zone 1): active production instance. Replicates synchronously to Node 2.
Node 2 (HA Secondary, Region A, Zone 2): local HA standby. Receives synchronous replication from Node 1. If Node 1 fails, Node 2 becomes primary (automated via Pacemaker). Also replicates asynchronously to Node 3.
Node 3 (DR Tertiary, Region B): remote DR standby. Receives asynchronous replication from Node 2. If Region A is lost entirely, Node 3 is manually promoted to primary.
| Aspect | Node 1 to Node 2 (HA) | Node 2 to Node 3 (DR) |
|---|---|---|
| Replication mode | Synchronous (SYNC/SYNCMEM) | Asynchronous (ASYNC) |
| RPO | Zero | Minutes (depends on lag) |
| Failover | Automatic (Pacemaker) | Manual promotion |
| Region | Same region (different zone) | Different region (paired) |
| VM sizing | Same size as primary | Can be smaller (scale up before it carries production load) |
| Network | Low latency (intra-region) | Higher latency tolerated (inter-region) |
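The chain above can be sketched as a small Python helper that prints the `hdbnsutil` calls each node would run as `<sid>adm`. This is a minimal sketch under stated assumptions: the site names, hostnames, instance number `00`, and the `logreplay` operation mode are illustrative, not values from this module. Depending on the HANA version, Node 2 may also need to be enabled as a replication source before Node 3 registers against it, so check the SAP documentation for your release.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HsrSite:
    """One node in the replication chain (all names are hypothetical)."""
    name: str                               # HSR site name
    host: str                               # hostname other sites register against
    source: Optional["HsrSite"] = None      # upstream site; None marks the primary
    replication_mode: Optional[str] = None  # "sync", "syncmem" or "async"


def setup_command(site: HsrSite, instance: str = "00") -> str:
    """Build the hdbnsutil call to run as <sid>adm on the given site."""
    if site.source is None:
        # The primary only enables itself as a replication source.
        return f"hdbnsutil -sr_enable --name={site.name}"
    return (
        "hdbnsutil -sr_register"
        f" --name={site.name}"
        f" --remoteHost={site.source.host}"
        f" --remoteInstance={instance}"
        f" --replicationMode={site.replication_mode}"
        " --operationMode=logreplay"  # assumed operation mode
    )


# Hypothetical three-node chain: sync inside Region A, async to Region B.
node1 = HsrSite("SITE1", "hana-a-z1")
node2 = HsrSite("SITE2", "hana-a-z2", source=node1, replication_mode="sync")
node3 = HsrSite("SITE3", "hana-b-dr", source=node2, replication_mode="async")

for node in (node1, node2, node3):
    print(setup_command(node))
```

Modelling `source` as a reference to the upstream site keeps the chain explicit: Node 3 registers against Node 2, not against the primary, which is what makes the topology multi-tier.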
Exam tip: 3-node topology is the gold standard
When the exam describes a scenario needing both HA and DR for HANA, the answer is the 3-node multi-tier topology. Remember: sync for HA (local), async for DR (remote). The DR failover is always manual: you do not want automatic failover to a remote region based on a network blip.
After local failover
When Node 1 fails and Node 2 takes over locally:
- Node 2 becomes the new primary
- The async replication to Node 3 continues from Node 2
- When Node 1 is recovered, it re-registers as the secondary to Node 2
- No data loss, no DR impact
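The role changes in a local failover can be traced with a toy model. This is only an illustration of the bullet points above: the class and node names are hypothetical, and in a real cluster Pacemaker performs the takeover rather than application code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HanaChain:
    """Toy model of who plays which role in the three-node chain."""
    primary: str = "node1"
    ha_secondary: Optional[str] = "node2"
    dr_source: str = "node2"   # node feeding async replication to the DR node
    dr_target: str = "node3"

    def local_failover(self) -> str:
        """The HA secondary takes over; returns the name of the failed node."""
        if self.ha_secondary is None:
            raise RuntimeError("no HA secondary available for takeover")
        failed, self.primary = self.primary, self.ha_secondary
        self.ha_secondary = None
        return failed

    def rejoin(self, recovered: str) -> None:
        """The recovered node re-registers as secondary to the new primary."""
        self.ha_secondary = recovered


chain = HanaChain()
failed_node = chain.local_failover()   # node2 is now primary
chain.rejoin(failed_node)              # node1 comes back as HA secondary
print(chain)
```

Note that `dr_source` never changes: Node 2 was already feeding Node 3 before the failover, which is why the DR link is unaffected.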
After regional failover
When Region A is completely lost:
- Confirm Region A is genuinely unavailable (not a transient issue)
- Promote Node 3 to primary in Region B
- Start application servers in Region B (via ASR or automation)
- Update DNS to point to Region B
- Node 3 is now the standalone primary until Region A recovers
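The regional failover steps above can be sketched as an ordered runbook that refuses to run until a human has confirmed the outage. The function and step names are hypothetical, and `execute` stands in for whatever automation actually performs each step (for the promotion, that is typically `hdbnsutil -sr_takeover` run on Node 3).

```python
from typing import Callable, List

# Ordered steps taken from the procedure above, after the outage is confirmed.
REGIONAL_FAILOVER_STEPS = [
    "promote Node 3 to primary in Region B",
    "start application servers in Region B",
    "update DNS to point to Region B",
]


def regional_failover(outage_confirmed: bool,
                      execute: Callable[[str], None]) -> List[str]:
    """Run the steps in order, but only after a confirmed regional outage."""
    if not outage_confirmed:
        # Guard against failing over to another region on a transient issue.
        raise RuntimeError("Region A outage not confirmed; refusing to promote Node 3")
    completed = []
    for step in REGIONAL_FAILOVER_STEPS:
        execute(step)
        completed.append(step)
    return completed


completed = regional_failover(outage_confirmed=True, execute=print)
```

Making the confirmation an explicit argument mirrors the manual nature of DR failover: nothing in the sketch can promote Node 3 on its own.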
ASR recovery plans for the application tier
Azure Site Recovery recovery plans define the failover sequence for the entire SAP landscape:
Group 1: HANA database (manual step: confirm HSR failover or promote DR node)
Group 2: ASCS/SCS (ASR failover: create ASCS VM from replicated disks)
Group 3: Application servers (ASR failover: create app server VMs from replicated disks)
Group 4: Web Dispatcher (ASR failover: create Web Dispatcher from replicated disks)
Each group waits for the previous group to complete before starting. Custom scripts can be added between groups (e.g., update DNS, configure load balancer, start SAP services).
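The sequencing can be illustrated with a short simulation. This is not the ASR API: the function names and the choice of which group each script is attached to are assumptions made for the example. It only shows that groups run strictly in order and that custom actions run after their group completes.

```python
from typing import Callable, Dict, List, Tuple

# Group order as listed above.
GROUPS: List[Tuple[str, str]] = [
    ("Group 1", "HANA database"),
    ("Group 2", "ASCS/SCS"),
    ("Group 3", "Application servers"),
    ("Group 4", "Web Dispatcher"),
]


def start_sap_services() -> str:
    """Placeholder for a script that starts SAP services after VM boot."""
    return "SAP services started"


def update_dns() -> str:
    """Placeholder for a script that repoints DNS to the DR region."""
    return "DNS updated"


def run_plan(groups: List[Tuple[str, str]],
             post_actions: Dict[str, List[Callable[[], str]]]) -> List[str]:
    """Fail over each group in order, then run that group's post-actions."""
    log = []
    for group, tier in groups:
        log.append(f"{group}: fail over {tier}")
        for action in post_actions.get(group, []):
            log.append(f"{group}: post-action, {action()}")
    return log


# Hypothetical placement of the custom scripts.
plan_log = run_plan(GROUPS, {"Group 3": [start_sap_services],
                             "Group 4": [update_dns]})
for line in plan_log:
    print(line)
```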
Lars nods. "So the recovery plan ensures HANA is up before ASCS tries to connect, and ASCS is up before the application servers start looking for it."
Mei confirms. "Exactly. SAP has a strict startup order. Recovery plans enforce it automatically."
Custom scripts in recovery plans
ASR recovery plans support pre-actions and post-actions on each group: either manual actions or scripts run as Azure Automation runbooks (for example, PowerShell runbooks). Common SAP scripts include updating DNS records, configuring Azure Load Balancer in the DR region, starting SAP services after VM boot, and running post-failover health checks. Without these scripts, the DR failover cannot be fully automated.
Backup-based DR
For non-production systems or as a last resort for production:
- Azure Backup with GRS vaults: backups are replicated to the paired region automatically
- Cross-Region Restore: restore VMs and databases in the DR region directly from the GRS vault
- HANA backup via Backint can be restored to a new VM in the DR region
- RPO is bounded by the backup interval (typically hours): anything written after the last successful backup is lost
- RTO is hours (restore time + system startup + configuration)
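The RPO and RTO reasoning above is simple arithmetic, sketched here with made-up durations. The 4-hour backup interval and the restore, startup, and configuration times are hypothetical inputs, and the sketch ignores any lag in replicating the vault to the paired region.

```python
from datetime import timedelta


def worst_case_rpo(backup_interval: timedelta) -> timedelta:
    """Data written just after a backup is lost until the next one runs."""
    return backup_interval


def estimated_rto(restore: timedelta, startup: timedelta,
                  configuration: timedelta) -> timedelta:
    """RTO = restore time + system startup + configuration."""
    return restore + startup + configuration


# Hypothetical figures for a non-production system.
rpo = worst_case_rpo(timedelta(hours=4))
rto = estimated_rto(restore=timedelta(hours=3),
                    startup=timedelta(minutes=30),
                    configuration=timedelta(hours=1))
print(f"worst-case RPO: {rpo}, estimated RTO: {rto}")
```

Both results land in the range of hours, which is why this approach suits non-production systems better than production HANA.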
ANF cross-region replication
Azure NetApp Files supports cross-region replication for volumes:
- Replicates ANF volumes to a paired region on a schedule (10-minute, hourly, or daily)
- Useful for HANA shared volumes, transport directories, and backup staging
- The destination volume is read-only until failover
- During DR, break the replication and make the destination volume read-write
- Complements HSR: use ANF replication for /hana/shared and HSR for database data
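The read-only-until-break behavior can be modelled in a few lines. This is a conceptual sketch, not the ANF API: the class name, the schedule labels, and the mount path are taken from the text or invented for the example.

```python
from dataclasses import dataclass

# Schedule labels as described above (not API values).
SCHEDULES = ("10-minute", "hourly", "daily")


@dataclass
class ReplicatedVolume:
    """Toy model of a cross-region replication destination volume."""
    mount_path: str
    schedule: str
    replicating: bool = True

    def __post_init__(self) -> None:
        if self.schedule not in SCHEDULES:
            raise ValueError(f"unsupported schedule: {self.schedule}")

    @property
    def writable(self) -> bool:
        # The destination stays read-only while replication is active.
        return not self.replicating

    def break_replication(self) -> None:
        """DR step: stop replication so the destination becomes read-write."""
        self.replicating = False


shared = ReplicatedVolume("/hana/shared", "10-minute")
was_writable = shared.writable      # False while replication is active
shared.break_replication()
print(was_writable, shared.writable)
```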
DR testing
Lars insists. "An untested DR plan is not a plan. It is a wish. How do we test without disrupting production?"
Testing approaches:
- ASR test failover: creates VMs in an isolated VNet in the DR region without affecting production replication
- HANA DR node test: temporarily promote the DR node, verify data consistency, then re-register it as secondary
- Tabletop exercise: walk through the runbook with all stakeholders without executing it
- Full DR drill: execute the complete failover procedure in a scheduled maintenance window
Document everything:
- Failover time (actual RTO achieved)
- Data consistency verification (actual RPO verified)
- Client reconnection behavior
- Issues encountered and resolutions
- Update the runbook based on findings
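One way to capture these measurements is a small record that derives the achieved RTO and RPO from timestamps noted during the drill. The field names, the timestamps, and the 4-hour RTO target are hypothetical. The 15-minute RPO target is borrowed from the GlobalPharma scenario in the knowledge check.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class DrTestResult:
    """Timestamps and findings recorded during a DR drill."""
    failover_started: datetime
    service_restored: datetime
    last_commit_on_dr: datetime          # newest transaction found on the DR node
    issues: List[str] = field(default_factory=list)

    @property
    def actual_rto(self) -> timedelta:
        return self.service_restored - self.failover_started

    @property
    def actual_rpo(self) -> timedelta:
        return self.failover_started - self.last_commit_on_dr

    def meets(self, rto_target: timedelta, rpo_target: timedelta) -> bool:
        return self.actual_rto <= rto_target and self.actual_rpo <= rpo_target


# Hypothetical drill: failover starts at 10:00, service is back at 11:30,
# and the newest commit found on the DR node is from 09:52.
drill = DrTestResult(
    failover_started=datetime(2024, 1, 1, 10, 0),
    service_restored=datetime(2024, 1, 1, 11, 30),
    last_commit_on_dr=datetime(2024, 1, 1, 9, 52),
    issues=["DNS TTL delayed client reconnection"],
)
passed = drill.meets(rto_target=timedelta(hours=4),
                     rpo_target=timedelta(minutes=15))
print(drill.actual_rto, drill.actual_rpo, passed)
```

Keeping the raw timestamps rather than only the derived durations makes it easier to compare drills over time and to audit how the figures were obtained.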
Knowledge check
GlobalPharma needs both HA (RPO=0) and DR (RPO less than 15 minutes) for HANA. What topology should Lars implement?
Mei is creating an ASR recovery plan for SAP. What should be the correct order of failover groups?
Lars wants to test DR without impacting production. Which method should he use?
Summary
You have now completed Domain 3. You can implement the full HA/DR stack: HANA 3-node multi-tier replication for combined HA and DR, ASR recovery plans with ordered startup for the application tier, backup-based DR for non-production, ANF cross-region replication for shared volumes, and DR testing procedures. Combined with the ASCS HA and HANA HSR from earlier modules, your SAP system is resilient against both component failures and regional disasters.
Next up is Domain 4: keeping the SAP system running day after day. We will cover monitoring, backup, security, cost optimization, and ongoing operations.