Disaster Recovery and Multi-Region
Plan and implement disaster recovery and multi-region strategies for Azure Virtual Desktop, including backup and failover.
What happens if an entire Azure region goes offline? Your monitoring dashboards go dark, your session hosts disappear, and thousands of users cannot work. Disaster recovery for AVD is about making sure that "region down" does not mean "business down."
Here is the good news: the AVD control plane (the service that brokers connections, manages host pools, and handles authentication) is Microsoft-managed and multi-region by default. If one Azure region fails, the control plane keeps running.
Here is the catch: the data plane (your session host VMs, user profiles, virtual networks, and images) is your responsibility. If you deployed in only one region and that region fails, everything in it is unavailable until the region recovers.
What Needs Protection?
Before designing a DR plan, identify every component and who is responsible:
| Component | Responsible party | DR mechanism |
|---|---|---|
| AVD control plane (brokering, gateway) | Microsoft | Built-in multi-region |
| Session host VMs (pooled) | You | Deploy from replicated image in secondary region |
| Session host VMs (personal) | You | Azure Site Recovery to secondary region |
| FSLogix profiles | You | Cloud Cache or geo-redundant storage replication |
| Golden images | You | Azure Compute Gallery cross-region replication |
| Virtual network and NSGs | You | Pre-configured VNet in secondary region |
| DNS resolution | You | Azure Traffic Manager or Front Door for failover |
| Active Directory/Entra ID | Shared | AD DS: deploy DCs in both regions. Entra ID: Microsoft-managed |
Active-Active vs Active-Passive DR
There are two main approaches to multi-region AVD. The right choice depends on your budget and tolerance for downtime.
Active-Active
Both regions are running at all times. Users connect to the closest region automatically. If one region fails, the other absorbs the load.
How it works:
- Host pools deployed in both regions (e.g., Australia East and Southeast Asia)
- Users are directed to the nearest region via Azure Traffic Manager or Azure Front Door
- FSLogix Cloud Cache replicates profiles to both regions in real time
- If Region A fails, Traffic Manager routes all users to Region B
- No failover delay, because Region B is already running
Trade-off: You pay for compute in both regions at all times. Cost is roughly double.
Active-Passive
Only the primary region runs day to day. The secondary region has infrastructure pre-configured but VMs are deallocated or minimal.
How it works:
- Primary region runs all session hosts normally
- Secondary region has the VNet, NSGs, golden image, and optionally a few test hosts, but production VMs are not running
- FSLogix profiles replicate to secondary region via Cloud Cache or geo-redundant storage
- When primary fails, you start VMs in the secondary region and redirect users
- Failover takes 15-60 minutes depending on how much is pre-staged
Trade-off: Lower cost during normal operations, but there is downtime during failover.
Raj's APAC + Europe strategy: TerraStack has offices in Sydney and London. Raj deploys active-active: a host pool in Australia East and another in UK South. Azure Traffic Manager uses geographic routing, so Australian users connect to Australia East and European users to UK South. If Australia East goes down, all users fail over to UK South. FSLogix Cloud Cache keeps profiles in sync across both regions. Andrea approves the doubled compute cost because the company cannot afford downtime: every hour of outage costs more than a month of VM bills.
| Aspect | Active-Active | Active-Passive |
|---|---|---|
| Regions in use | Both running simultaneously | Primary runs, secondary on standby |
| User routing | Automatic: nearest region via Traffic Manager | Manual or automated failover redirection |
| Failover time | Near-zero (seconds) | 15-60 minutes to start VMs and redirect |
| Cost | High: paying for compute in both regions | Lower: secondary region has minimal running resources |
| Profile sync | Real-time via FSLogix Cloud Cache | Periodic replication (some data loss possible) |
| Complexity | Higher: maintain two identical environments | Lower: secondary is simpler |
| Best for | Mission-critical, zero-downtime requirements | Cost-sensitive environments with acceptable RTO |
| Data loss risk (RPO) | Near-zero | Minutes to hours depending on replication method |
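The cost row can be made concrete with some rough arithmetic. The sketch below compares monthly compute spend for the two designs. The host count, hourly rate, 730-hour month, and two standby hosts are illustrative assumptions rather than Azure prices, and storage, networking, and licensing are ignored.

```python
def monthly_compute_cost(hosts: int, hourly_rate: float, hours: float = 730.0) -> float:
    """Compute cost of running `hosts` VMs for `hours` hours each."""
    return hosts * hourly_rate * hours


def compare_dr_costs(hosts: int, hourly_rate: float, standby_hosts: int = 2) -> dict:
    """Compare compute spend, assuming each region is sized for the full load."""
    primary = monthly_compute_cost(hosts, hourly_rate)
    return {
        "single_region": primary,
        # Active-active: a second, fully sized region runs all the time.
        "active_active": primary * 2,
        # Active-passive: only a few standby/test hosts run in the secondary.
        "active_passive": primary + monthly_compute_cost(standby_hosts, hourly_rate),
    }


# Hypothetical figures: 20 session hosts at 0.50 per hour.
costs = compare_dr_costs(hosts=20, hourly_rate=0.50)
for design, cost in costs.items():
    print(f"{design}: {cost:,.2f}")
```

With these made-up numbers, active-active doubles the compute bill while active-passive adds about 10 percent. Substitute your own host counts and rates before drawing conclusions.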
Key DR Components in Detail
Golden Image Replication
Your golden images must exist in both regions. Azure Compute Gallery supports cross-region replication: when you create a new image version, it is copied automatically to the target regions you specify.
Configure this when creating or updating an image version:
- Add your secondary region as a replication target
- Set the replica count (at least 1 per region)
- Images replicate asynchronously, so allow time before relying on the copy
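As a rough illustration of those settings, the sketch below builds the kind of request body an image version might carry, with one entry per target region. The field names (`publishingProfile`, `targetRegions`, `regionalReplicaCount`, `storageProfile`) follow my understanding of the Compute Gallery image version resource and should be checked against the API version you use. The function name, the placeholder resource ID, and the choice to always include the home region as a target are assumptions for illustration.

```python
from typing import Dict, List


def image_version_body(source_image_id: str, home_region: str,
                       target_regions: List[str], replicas: int = 1) -> Dict:
    """Build an illustrative image version body with replication targets."""
    if replicas < 1:
        raise ValueError("each region needs at least 1 replica")
    # Keep order and drop duplicates; the home region is listed first.
    regions = list(dict.fromkeys([home_region, *target_regions]))
    return {
        "location": home_region,
        "properties": {
            "publishingProfile": {
                "targetRegions": [
                    {"name": region, "regionalReplicaCount": replicas}
                    for region in regions
                ],
            },
            "storageProfile": {"source": {"id": source_image_id}},
        },
    }


# Hypothetical placeholder ID; replace with a real image or VM resource ID.
body = image_version_body("<source-image-resource-id>", "australiaeast", ["uksouth"])
targets = body["properties"]["publishingProfile"]["targetRegions"]
print([t["name"] for t in targets])
```

Building the body as plain data keeps the sketch independent of any SDK; the same region list could be passed to whichever deployment tool you already use.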
FSLogix Profile DR
Profiles are the most critical user data. Two approaches:
FSLogix Cloud Cache: The preferred option. Cloud Cache writes profile data to multiple storage locations simultaneously. You configure two (or more) Azure Files shares in different regions. When a user saves a file, it is written to both locations in real time. If one region fails, the other has an up-to-date copy.
Geo-redundant storage (GRS): Azure Files with GRS replicates data to the paired region asynchronously. It is simpler to configure but has two downsides: replication lag (you may lose recent changes), and the secondary copy cannot be read until a storage account failover has been initiated and completed.
Deep Dive: FSLogix Cloud Cache Configuration
Cloud Cache is configured in the FSLogix Group Policy or registry settings. The key setting is CCDLocations, which specifies multiple storage providers:
Example with two Azure Files shares:
type=smb,connectionString=\\primary.file.core.windows.net\profiles;type=smb,connectionString=\\secondary.file.core.windows.net\profiles
Cloud Cache maintains a local cache on the session host and writes to both remote locations. If one provider is unreachable, the local cache keeps working and syncs when connectivity returns. Reads that miss the local cache go to the first healthy provider in the list, so list the share closest to the session hosts first.
Important: Cloud Cache increases local disk I/O on session hosts because it maintains a local copy. Size your OS disks accordingly.
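If you generate the setting from a script, a small helper avoids typos in the long string. This is a minimal sketch that reproduces the CCDLocations format shown above; the function name is hypothetical and the share paths are the example ones from this section.

```python
from typing import List


def build_ccd_locations(unc_paths: List[str]) -> str:
    """Join SMB share paths into a CCDLocations value, one provider per path."""
    if len(unc_paths) < 2:
        raise ValueError("Cloud Cache DR needs at least two storage providers")
    return ";".join(f"type=smb,connectionString={path}" for path in unc_paths)


value = build_ccd_locations([
    r"\\primary.file.core.windows.net\profiles",
    r"\\secondary.file.core.windows.net\profiles",
])
print(value)
```

The helper rejects a single path on purpose: with only one provider, Cloud Cache gives you no second copy to fail over to.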
Azure Site Recovery for Personal VMs
For personal host pools, each VM has unique data. Azure Site Recovery (ASR) continuously replicates VMs to the secondary region:
- Replication is near-real-time (RPO of seconds to minutes)
- Failover creates identical VMs in the secondary region
- Test failover lets you validate DR without affecting production
- After the primary region recovers, you can fail back
ASR is not needed for pooled hosts β you just deploy new VMs from the replicated image.
DNS-Based Failover
Users need to reach the right region. Two options:
- Azure Traffic Manager: DNS-based load balancing. Returns the IP of the healthy region. Failover happens when health probes detect the primary is down.
- Azure Front Door: Layer 7 global load balancer with faster failover (detects failures at the HTTP level, not just DNS).
Both can route users to the closest healthy region automatically.
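The failover decision itself is simple to model. The sketch below is a toy simulation of priority-style routing with health probes, not the Traffic Manager or Front Door API; the `Endpoint` class, region names, and priorities are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Endpoint:
    region: str
    priority: int   # lower number = preferred
    healthy: bool   # result of the most recent health probe


def pick_endpoint(endpoints: List[Endpoint]) -> Optional[Endpoint]:
    """Return the healthy endpoint with the lowest priority number, if any."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda e: e.priority)


endpoints = [
    Endpoint("australiaeast", priority=1, healthy=True),
    Endpoint("uksouth", priority=2, healthy=True),
]
print(pick_endpoint(endpoints).region)   # primary is preferred while healthy

endpoints[0].healthy = False             # probe detects the primary is down
print(pick_endpoint(endpoints).region)   # traffic moves to the secondary
```

In a real DNS-based setup, clients keep using a cached answer until its TTL expires, so the switch is not instantaneous.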
Multi-Region Networking
Your secondary region needs a complete network stack:
- Virtual Network: Matching address space design (but non-overlapping CIDRs with primary)
- VNet Peering: If cross-region communication is needed (e.g., shared services)
- NSG rules: Mirror the primary region's rules
- Firewall rules: Consistent outbound rules (AVD requires specific URLs to be reachable)
- Domain Controllers: If using AD DS, deploy DCs in both regions. If using Entra ID only, this is handled for you.
- ExpressRoute or VPN: If on-premises connectivity is required, ensure the secondary region also has a path back to corporate
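The non-overlapping CIDR requirement is easy to check before you deploy. The sketch below uses Python's standard `ipaddress` module; the network names and address ranges are hypothetical examples.

```python
import ipaddress
from itertools import combinations
from typing import Dict, List, Tuple


def find_overlaps(address_spaces: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return every pair of named networks whose CIDR ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in address_spaces.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(nets.items(), 2)
        if net_a.overlaps(net_b)
    ]


# Hypothetical address plan for the two regions.
plan = {"primary": "10.10.0.0/16", "secondary": "10.20.0.0/16"}
print(find_overlaps(plan))       # no clashes, so peering is possible

bad_plan = {"primary": "10.10.0.0/16", "secondary": "10.10.128.0/17"}
print(find_overlaps(bad_plan))   # secondary sits inside primary's range
```

Include on-premises ranges in the same dictionary if the regions connect back to corporate over ExpressRoute or VPN.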
Priya's multi-country resilience: NomadTech has 200 remote workers across 12 countries. Priya uses active-passive with West Europe as primary and East US as secondary. FSLogix Cloud Cache keeps profiles synced. She runs quarterly DR drills: spin up session hosts in East US, verify profiles load correctly, run a sample of apps, then tear down. When West Europe had a 4-hour outage last quarter, Priya activated the secondary region in 20 minutes. Ben (creative director) and the design team were back in Figma within half an hour.
DR Testing and Validation
A DR plan that is never tested is not a plan; it is a hope. Build regular testing into your operations:
- Validation host pool: Deploy a small host pool in your secondary region. Have a test group connect to it monthly to verify images, profiles, and apps work.
- ASR test failover: Azure Site Recovery has a "test failover" feature that spins up replicated VMs in an isolated network. Use it quarterly.
- Profile restore drill: Restore a profile from Cloud Cache secondary or backup. Verify the user sees their expected data.
- Full failover drill: Annually (or semi-annually), simulate a complete primary region failure. Redirect real users to the secondary region for a few hours.
JC's compliance requirement: The Federal Department of Civil Infrastructure has a mandate: DR failover must complete within 4 hours (RTO) with no more than 1 hour of data loss (RPO). Director Walsh requires documented evidence of quarterly DR tests. JC runs ASR test failovers every quarter, logs the results, and has Aisha (security auditor) sign off. The last drill achieved RTO of 22 minutes and RPO of 3 minutes, well within the mandate.
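Drill results like these are easy to check in a script that feeds an audit log. The sketch below compares measured RTO and RPO against the mandate using the figures from the scenario above; the function name and the shape of the result are assumptions.

```python
from datetime import timedelta


def drill_meets_targets(measured_rto: timedelta, measured_rpo: timedelta,
                        target_rto: timedelta, target_rpo: timedelta) -> dict:
    """Compare one drill's measurements against the RTO/RPO mandate."""
    return {
        "rto_ok": measured_rto <= target_rto,
        "rpo_ok": measured_rpo <= target_rpo,
        "rto_headroom": target_rto - measured_rto,
        "rpo_headroom": target_rpo - measured_rpo,
    }


# Mandate: 4 hour RTO, 1 hour RPO. Last drill: 22 minutes and 3 minutes.
result = drill_meets_targets(
    measured_rto=timedelta(minutes=22),
    measured_rpo=timedelta(minutes=3),
    target_rto=timedelta(hours=4),
    target_rpo=timedelta(hours=1),
)
print(result["rto_ok"], result["rpo_ok"])
print(result["rto_headroom"], result["rpo_headroom"])
```

Recording the headroom as well as the pass/fail flag makes it visible when drills drift toward the limit over successive quarters.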
Exam Tip: RTO vs RPO
RTO (Recovery Time Objective): How long can you be down? It is the maximum acceptable time between the disaster and full recovery. Active-active gives near-zero RTO. Active-passive RTO depends on how much is pre-staged.
RPO (Recovery Point Objective): How much data can you lose? It is the maximum acceptable time between the last backup/replication and the disaster. Cloud Cache gives near-zero RPO. GRS may have minutes to hours of lag.
The exam often gives you RTO/RPO requirements and asks you to choose the right architecture.
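One way to practise that decision is to write it down as a rule. The sketch below maps RTO/RPO requirements to an architecture and a profile replication method. The 15-minute threshold is the lower bound of the active-passive failover time quoted in the comparison table; the 1-hour GRS lag allowance is an assumption chosen for illustration, not a documented figure.

```python
from datetime import timedelta
from typing import Tuple

# Lower bound of the 15-60 minute active-passive failover time in this guide.
ACTIVE_PASSIVE_MIN_FAILOVER = timedelta(minutes=15)
# Assumed worst-case GRS replication lag; adjust to your own risk appetite.
ASSUMED_GRS_LAG = timedelta(hours=1)


def suggest_design(rto: timedelta, rpo: timedelta) -> Tuple[str, str]:
    """Suggest (architecture, profile replication) for the given targets."""
    if rto < ACTIVE_PASSIVE_MIN_FAILOVER:
        architecture = "active-active"
    else:
        architecture = "active-passive"
    if rpo <= ASSUMED_GRS_LAG:
        profiles = "Cloud Cache"
    else:
        profiles = "Cloud Cache or GRS"
    return architecture, profiles


print(suggest_design(timedelta(minutes=1), timedelta(minutes=1)))
print(suggest_design(timedelta(hours=4), timedelta(hours=1)))
print(suggest_design(timedelta(hours=8), timedelta(hours=24)))
```

Real designs also weigh cost and operational complexity, so treat the output as a starting point for the exam-style reasoning rather than a final answer.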
Putting It All Together: DR Checklist
Use this checklist to verify your DR plan covers everything:
- Golden images replicated to secondary region via Compute Gallery
- FSLogix profiles synced via Cloud Cache or GRS
- VNet, NSGs, and firewall rules pre-configured in secondary region
- Domain controllers (if AD DS) deployed in secondary region
- DNS failover configured (Traffic Manager or Front Door)
- Personal VMs replicated with Azure Site Recovery (if applicable)
- Scaling plan created for secondary region host pool
- DR runbook documented with step-by-step failover procedure
- Regular DR drills scheduled and logged
- RTO and RPO validated against business requirements
Knowledge Check
Raj is designing multi-region AVD for TerraStack. The business requires near-zero downtime and no data loss. Which architecture should he recommend?
JC needs to protect personal desktop VMs for government staff so they can be recovered in a secondary region. Which Azure service should he use?
Priya runs quarterly DR drills for NomadTech's active-passive setup. She wants to test VM failover without affecting production users. Which feature should she use?
Congratulations: You Have Completed AZ-140 Domain 4!
You have worked through all four modules of Monitor and Maintain:
- Monitoring: Azure Monitor, AVD Insights, alerts, and performance optimisation
- Autoscaling: Scaling plans, Start VM on Connect, drain mode, and session management
- Updates and Backups: Golden image patching, Azure Update Manager, FSLogix backup, and Compute Gallery versioning
- Disaster Recovery: Active-active vs active-passive, Cloud Cache, ASR, multi-region networking, and DR testing
This completes the AZ-140 study guide. You now have the knowledge foundation to design, deploy, manage, and protect Azure Virtual Desktop environments. Review the modules you found most challenging, practice with hands-on labs, and remember: the exam tests decision-making, that is, knowing which solution fits which scenario.
Good luck on your AZ-140 exam!