Disaster Recovery and Multi-Region

What happens if an entire Azure region goes offline? Your monitoring dashboards go dark, your session hosts disappear, and thousands of users cannot work. Disaster recovery for AVD is about making sure that “region down” does not mean “business down.”

Here is the good news: the AVD control plane (the service that brokers connections, manages host pools, and handles authentication) is Microsoft-managed and multi-region by default. If one Azure region fails, the control plane keeps running.

Here is the catch: the data plane (your session host VMs, user profiles, virtual networks, and images) is YOUR responsibility. If you deployed in only one region and that region fails, everything in it is unavailable until the region recovers.

What Needs Protection?

Before designing a DR plan, identify every component and who is responsible:

Component	Responsible party	DR mechanism
AVD control plane (brokering, gateway)	Microsoft	Built-in multi-region
Session host VMs (pooled)	You	Deploy from replicated image in secondary region
Session host VMs (personal)	You	Azure Site Recovery to secondary region
FSLogix profiles	You	Cloud Cache or geo-redundant storage replication
Golden images	You	Azure Compute Gallery cross-region replication
Virtual network and NSGs	You	Pre-configured VNet in secondary region
User access (host pool assignment)	You	Pre-configured host pools in secondary region; reassign users during failover
Active Directory/Entra ID	Shared	AD DS: deploy DCs in both regions. Entra ID: Microsoft-managed

Active-Active vs Active-Passive DR

There are two main approaches to multi-region AVD. The right choice depends on your budget and tolerance for downtime.

Active-Active

Both regions are running at all times. Users connect to the closest region automatically. If one region fails, the other absorbs the load.

How it works:

Host pools deployed in both regions (e.g., Australia East and Southeast Asia)
Users are assigned to application groups in both regions
The AVD control plane (Microsoft-managed) handles connection routing — there is no need for Azure Traffic Manager or Front Door, as AVD has its own built-in global gateway
FSLogix Cloud Cache replicates profiles to both regions in real time
If Region A fails, users connect through the secondary host pool automatically (the AVD gateway routes to healthy session hosts)
No failover delay — Region B is already running

Trade-off: You pay for compute in both regions at all times. Cost is roughly double.

Active-Passive

Only the primary region runs day to day. The secondary region has infrastructure pre-configured but VMs are deallocated or minimal.

How it works:

Primary region runs all session hosts normally
Secondary region has the VNet, NSGs, golden image, and optionally a few test hosts — but production VMs are not running
FSLogix profiles replicate to secondary region via Cloud Cache or geo-redundant storage
When primary fails, you start VMs in the secondary region and redirect users
Failover takes 15-60 minutes depending on how much is pre-staged

Trade-off: Lower cost during normal operations, but there is downtime during failover.

🏢 Raj’s APAC + Europe strategy: TerraStack has offices in Sydney and London. Raj deploys active-active: a host pool in Australia East and another in UK South. Users are assigned to application groups in both regions, so Australian users normally work from Australia East and European users from UK South, with no Azure Traffic Manager in the path. If Australia East goes down, all users fail over to UK South. FSLogix Cloud Cache keeps profiles in sync across both regions. Andrea approves the doubled compute cost because the company cannot afford downtime: every hour of outage costs more than a month of VM bills.

Aspect	Active-Active	Active-Passive
Regions in use	Both running simultaneously	Primary runs, secondary on standby
User routing	AVD gateway routes to healthy hosts across regions	Manual reassignment to secondary host pool or automated failover script
Failover time	Near-zero (seconds)	15-60 minutes to start VMs and redirect
Cost	High — paying for compute in both regions	Lower — secondary region has minimal running resources
Profile sync	Real-time via FSLogix Cloud Cache	Periodic replication (some data loss possible)
Complexity	Higher — maintain two identical environments	Lower — secondary is simpler
Best for	Mission-critical, zero-downtime requirements	Cost-sensitive environments with acceptable RTO
Data loss risk (RPO)	Near-zero	Minutes to hours depending on replication method
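
To make the cost row concrete, here is a rough sketch that compares yearly spend under the two models. Every input (the monthly compute bill, the standby fraction, the outage hours, and the cost per outage hour) is a hypothetical figure chosen for illustration, not Azure pricing, and the function names are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class DrScenario:
    """Hypothetical inputs; none of these numbers come from Azure pricing."""
    monthly_compute_one_region: float  # compute bill for one fully running region
    standby_fraction: float            # share of that bill spent on the passive region
    outage_hours_per_year: float       # expected downtime while failing over
    cost_per_outage_hour: float        # business cost of one hour of downtime


def yearly_cost_active_active(s: DrScenario) -> float:
    # Both regions run all the time, so compute is roughly doubled.
    return 2 * s.monthly_compute_one_region * 12


def yearly_cost_active_passive(s: DrScenario) -> float:
    # One full region, a small standby footprint, and the cost of failover downtime.
    compute = (1 + s.standby_fraction) * s.monthly_compute_one_region * 12
    downtime = s.outage_hours_per_year * s.cost_per_outage_hour
    return compute + downtime


scenario = DrScenario(
    monthly_compute_one_region=20_000.0,
    standby_fraction=0.10,
    outage_hours_per_year=1.0,
    cost_per_outage_hour=50_000.0,
)
print(yearly_cost_active_active(scenario))
print(yearly_cost_active_passive(scenario))
```

With these invented numbers active-passive comes out cheaper. Raise outage_hours_per_year or cost_per_outage_hour far enough and the comparison flips, which is the situation Raj describes.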

Key DR Components in Detail

Golden Image Replication

Your golden images must exist in both regions. Azure Compute Gallery supports cross-region replication — when you create a new image version, it automatically copies to target regions.

Configure this when creating or updating an image definition:

Add your secondary region as a replication target
Set the replica count (at least 1 per region)
Images replicate asynchronously — allow time before relying on the copy

FSLogix Profile DR

Profiles are the most critical user data. Two approaches:

FSLogix Cloud Cache — The preferred option. Cloud Cache writes profile data to multiple storage locations simultaneously. You configure two (or more) Azure Files shares in different regions. When a user saves a file, it is written to both locations in real time. If one region fails, the other has an up-to-date copy.

Geo-redundant storage (GRS) — Azure Files with GRS replicates data to a paired region asynchronously. Simpler to configure but has two downsides: replication lag (you may lose recent changes) and you cannot read from the secondary until a storage account failover has been performed.

Deep Dive — FSLogix Cloud Cache Configuration

Cloud Cache is configured in the FSLogix Group Policy or registry settings. The key setting is CCDLocations, which specifies multiple storage providers:

Example with two Azure Files shares: type=smb,connectionString=\\primary.file.core.windows.net\profiles;type=smb,connectionString=\\secondary.file.core.windows.net\profiles

Cloud Cache maintains a local cache on the session host and writes to both remote locations. If one provider is unreachable, the local cache keeps working and syncs when connectivity returns. The “healthy” provider is always used for reads.

Important: Cloud Cache increases local disk I/O on session hosts because it maintains a local copy. Size your OS disks accordingly.
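
If you generate the setting from a script rather than typing it by hand, the value can be assembled from a list of share paths. This is a minimal sketch: the helper name build_ccd_locations is hypothetical, and only the CCDLocations name and the type=smb,connectionString= layout come from the example above. The two-location minimum is a check added here for DR purposes.

```python
def build_ccd_locations(unc_paths: list[str]) -> str:
    """Join SMB share paths into a single CCDLocations value."""
    if len(unc_paths) < 2:
        raise ValueError("Cloud Cache DR needs at least two storage locations")
    for path in unc_paths:
        # Every provider must be a UNC path such as \\server\share.
        if not path.startswith("\\\\"):
            raise ValueError(f"not a UNC path: {path}")
    return ";".join(f"type=smb,connectionString={path}" for path in unc_paths)


# Placeholder storage account names, matching the example above.
shares = [
    r"\\primary.file.core.windows.net\profiles",
    r"\\secondary.file.core.windows.net\profiles",
]
print(build_ccd_locations(shares))
```

The printed string is the same value shown in the example, ready to be pushed out through whatever Group Policy or registry tooling you already use.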

Azure Site Recovery for Personal VMs

For personal host pools, each VM has unique data. Azure Site Recovery (ASR) continuously replicates VMs to the secondary region:

Replication is near-real-time (RPO of seconds to minutes)
Failover creates identical VMs in the secondary region
Test failover lets you validate DR without affecting production
After the primary region recovers, you can fail back

ASR is not needed for pooled hosts — you just deploy new VMs from the replicated image.

User Reassignment During Failover

AVD DR uses primary and secondary host pools with user reassignment, not Azure Traffic Manager or Front Door for session steering. The AVD control plane already has a built-in global gateway that routes connections. During failover:

Active-active: Users are assigned to app groups in both regions. The AVD gateway routes to available session hosts.
Active-passive: Reassign users to the secondary host pool’s application groups (manually or via automation script). Start VMs in the secondary region.

Note: Azure Traffic Manager and Azure Front Door are NOT used for AVD session routing — the AVD gateway handles this natively. Traffic Manager/Front Door are relevant for other Azure services but not for AVD connection brokering.
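
An active-passive failover is easier to rehearse when the steps are written down as an ordered runbook. The sketch below shows one possible shape, with each step passed in as a plain function so no Azure SDK calls are assumed. All function names are hypothetical placeholders, and the step order (start hosts, check profile storage, then reassign users) is an assumption you should adapt to your own environment.

```python
from typing import Callable

Step = tuple[str, Callable[[], bool]]


def run_failover(steps: list[Step]) -> list[str]:
    """Run steps in order and stop at the first one that reports failure."""
    completed: list[str] = []
    for name, action in steps:
        if not action():
            raise RuntimeError(f"failover halted at step: {name}")
        completed.append(name)
    return completed


# Placeholder actions; a real runbook would call your own automation here.
def start_secondary_session_hosts() -> bool:
    return True


def verify_profile_storage_reachable() -> bool:
    return True


def assign_users_to_secondary_app_groups() -> bool:
    return True


runbook: list[Step] = [
    ("start session hosts in secondary region", start_secondary_session_hosts),
    ("verify FSLogix profile storage is reachable", verify_profile_storage_reachable),
    ("assign users to secondary application groups", assign_users_to_secondary_app_groups),
]
print(run_failover(runbook))
```

Stopping at the first failed step keeps users from being reassigned to a host pool that has no running session hosts or no reachable profiles.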

Multi-Region Networking

Your secondary region needs a complete network stack:

Virtual Network — Same address space design as the primary region, but with CIDR ranges that do not overlap the primary
VNet Peering — If cross-region communication is needed (e.g., shared services)
NSG rules — Mirror the primary region’s rules
Firewall rules — Consistent outbound rules (AVD requires specific URLs to be reachable)
Domain Controllers — If using AD DS, deploy DCs in both regions. If using Entra ID only, this is handled for you.
ExpressRoute or VPN — If on-premises connectivity is required, ensure the secondary region also has a path back to corporate
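
The non-overlapping CIDR requirement is easy to check before you deploy anything. Here is a small sketch using Python's standard ipaddress module. The address plan and the helper name overlapping_pairs are hypothetical, and the on-premises range is deliberately chosen to collide with the primary VNet so the check has something to report.

```python
import ipaddress


def overlapping_pairs(address_spaces: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of named address spaces whose CIDR ranges overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in address_spaces.items()}
    names = sorted(networks)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if networks[a].overlaps(networks[b])
    ]


# Hypothetical address plan, not taken from the text.
plan = {
    "primary-vnet": "10.10.0.0/16",
    "secondary-vnet": "10.20.0.0/16",
    "on-premises": "10.0.0.0/12",
}
print(overlapping_pairs(plan))
```

Here the two VNets are clear of each other, but the on-premises range covers 10.0.0.0 through 10.15.255.255 and therefore overlaps the primary VNet. An empty list is what you want to see before peering VNets or connecting ExpressRoute or VPN.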

🌐 Priya’s multi-country resilience: NomadTech has 200 remote workers across 12 countries. Priya uses active-passive with West Europe as primary and East US as secondary. FSLogix Cloud Cache keeps profiles synced. She runs quarterly DR drills: spin up session hosts in East US, verify profiles load correctly, run a sample of apps, then tear down. When West Europe had a 4-hour outage last quarter, Priya activated the secondary region in 20 minutes. Ben (creative director) and the design team were back in Figma within half an hour.

DR Testing and Validation

A DR plan that is never tested is not a plan — it is a hope. Build regular testing into your operations:

Validation host pool — Deploy a small host pool in your secondary region. Have a test group connect to it monthly to verify images, profiles, and apps work.
ASR test failover — Azure Site Recovery has a “test failover” feature that spins up replicated VMs in an isolated network. Use it quarterly.
Profile restore drill — Restore a profile from Cloud Cache secondary or backup. Verify the user sees their expected data.
Full failover drill — Annually (or semi-annually), simulate a complete primary region failure. Redirect real users to the secondary region for a few hours.

🏛️ JC’s compliance requirement: The Federal Department of Civil Infrastructure has a mandate: DR failover must complete within 4 hours (RTO) with no more than 1 hour of data loss (RPO). Director Walsh requires documented evidence of quarterly DR tests. JC runs ASR test failovers every quarter, logs the results, and has Aisha (security auditor) sign off. The last drill achieved RTO of 22 minutes and RPO of 3 minutes — well within the mandate.
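
A drill result like JC's can be checked against the mandate in a few lines, which is handy if you log drills in a script or pipeline. This is only a sketch: the function name drill_meets_mandate is hypothetical, and the figures are the ones quoted in the scenario above.

```python
from datetime import timedelta


def drill_meets_mandate(
    measured_rto: timedelta,
    measured_rpo: timedelta,
    max_rto: timedelta,
    max_rpo: timedelta,
) -> bool:
    """A drill passes only if both recovery time and data loss are within limits."""
    return measured_rto <= max_rto and measured_rpo <= max_rpo


result = drill_meets_mandate(
    measured_rto=timedelta(minutes=22),
    measured_rpo=timedelta(minutes=3),
    max_rto=timedelta(hours=4),
    max_rpo=timedelta(hours=1),
)
print(result)
```

Both conditions must hold: a drill that recovers quickly but loses two hours of data would still fail this mandate.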

Exam Tip — RTO vs RPO

RTO (Recovery Time Objective) — How long can you be down? It is the maximum acceptable time between the disaster and full recovery. Active-active gives near-zero RTO. Active-passive RTO depends on how much is pre-staged.

RPO (Recovery Point Objective) — How much data can you lose? It is the maximum acceptable time between the last backup/replication and the disaster. Cloud Cache gives near-zero RPO. GRS may have minutes to hours of lag.

The exam often gives you RTO/RPO requirements and asks you to choose the right architecture.
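
One way to practise these questions is to turn the reasoning into a simple rule. The sketch below is a study aid, not official guidance. The 60-minute failover bound is the upper end of the 15-60 minute range quoted earlier for active-passive. The one-hour GRS lag bound is a placeholder assumption, since the text only says "minutes to hours". The function name recommend is hypothetical.

```python
from datetime import timedelta

# Worst-case figures used by the rule below.
ACTIVE_PASSIVE_FAILOVER = timedelta(minutes=60)  # upper end of the 15-60 minute range
GRS_LAG_ASSUMED = timedelta(hours=1)             # placeholder; adjust for your environment


def recommend(rto: timedelta, rpo: timedelta) -> tuple[str, str]:
    """Map RTO/RPO requirements to (architecture, profile replication)."""
    # Active-passive is acceptable only if the business can wait out a full failover.
    architecture = "active-passive" if rto >= ACTIVE_PASSIVE_FAILOVER else "active-active"
    # GRS is acceptable only with active-passive and an RPO above the assumed lag.
    use_grs = architecture == "active-passive" and rpo > GRS_LAG_ASSUMED
    return architecture, "GRS" if use_grs else "Cloud Cache"


print(recommend(timedelta(minutes=5), timedelta(minutes=1)))
print(recommend(timedelta(hours=4), timedelta(hours=1)))
print(recommend(timedelta(hours=8), timedelta(hours=24)))
```

The middle call uses a 4-hour RTO and 1-hour RPO: active-passive is fast enough, but the RPO is not comfortably above the assumed GRS lag, so the rule falls back to Cloud Cache.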

Putting It All Together — DR Checklist

Use this checklist to verify your DR plan covers everything:

Golden images replicated to secondary region via Compute Gallery
FSLogix profiles synced via Cloud Cache or GRS
VNet, NSGs, and firewall rules pre-configured in secondary region
Domain controllers (if AD DS) deployed in secondary region
Users assigned to application groups in secondary host pool
DNS failover NOT needed — AVD gateway routes natively
Personal VMs replicated with Azure Site Recovery (if applicable)
Scaling plan created for secondary region host pool
DR runbook documented with step-by-step failover procedure
Regular DR drills scheduled and logged
RTO and RPO validated against business requirements