High Availability for Compute

Designing compute high availability

Simple explanation

High availability means your application keeps running when things break. The question is: what level of failure can you survive?

Rack failure: Availability sets spread VMs across fault domains (different racks/power/network).

Data centre failure: Availability zones spread VMs across physically separate buildings.

Region failure: Multi-region deployment with load balancing/failover.

Each level costs more but protects against bigger disasters.

Compute HA comparison

Compute High Availability Options
Factor	Availability Sets	Availability Zones	Multi-Region
Protects against	Rack/hardware failure, planned maintenance	Data centre failure	Region-wide failure
SLA (VMs)	99.95%	99.99%	99.99%+ (architecture dependent)
Latency between instances	Sub-millisecond (same data centre)	~2ms (same region, different DC)	Variable (cross-region, 10-100+ ms)
Data replication	N/A — compute only	No inherent compute-state replication — depends on storage/data layer (ZRS disks, app-level replication)	Asynchronous (cross-region, app/storage dependent)
Cost	No extra cost (just placement)	No extra cost for zones	Double compute + storage + networking
Complexity	Low	Low-medium	High
Best for	Legacy apps, regions without zones	Standard production workloads	Mission-critical, global applications
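
To make the SLA row concrete, the percentages can be converted into a monthly downtime budget. The sketch below assumes an average month of 730 hours (365 days / 12); the function name and that constant are illustrative choices, not part of any Azure tooling.

```python
# Average month length used for SLA arithmetic: 365 days / 12 = 730 hours.
HOURS_PER_MONTH = 730  # assumption; a 30-day month gives slightly lower figures


def allowed_downtime_minutes(sla_percent: float, hours: float = HOURS_PER_MONTH) -> float:
    """Return the downtime budget in minutes for a given SLA percentage."""
    unavailable_fraction = 1 - sla_percent / 100
    return hours * 60 * unavailable_fraction


for name, sla in [("Availability Sets", 99.95), ("Availability Zones", 99.99)]:
    print(f"{name}: {sla}% -> {allowed_downtime_minutes(sla):.2f} min/month")
```

Under that assumption, 99.95% allows about 21.9 minutes of downtime per month and 99.99% allows about 4.38 minutes.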

Exam tip: Availability Zones are the default answer for most scenarios

If the exam says “high availability” without mentioning regional DR, Availability Zones is almost always the correct answer. Zones provide a 99.99% SLA with minimal complexity and no extra cost. Availability Sets (99.95%) are for regions that don’t support zones or for legacy configurations. Multi-region is only needed when the scenario explicitly requires protection against a regional failure.
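
The rule of thumb in this tip can be written as a small decision function. This is only a sketch of the reasoning: the function and parameter names are hypothetical, and real designs weigh more factors (cost, latency, data layer) than these two flags.

```python
def choose_compute_ha(needs_region_failure_protection: bool,
                      region_supports_zones: bool) -> str:
    """Map the exam-tip rules to one of the three options in the comparison table."""
    if needs_region_failure_protection:
        # Only an explicit regional-failure requirement justifies multi-region.
        return "Multi-Region"
    if region_supports_zones:
        # Default answer for "high availability" scenarios.
        return "Availability Zones"
    # Fallback for regions without zones or legacy configurations.
    return "Availability Sets"


print(choose_compute_ha(needs_region_failure_protection=False, region_supports_zones=True))
```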

PaaS high availability

Not just VMs — PaaS services have their own HA patterns:

Service	HA Mechanism	SLA
App Service	Zone-redundant deployment (3+ instances across zones)	99.99%
Azure Functions	Zone-redundant (Premium/Dedicated plan)	99.99%
AKS	Zone-redundant node pools + system pods	99.95% for the API server (with zones); nodes follow the VM SLA
Container Apps	Built-in zone redundancy	99.95%
Azure SQL	Zone-redundant configuration (Business Critical/Hyperscale)	99.995%
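
When a request has to pass through several services, the end-to-end availability is lower than any single SLA in the table. A common estimate, assuming the services fail independently and all of them must be up, is to multiply the individual availabilities. The sketch below applies that estimate to a hypothetical App Service plus Azure SQL path; the function name is illustrative.

```python
from math import prod


def composite_sla(sla_percents: list[float]) -> float:
    """Composite availability (%) of services that must ALL be up (serial chain)."""
    return prod(p / 100 for p in sla_percents) * 100


# Hypothetical request path: App Service (99.99%) calling Azure SQL (99.995%).
print(f"{composite_sla([99.99, 99.995]):.4f}%")
```

For these two figures the estimate comes out at roughly 99.985%, which is why each added dependency in a design should be justified.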

🚀 Marcus’s HA architecture: NovaSaaS uses zone-redundant everything:

AKS with node pools spread across 3 zones
Azure SQL Business Critical with zone-redundant replicas
Azure Cache for Redis with zone redundancy
Result: any single data centre can fail without customer impact

Virtual Machine Scale Sets (VMSS)

VMSS provides auto-scaling with built-in HA:

Feature	Description
Auto-scale	Scale out (add VMs) and in (remove VMs) based on metrics or schedule
Zone spreading	Automatically distributes VMs across availability zones
Rolling upgrades	Update VMs in batches without downtime
Overprovisioning	Create extra VMs during scale-out, delete extras once target is healthy
Orchestration modes	Uniform (identical VMs) or Flexible (mixed VM sizes)
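
The rolling-upgrade idea of updating VMs in batches can be illustrated with a short sketch. It only shows how a fleet might be split into batches of a given percentage; it is not how VMSS implements upgrades, and the function name and instance names are hypothetical.

```python
def rolling_batches(instance_ids: list[str], batch_percent: int) -> list[list[str]]:
    """Split instances into upgrade batches of at most batch_percent of the fleet."""
    if not 0 < batch_percent <= 100:
        raise ValueError("batch_percent must be in 1..100")
    if not instance_ids:
        return []
    # Integer arithmetic; always upgrade at least one instance per batch.
    batch_size = max(1, len(instance_ids) * batch_percent // 100)
    return [instance_ids[i:i + batch_size]
            for i in range(0, len(instance_ids), batch_size)]


vms = [f"vm-{n}" for n in range(10)]  # hypothetical instance names
for batch in rolling_batches(vms, 20):
    # In a real rollout, the batch would be upgraded and health-checked here
    # before moving on to the next one.
    print(batch)
```

With 10 instances and a 20% batch size, this yields five batches of two, so at least 80% of the fleet keeps serving traffic at any moment.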

🏗️ Priya’s VMSS design: GlobalTech’s web tier:

Flexible orchestration with zone spreading across 3 availability zones
Auto-scale rules: scale out at 70% CPU, scale in at 30%, min 6 / max 30 instances
Health probes on the load balancer — unhealthy VMs auto-replaced
Rolling upgrade policy — update 20% at a time, wait for health confirmation
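
The auto-scale rules in this design can be sketched as a single evaluation step. The thresholds and limits come from the design above; the step size of one instance per evaluation and the function name are assumptions made for illustration, and real autoscale settings also include cooldown periods and metric time windows.

```python
def desired_instances(current: int, avg_cpu_percent: float,
                      scale_out_at: float = 70, scale_in_at: float = 30,
                      minimum: int = 6, maximum: int = 30, step: int = 1) -> int:
    """Return the next instance count for one autoscale evaluation."""
    if avg_cpu_percent >= scale_out_at:
        target = current + step
    elif avg_cpu_percent <= scale_in_at:
        target = current - step
    else:
        target = current  # between thresholds: hold steady to avoid flapping
    # Clamp to the configured floor and ceiling.
    return max(minimum, min(maximum, target))


print(desired_instances(current=6, avg_cpu_percent=85))   # scales out
print(desired_instances(current=6, avg_cpu_percent=10))   # already at the minimum
```

The gap between the 70% and 30% thresholds keeps the scale set from oscillating. Assuming even spreading, the minimum of 6 instances means 2 per zone, so losing one zone still leaves 4 instances running.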

Multi-region HA patterns

Pattern	Description	RTO	Cost
Active-Active	Both regions serve traffic simultaneously	~0 (seamless)	Highest (full capacity both regions)
Active-Passive (hot standby)	Secondary region running but not serving traffic	Minutes	High (idle compute in secondary)
Active-Passive (warm standby)	Secondary has reduced capacity, scales up on failover	10-30 minutes	Medium (reduced compute in secondary)
Active-Passive (cold standby)	Secondary deployed but VMs stopped (deallocated)	Hours	Lowest (storage only until failover)

🏦 Elena’s multi-region design: Trading platform uses Active-Active (both regions handle trades). Customer portal uses Active-Passive hot standby (secondary ready to take over in minutes).
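
The trade-off in the pattern table can be expressed as "pick the cheapest pattern whose RTO meets the target". The sketch below encodes that idea. The minute values are illustrative readings of the table's RTO column (for example, "Hours" is taken as 240 minutes), not published figures, and the names are hypothetical.

```python
# (pattern, assumed worst-case RTO in minutes), ordered from cheapest to most expensive.
# The minute values are illustrative readings of the table, not published figures.
PATTERNS = [
    ("Active-Passive (cold standby)", 240),  # "Hours"
    ("Active-Passive (warm standby)", 30),   # "10-30 minutes"
    ("Active-Passive (hot standby)", 5),     # "Minutes"
    ("Active-Active", 0),                    # "~0"
]


def cheapest_pattern(rto_target_minutes: float) -> str:
    """Pick the lowest-cost pattern whose assumed RTO meets the target."""
    for name, rto in PATTERNS:
        if rto <= rto_target_minutes:
            return name
    raise ValueError("RTO target must be non-negative")


print(cheapest_pattern(0))    # a trading platform that cannot tolerate downtime
print(cheapest_pattern(10))   # a portal that can be offline for a few minutes
```

Under these assumptions the two calls select Active-Active and hot standby respectively, matching the choices in the design above.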

Knowledge check

Question

What SLA do Availability Zones provide for VMs?

Answer

99.99% (about 4.38 minutes downtime per month). This is higher than Availability Sets (99.95%) because zones are physically separate data centres, protecting against entire data centre failures — not just rack failures.

Question

What's the difference between Active-Active and Active-Passive multi-region?

Answer

Active-Active: both regions serve traffic simultaneously, giving near-zero RTO at the highest cost. Active-Passive: the secondary region is on standby (hot, warm, or cold), so RTO depends on how ready it is. Active-Active requires a stateless design or data synchronisation between regions.

Question

When should you choose Application Gateway over Azure Load Balancer?

Answer

Application Gateway operates at Layer 7 (HTTP/HTTPS): use it when you need URL-based routing, SSL offloading, cookie-based session affinity, or WAF. Azure Load Balancer operates at Layer 4 (TCP/UDP): use it for non-HTTP protocols or when you need ultra-low latency. Both can be deployed zone-redundantly, so zone redundancy alone does not decide between them. For web apps, Application Gateway is usually the right choice.

Knowledge Check

🏗️ GlobalTech needs their customer-facing web application to survive a data centre failure with 99.99% availability. They want automatic scaling during peak hours. The application is stateless. What should Priya recommend?

Knowledge Check

🚀 NovaSaaS needs their API tier to survive a full Azure region outage with automatic failover in under 60 seconds. The APIs are stateless and deployed to two regions. Users connect via custom domain with HTTPS. Which load balancing solution should Marcus use?

Next up: Compute is highly available. Now let’s do the same for data in High Availability for Data.