High Availability for Compute
Availability sets, availability zones, region pairs, and VMSS: design compute architectures that survive failures at every level, from a single rack to an entire region.
Designing compute high availability
High availability means your application keeps running when things break. The question is: what level of failure can you survive?
Rack failure: Availability sets spread VMs across fault domains (different racks/power/network).
Data centre failure: Availability zones spread VMs across physically separate buildings.
Region failure: Multi-region deployment with load balancing/failover.
Each level costs more but protects against bigger disasters.
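
To make the three failure levels concrete, here is a minimal sketch that models each VM by its region, zone, and rack, then checks which instances are left after a failure. The `Placement` class, the `survivors` function, and names such as `region-a` are hypothetical illustrations, not Azure APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Where one VM instance runs (all names here are hypothetical)."""
    region: str
    zone: str
    rack: str


def survivors(instances, failed_region, failed_zone=None, failed_rack=None):
    """Return the instances still running after a failure.

    Passing only failed_region models a region outage; adding failed_zone
    models a data centre outage; adding failed_rack models a rack outage.
    """
    def is_down(p):
        if p.region != failed_region:
            return False
        if failed_zone is not None and p.zone != failed_zone:
            return False
        if failed_rack is not None and p.rack != failed_rack:
            return False
        return True

    return [p for p in instances if not is_down(p)]


# Availability-set style: same region and data centre, different racks.
availability_set = [
    Placement("region-a", "zone-1", "rack-1"),
    Placement("region-a", "zone-1", "rack-2"),
]

# Zone-spread style: same region, different data centres.
zonal = [
    Placement("region-a", "zone-1", "rack-1"),
    Placement("region-a", "zone-2", "rack-1"),
]

rack_outage = dict(failed_region="region-a", failed_zone="zone-1", failed_rack="rack-1")
zone_outage = dict(failed_region="region-a", failed_zone="zone-1")

print(len(survivors(availability_set, **rack_outage)))  # 1
print(len(survivors(availability_set, **zone_outage)))  # 0
print(len(survivors(zonal, **zone_outage)))             # 1
print(len(survivors(zonal, failed_region="region-a")))  # 0
```

The availability-set layout survives the rack outage but not the data centre outage. The zonal layout survives the data centre outage but not the region outage, which is why each extra level of protection needs a wider spread.
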
Compute HA comparison
| Factor | Availability Sets | Availability Zones | Multi-Region |
|---|---|---|---|
| Protects against | Rack/hardware failure, planned maintenance | Data centre failure | Region-wide failure |
| SLA (VMs) | 99.95% | 99.99% | 99.99%+ (architecture dependent) |
| Latency between instances | Sub-millisecond (same data centre) | ~2 ms (same region, different DC) | Variable (cross-region, 10-100+ ms) |
| Data replication | N/A (compute only) | No inherent compute-state replication; depends on the storage/data layer (ZRS disks, app-level replication) | Asynchronous (cross-region, app/storage dependent) |
| Cost | No extra cost (just placement) | No extra cost for zones | Double compute + storage + networking |
| Complexity | Low | Low-medium | High |
| Best for | Legacy apps, regions without zones | Standard production workloads | Mission-critical, global applications |
Exam tip: Availability Zones are the default answer for most scenarios
If the exam says "high availability" without mentioning regional DR, Availability Zones is almost always the correct answer. They provide a 99.99% SLA with minimal complexity and no extra cost. Availability Sets (99.95%) are for regions that don't support zones or for legacy configurations. Multi-region is only needed when the scenario explicitly requires protection against regional failure.
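
The gap between 99.95% and 99.99% is easier to judge as a downtime budget. The sketch below converts an SLA percentage into allowed downtime; the function name and the 365-day period are assumptions chosen for illustration, and actual SLAs may be measured over a different period.

```python
def allowed_downtime_minutes(sla_percent, period_days=365):
    """Maximum downtime, in minutes, that still meets the SLA over the period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


for label, sla in [("Availability Sets", 99.95), ("Availability Zones", 99.99)]:
    minutes = allowed_downtime_minutes(sla)
    print(f"{label}: {sla}% allows about {minutes:.0f} minutes of downtime per year")
```

Over a year, 99.95% works out to roughly 263 minutes of allowed downtime, while 99.99% allows roughly 53 minutes.
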
PaaS high availability
HA is not just for VMs. PaaS services have their own HA patterns:
| Service | HA Mechanism | SLA |
|---|---|---|
| App Service | Zone-redundant deployment (3+ instances across zones) | 99.99% |
| Azure Functions | Zone-redundant (Premium/Dedicated plan) | 99.99% |
| AKS | Zone-redundant node pools + system pods | 99.95% for the Kubernetes API server (Standard tier, with zones) |
| Container Apps | Built-in zone redundancy | 99.95% |
| Azure SQL | Zone-redundant configuration (Business Critical/Hyperscale) | 99.995% |
Marcus's HA architecture: NovaSaaS uses zone-redundant everything:
- AKS with node pools spread across 3 zones
- Azure SQL Business Critical with zone-redundant replicas
- Azure Cache for Redis with zone redundancy
- Result: any single data centre can fail without customer impact
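
When a request has to pass through several services, the chain is only as available as the product of its parts. The sketch below multiplies per-service SLAs, using the App Service and Azure SQL figures from the PaaS table as sample inputs. The function name is hypothetical, and treating failures as independent is a simplifying assumption.

```python
import math


def composite_availability(*sla_percents):
    """Availability (in percent) of a chain where every component must be up.

    Assumes failures are independent, which real services only approximate.
    """
    fractions = [p / 100 for p in sla_percents]
    return math.prod(fractions) * 100


# App Service (99.99%) in front of Azure SQL (99.995%), values from the table.
chain = composite_availability(99.99, 99.995)
print(f"{chain:.4f}%")
```

The result, about 99.985%, is lower than either individual SLA, so every added dependency in the request path reduces the availability you can promise.
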
Virtual Machine Scale Sets (VMSS)
VMSS provides auto-scaling with built-in HA:
| Feature | Description |
|---|---|
| Auto-scale | Scale out (add VMs) and in (remove VMs) based on metrics or schedule |
| Zone spreading | Automatically distributes VMs across availability zones |
| Rolling upgrades | Update VMs in batches without downtime |
| Overprovisioning | Create extra VMs during scale-out, delete extras once target is healthy |
| Orchestration modes | Uniform (identical VMs) or Flexible (mixed VM sizes) |
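
Zone spreading matters for capacity planning: if one zone fails, the instances in it are gone until replacements start elsewhere. The sketch below uses a simple round-robin assignment to show the effect. It is an illustration only, not the platform's actual balancing algorithm, and the function names and zone labels are assumptions.

```python
def spread_across_zones(instance_count, zones=("1", "2", "3")):
    """Assign instances to zones round-robin so counts differ by at most one."""
    placement = {zone: 0 for zone in zones}
    for i in range(instance_count):
        placement[zones[i % len(zones)]] += 1
    return placement


def capacity_after_zone_loss(instance_count, zones=("1", "2", "3")):
    """Worst-case number of instances left when the fullest zone fails."""
    placement = spread_across_zones(instance_count, zones)
    return instance_count - max(placement.values())


print(spread_across_zones(7))        # {'1': 3, '2': 2, '3': 2}
print(capacity_after_zone_loss(6))   # 4
```

With 6 instances over 3 zones, losing a zone leaves 4 running, so the remaining two thirds of capacity must be able to carry the load until scale-out catches up.
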
Priya's VMSS design: GlobalTech's web tier:
- Flexible orchestration with zone spreading across 3 availability zones
- Auto-scale rules: scale out at 70% CPU, scale in at 30%, min 6 / max 30 instances
- Health probes on the load balancer: unhealthy VMs are replaced automatically
- Rolling upgrade policy: update 20% at a time, wait for health confirmation
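
The autoscale and rolling upgrade rules in this design can be sketched as two small functions. This is plain Python for reasoning about the numbers, not the Azure autoscale engine. The step size of one instance and the inclusive threshold comparisons are assumptions, since the design does not state them.

```python
def desired_instances(current, avg_cpu_percent, minimum=6, maximum=30, step=1):
    """Apply the 70% scale-out and 30% scale-in thresholds, clamped to min/max.

    The step size of 1 is an assumption; the design does not state one.
    """
    if avg_cpu_percent >= 70:
        target = current + step
    elif avg_cpu_percent <= 30:
        target = current - step
    else:
        target = current
    return max(minimum, min(maximum, target))


def rolling_batches(instance_ids, batch_percent=20):
    """Split instances into upgrade batches of at most batch_percent each."""
    batch_size = max(1, len(instance_ids) * batch_percent // 100)
    return [instance_ids[i:i + batch_size]
            for i in range(0, len(instance_ids), batch_size)]


print(desired_instances(10, 85))   # 11
print(desired_instances(6, 12))    # 6 (held at the minimum)
print(desired_instances(30, 95))   # 30 (held at the maximum)
print([len(b) for b in rolling_batches(list(range(10)))])  # [2, 2, 2, 2, 2]
```

The gap between the 70% and 30% thresholds keeps the scale set from flapping, and the 20% batch size means at least 80% of instances stay on a known-good version while each batch is checked.
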
Multi-region HA patterns
| Pattern | Description | RTO | Cost |
|---|---|---|---|
| Active-Active | Both regions serve traffic simultaneously | ~0 (seamless) | Highest (full capacity both regions) |
| Active-Passive (hot standby) | Secondary region running but not serving traffic | Minutes | High (idle compute in secondary) |
| Active-Passive (warm standby) | Secondary has reduced capacity, scales up on failover | 10-30 minutes | Medium (reduced compute in secondary) |
| Active-Passive (cold standby) | Secondary deployed but VMs stopped (deallocated) | Hours | Lowest (storage only until failover) |
Elena's multi-region design: Trading platform uses Active-Active (both regions handle trades). Customer portal uses Active-Passive hot standby (secondary ready to take over in minutes).
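
Choosing between these patterns usually comes down to picking the cheapest one that still meets the required RTO. The sketch below encodes that trade-off. The numeric RTO bounds are illustrative readings of the table (for example, "Minutes" taken as 10 and "Hours" as 240), and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    name: str
    worst_case_rto_minutes: float  # assumed upper bound, not an official figure
    cost_rank: int                 # 1 = cheapest, 4 = most expensive


# RTO bounds are illustrative readings of the table, not published values.
PATTERNS = [
    Pattern("Active-Passive (cold standby)", 240, 1),
    Pattern("Active-Passive (warm standby)", 30, 2),
    Pattern("Active-Passive (hot standby)", 10, 3),
    Pattern("Active-Active", 0, 4),
]


def cheapest_pattern(required_rto_minutes):
    """Return the lowest-cost pattern whose assumed RTO meets the requirement."""
    candidates = [p for p in PATTERNS
                  if p.worst_case_rto_minutes <= required_rto_minutes]
    return min(candidates, key=lambda p: p.cost_rank)


print(cheapest_pattern(0).name)     # Active-Active
print(cheapest_pattern(15).name)    # Active-Passive (hot standby)
print(cheapest_pattern(60).name)    # Active-Passive (warm standby)
```

This mirrors the design above: a trading platform that cannot tolerate interruption lands on Active-Active, while a portal that can wait a few minutes lands on hot standby.
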
Knowledge check
GlobalTech needs their customer-facing web application to survive a data centre failure with 99.99% availability. They want automatic scaling during peak hours. The application is stateless. What should Priya recommend?
NovaSaaS needs their API tier to survive a full Azure region outage with automatic failover in under 60 seconds. The APIs are stateless and deployed to two regions. Users connect via a custom domain with HTTPS. Which load balancing solution should Marcus use?
Next up: Compute is highly available. Now let's do the same for data: High Availability for Data.