High Availability for Data
SQL Always On, failover groups, Cosmos DB multi-region, and geo-redundant storage: design data architectures that survive failures without losing a single transaction.
Designing data high availability
Compute HA keeps your app running. Data HA keeps your data accessible. They're separate concerns: your VMs might survive a failure, but if the database is down, the app is useless.
Key patterns: Zone-redundant databases (survive data centre failure), failover groups (automatic regional failover for SQL), Cosmos DB multi-region writes (globally distributed, always available), and geo-redundant storage (data replicated to paired region).
Relational data HA
Azure SQL HA patterns
| Option | Scope | Failover | RPO | Best For |
|---|---|---|---|---|
| Zone-redundant config | Within-region (across zones) | Automatic | 0 (synchronous) | Data centre failure protection (standard HA) |
| Active geo-replication | Cross-region | Manual | ~5 seconds (async) | Read offloading + manual DR |
| Auto-failover groups | Cross-region | Automatic | ~5 seconds (async) | Automatic regional DR with endpoint redirection |
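
As a rough way to read the table, the sketch below encodes the three options and filters them by scope and failover behaviour. The class and function names are illustrative only, and the RPO figures are the approximate values from the table, not service guarantees.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SqlHaOption:
    name: str
    cross_region: bool
    automatic_failover: bool
    approx_rpo_seconds: int  # approximate value taken from the table


OPTIONS = [
    SqlHaOption("Zone-redundant config", False, True, 0),
    SqlHaOption("Active geo-replication", True, False, 5),
    SqlHaOption("Auto-failover groups", True, True, 5),
]


def matching_options(cross_region: bool, automatic_failover: bool) -> list[str]:
    """Return options with the requested scope.

    When automatic failover is required, manual-only options are dropped.
    """
    return [
        option.name
        for option in OPTIONS
        if option.cross_region == cross_region
        and (option.automatic_failover or not automatic_failover)
    ]


print(matching_options(cross_region=True, automatic_failover=True))
# prints ['Auto-failover groups']
```

Asking for cross-region scope without requiring automatic failover returns both cross-region options, which is where the comparison in the exam tip becomes relevant.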
Auto-failover groups deep dive
Auto-failover groups are the recommended pattern for regional SQL DR:
| Feature | Detail |
|---|---|
| Automatic failover | Detects a region outage and fails over once the grace period has expired |
| Grace period | Configurable (default 1 hour); prevents false positives |
| Read-write endpoint | <group-name>.database.windows.net (always points to the primary) |
| Read-only endpoint | <group-name>.secondary.database.windows.net (always points to the secondary) |
| Application impact | Connection strings don't change; endpoints redirect automatically |
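
To make the endpoint rows concrete, here is a minimal sketch that builds both listener hostnames from a failover group name and places them in a connection string. The group name `fg-example`, the database name, and the connection string layout are assumptions for illustration; the exact keywords depend on the driver you use.

```python
def failover_group_hosts(group_name: str) -> dict[str, str]:
    """Build the two listener hostnames described in the table."""
    return {
        "read_write": f"{group_name}.database.windows.net",
        "read_only": f"{group_name}.secondary.database.windows.net",
    }


def connection_string(group_name: str, database: str, read_only: bool = False) -> str:
    """Illustrative connection string; the keyword layout is an assumption."""
    hosts = failover_group_hosts(group_name)
    host = hosts["read_only"] if read_only else hosts["read_write"]
    intent = "ReadOnly" if read_only else "ReadWrite"
    return f"Server=tcp:{host},1433;Database={database};ApplicationIntent={intent};"


# Hypothetical names, used only to show the shape of the result.
print(connection_string("fg-example", "orders"))
print(connection_string("fg-example", "orders", read_only=True))
```

Because the application only ever knows the group name, the same strings remain valid after a failover swaps the primary and secondary roles.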
Elena's SQL HA design:
- Zone-redundant Business Critical for within-region HA (survives data centre failure)
- Auto-failover group to the paired region (survives regional outage)
- Grace period: 1 hour (the default; the setting is specified in whole hours), long enough to avoid false triggers
- Application uses the failover group endpoint, so no connection string changes are needed during failover
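
The role of the grace period in a design like this can be sketched as a simple decision rule. This is a simplified model of the behaviour described above, not how the service is implemented, and the function name is hypothetical.

```python
from datetime import timedelta

# Default taken from the feature table; treat it as a tunable assumption.
DEFAULT_GRACE_PERIOD = timedelta(hours=1)


def should_fail_over(
    outage_duration: timedelta,
    grace_period: timedelta = DEFAULT_GRACE_PERIOD,
) -> bool:
    """Fail over only once the outage has lasted at least the grace period."""
    return outage_duration >= grace_period


print(should_fail_over(timedelta(minutes=10)))  # prints False
print(should_fail_over(timedelta(minutes=75)))  # prints True
```

A shorter grace period reduces downtime in a real regional outage but raises the chance of failing over for a transient blip, which is the tradeoff the design has to balance.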
Exam tip: Auto-failover groups vs active geo-replication
Both replicate across regions, but auto-failover groups are preferred because:
- Automatic failover (geo-replication requires manual or app-level failover)
- Endpoint redirection (app connection strings don't change)
- Group multiple databases (failover all databases together, not one at a time)
Choose active geo-replication only when you need more than one secondary. Region choice is not a deciding factor: failover groups can target any region, although paired regions are recommended for operational reasons.
Cosmos DB multi-region HA
Cosmos DB is built for global distribution:
| Configuration | Write Regions | Read Regions | Consistency | SLA |
|---|---|---|---|---|
| Single-region | 1 | 1 | Any | 99.99% |
| Multi-region (single write) | 1 | 1-30+ | Any | 99.999% (reads) |
| Multi-region (multi-write) | 2+ | All write regions | Session, Consistent Prefix, Eventual | 99.999% (reads + writes) |
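
To put the SLA column in perspective, the short calculation below converts an availability percentage into the downtime it permits per year. It is plain arithmetic on the figures in the table; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def max_downtime_minutes_per_year(sla_percent: float) -> float:
    """Downtime allowed per year at a given availability percentage."""
    return MINUTES_PER_YEAR * (100 - sla_percent) / 100


for sla in (99.99, 99.999):
    print(f"{sla}% -> {max_downtime_minutes_per_year(sla):.2f} minutes/year")
# prints:
# 99.99% -> 52.56 minutes/year
# 99.999% -> 5.26 minutes/year
```

The extra nine cuts the allowed downtime by a factor of ten, from just under an hour a year to a little over five minutes.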
Marcus's Cosmos DB HA: NovaSaaS operates globally:
- Multi-region writes in 3 regions (US East, West Europe, Southeast Asia)
- Session consistency: each user sees their own writes
- Automatic failover enabled: if a region goes down, Cosmos DB promotes another region
- Conflict resolution: Last-writer-wins for most containers, custom merge for shopping carts
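
The two conflict resolution approaches in the last bullet can be sketched in plain Python. Last-writer-wins in Cosmos DB compares a numeric property (the system timestamp `_ts` by default) and keeps the version with the highest value. The cart merge rule shown here, keeping the largest quantity seen for each item, is a hypothetical policy chosen for illustration; in the service, a custom policy would be implemented as a merge stored procedure.

```python
def last_writer_wins(versions: list[dict], path: str = "_ts") -> dict:
    """Keep the conflicting version with the highest value at `path`."""
    return max(versions, key=lambda doc: doc[path])


def merge_carts(versions: list[dict]) -> dict:
    """Hypothetical custom merge: keep every item, with the largest quantity seen."""
    merged_items: dict[str, int] = {}
    for doc in versions:
        for sku, quantity in doc["items"].items():
            merged_items[sku] = max(merged_items.get(sku, 0), quantity)
    return {"id": versions[0]["id"], "items": merged_items}


# Two regions wrote to the same cart concurrently (sample data).
us_write = {"id": "cart-1", "_ts": 1700000010, "items": {"book": 1}}
eu_write = {"id": "cart-1", "_ts": 1700000020, "items": {"pen": 2}}

print(last_writer_wins([us_write, eu_write])["items"])  # prints {'pen': 2}
print(merge_carts([us_write, eu_write])["items"])       # prints {'book': 1, 'pen': 2}
```

Last-writer-wins silently drops the book from the cart, which is why a merge policy suits containers where losing either write is unacceptable.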
Design decision: Multi-write tradeoffs
Multi-region writes give the highest availability (99.999% write SLA) but:
- Limits consistency: Strong is NOT available with multi-write, and Bounded Staleness is discouraged there (Microsoft describes it as an anti-pattern for multi-write accounts)
- Requires conflict resolution: Concurrent writes to same item in different regions need a resolution policy
- Higher cost: RU charges in every write region
Recommendation: Use multi-write for global apps where latency matters and eventual/session consistency is acceptable. Use single-write + multi-read for apps needing stronger consistency.
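
The recommendation can be written down as a small decision function. This is one reading of the guidance above, with hypothetical names; a real design would also weigh cost and conflict handling.

```python
# Levels treated as "stronger consistency" for the purposes of this sketch.
STRONGER_LEVELS = {"Strong", "Bounded Staleness"}


def recommend_write_mode(consistency: str, needs_low_latency_global_writes: bool) -> str:
    """Map the requirements onto the two configurations discussed above."""
    if consistency in STRONGER_LEVELS:
        # Stronger consistency points to a single write region.
        return "single-write + multi-read"
    if needs_low_latency_global_writes:
        return "multi-write"
    # Without a global write-latency requirement, the simpler setup is enough.
    return "single-write + multi-read"


print(recommend_write_mode("Session", True))  # prints multi-write
print(recommend_write_mode("Strong", True))   # prints single-write + multi-read
```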
Storage HA
| Redundancy | Within-Region HA | Cross-Region DR | Read from Secondary |
|---|---|---|---|
| LRS | No (single DC) | No | No |
| ZRS | Yes (3 zones) | No | No |
| GRS | No (single DC) | Yes (paired region) | No (failover required) |
| GZRS | Yes (3 zones) | Yes (paired region) | No (failover required) |
| RA-GRS | No (single DC) | Yes (paired region) | Yes (read-only secondary) |
| RA-GZRS | Yes (3 zones) | Yes (paired region) | Yes (read-only secondary) |
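
The table can also be read as a lookup from requirements to a redundancy option. The sketch below encodes the six rows and returns the option whose capabilities match exactly; the function name is illustrative.

```python
from typing import Optional

# (zone_redundant, geo_redundant, readable_secondary), copied from the table
REDUNDANCY = {
    "LRS": (False, False, False),
    "ZRS": (True, False, False),
    "GRS": (False, True, False),
    "GZRS": (True, True, False),
    "RA-GRS": (False, True, True),
    "RA-GZRS": (True, True, True),
}


def pick_redundancy(
    zone_redundant: bool, geo_redundant: bool, readable_secondary: bool
) -> Optional[str]:
    """Return the option matching all three requirements, or None if none does."""
    wanted = (zone_redundant, geo_redundant, readable_secondary)
    for name, capabilities in REDUNDANCY.items():
        if capabilities == wanted:
            return name
    return None


print(pick_redundancy(True, True, True))    # prints RA-GZRS
print(pick_redundancy(True, False, False))  # prints ZRS
```

Asking for a readable secondary without geo-redundancy returns `None`, since there is no secondary region to read from in that case.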
Priya's storage HA: GlobalTech uses RA-GZRS for critical data:
- Zone redundancy in primary region: survives data centre failure
- Geo-redundancy: data replicated to paired region
- Read-access secondary: applications can read from secondary endpoint for resilience
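
Reading from the secondary endpoint can be sketched as a fallback around any fetch function. With read-access redundancy, the secondary blob endpoint is the account name with a `-secondary` suffix. The `fetch` callable, the account name, and the blob path below are placeholders, and the error handling is deliberately minimal.

```python
from typing import Callable


def blob_endpoints(account: str) -> tuple[str, str]:
    """Primary and read-only secondary blob endpoints for an RA-GRS/RA-GZRS account."""
    return (
        f"https://{account}.blob.core.windows.net",
        f"https://{account}-secondary.blob.core.windows.net",
    )


def read_with_fallback(account: str, path: str, fetch: Callable[[str], bytes]) -> bytes:
    """Try the primary endpoint first; on a connection error, read from the secondary."""
    primary, secondary = blob_endpoints(account)
    try:
        return fetch(f"{primary}/{path}")
    except OSError:  # ConnectionError and timeouts are OSError subclasses
        return fetch(f"{secondary}/{path}")


def fake_fetch(url: str) -> bytes:
    """Stand-in for a real HTTP client: simulates a primary region outage."""
    if "-secondary" not in url:
        raise ConnectionError("primary unreachable")
    return b"report contents"


print(read_with_fallback("globaltechdata", "reports/q1.csv", fake_fetch))
# prints b'report contents'
```

Because geo-replication is asynchronous, data read from the secondary can lag slightly behind the primary, so this pattern fits reads that tolerate slightly stale results.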
Knowledge check
Elena needs Azure SQL to automatically fail over to a secondary region if the primary region has an outage. The application should not need connection string changes. Multiple databases must fail over together. Which feature should she recommend?
NovaSaaS needs Cosmos DB to be writable from 3 regions simultaneously for low-latency global access. They can accept session-level consistency. What configuration should Marcus recommend?
Domain 3 complete! You've designed recovery objectives, backup strategies, and high availability for compute and data.
Next up: we design the infrastructure itself, starting with Compute Design: VMs & When to Use Them.