High Availability for Data

Designing data high availability

Simple explanation

Compute HA keeps your app running. Data HA keeps your data accessible. They’re separate concerns — your VMs might survive a failure, but if the database is down, the app is useless.

Key patterns: zone-redundant databases (survive a data centre failure), failover groups (automatic regional failover for SQL), Cosmos DB multi-region writes (globally distributed, with a 99.999% SLA for reads and writes), and geo-redundant storage (data replicated to the paired region).

Relational data HA

Azure SQL HA patterns

Azure SQL High Availability Options
Option	Scope	Failover	RPO	Best For
Zone-redundant config	Within-region (across zones)	Automatic	0 (synchronous)	Data centre failure protection — standard HA
Active geo-replication	Cross-region	Manual	~5 seconds (async)	Read offloading + manual DR
Auto-failover groups	Cross-region	Automatic	~5 seconds (async)	Automatic regional DR with endpoint redirection
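
The table can be read as a small decision rule. The Python sketch below encodes it; the function name `choose_sql_ha_option` and its two boolean parameters are hypothetical, chosen only for illustration, and the returned strings are the option names from the table.

```python
def choose_sql_ha_option(cross_region: bool, automatic_failover: bool) -> str:
    """Map two requirements onto the options in the table (illustrative only)."""
    if not cross_region:
        # Zone failure inside one region: synchronous replicas, RPO 0.
        return "Zone-redundant config"
    if automatic_failover:
        # Regional outage with automatic promotion and endpoint redirection.
        return "Auto-failover groups"
    # Regional protection where an operator or the application triggers failover.
    return "Active geo-replication"


print(choose_sql_ha_option(cross_region=True, automatic_failover=True))
```

The within-region and cross-region options are not mutually exclusive: a design can combine a zone-redundant configuration with a failover group.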

Auto-failover groups deep dive

Auto-failover groups are the recommended pattern for regional SQL DR:

Feature	Detail
Automatic failover	Detects a regional outage and initiates failover once the grace period has expired
Grace period	Configurable (default 1 hour); delays a failover that could lose data, so a short-lived outage does not trigger one
Read-write endpoint	`<group-name>.database.windows.net` — always points to primary
Read-only endpoint	`<group-name>.secondary.database.windows.net` — always points to secondary
Application impact	Connection strings don’t change — endpoints redirect automatically
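
As a rough illustration of the two listener endpoints, the Python sketch below builds one read-write and one read-only connection string. The function name, the ODBC driver name, and the group and database names are assumptions made for the example; authentication settings are deliberately left out.

```python
def failover_group_connection_strings(group_name: str, database: str) -> dict:
    """Build connection strings that target the failover group listeners."""
    read_write_host = f"{group_name}.database.windows.net"
    read_only_host = f"{group_name}.secondary.database.windows.net"
    # Assumed client setup: ODBC Driver 18 with encryption on.
    base = "Driver={ODBC Driver 18 for SQL Server};Database=" + database + ";Encrypt=yes;"
    return {
        # Follows the current primary, so it stays valid after a failover.
        "read_write": f"{base}Server=tcp:{read_write_host},1433;",
        # Routes read-only work to the secondary.
        "read_only": f"{base}Server=tcp:{read_only_host},1433;ApplicationIntent=ReadOnly;",
    }


strings = failover_group_connection_strings("fg-example", "orders")
print(strings["read_write"])
print(strings["read_only"])
```

Because both strings name the group rather than an individual server, neither needs editing when the primary and secondary swap roles.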

🏦 Elena’s SQL HA design:

Zone-redundant Business Critical for within-region HA (survives data centre failure)
Auto-failover group to the paired region (survives regional outage)
Grace period: 1 hour (the default, and the shortest value the setting accepts), long enough to avoid failing over on a transient outage
Application uses the failover group endpoint — no connection string changes during failover

Exam tip: Auto-failover groups vs active geo-replication

Both replicate across regions, but auto-failover groups are preferred because:

Automatic failover (geo-replication requires manual or app-level failover)
Endpoint redirection (app connection strings don’t change)
Group multiple databases (failover all databases together, not one at a time)

Choose active geo-replication only when you need more than one secondary (it supports up to four readable secondaries per database). Region choice is not a deciding factor: both features can target any region, although paired regions are recommended for operational reasons.

Cosmos DB multi-region HA

Cosmos DB is built for global distribution:

Configuration	Write Regions	Read Regions	Consistency	SLA
Single-region	1	1	Any	99.99%
Multi-region (single write)	1	1-30+	Any	99.999% (reads)
Multi-region (multi-write)	2+	All write regions	Any except Strong (Bounded Staleness is discouraged)	99.999% (reads + writes)
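
To make the SLA column concrete, the short Python calculation below converts an availability percentage into the downtime it allows per year. The helper name is hypothetical, and the figures assume a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def max_downtime_minutes_per_year(sla_percent: float) -> float:
    """Downtime budget implied by an availability SLA."""
    return MINUTES_PER_YEAR * (1 - sla_percent / 100)


for sla in (99.99, 99.999):
    print(f"{sla}% -> {max_downtime_minutes_per_year(sla):.2f} minutes/year")
# 99.99% -> 52.56 minutes/year
# 99.999% -> 5.26 minutes/year
```

Moving from four nines to five nines cuts the yearly downtime budget from roughly 53 minutes to roughly 5 minutes.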

🚀 Marcus’s Cosmos DB HA: NovaSaaS operates globally:

Multi-region writes in 3 regions (US East, West Europe, Southeast Asia)
Session consistency — each user sees their own writes
Automatic failover enabled — if a region goes down, Cosmos DB promotes another region
Conflict resolution: Last-writer-wins for most containers, custom merge for shopping carts
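
The two conflict resolution approaches in Marcus's design can be simulated locally. The Python sketch below does not call Cosmos DB; the `Version` class, the integer timestamp, and the tie-breaking rule are assumptions made for illustration. In the service itself, last-writer-wins compares a numeric property (the system timestamp `_ts` by default), and a custom policy is typically implemented as a merge stored procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    region: str
    timestamp: int   # stands in for the conflict resolution property
    cart: frozenset  # item IDs in the shopping cart


def last_writer_wins(a: Version, b: Version) -> Version:
    """Keep the write with the higher timestamp; the other write is discarded."""
    # Tie-break in favour of `a` is an arbitrary choice for this sketch.
    return a if a.timestamp >= b.timestamp else b


def merge_carts(a: Version, b: Version) -> Version:
    """Custom merge: keep every item that was added in either region."""
    newest = last_writer_wins(a, b)
    return Version(newest.region, newest.timestamp, a.cart | b.cart)


us = Version("East US", 100, frozenset({"book"}))
eu = Version("West Europe", 101, frozenset({"lamp"}))

print(sorted(last_writer_wins(us, eu).cart))  # ['lamp']
print(sorted(merge_carts(us, eu).cart))       # ['book', 'lamp']
```

Last-writer-wins silently drops the book that was added in East US, which is why a shopping cart is a natural candidate for a custom merge.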

Design decision: Multi-write tradeoffs

Multi-region writes give the highest availability (99.999% write SLA) but:

Limits consistency: Strong is NOT available with multi-write, and Bounded Staleness is discouraged because it depends on cross-region replication lag
Requires conflict resolution: Concurrent writes to same item in different regions need a resolution policy
Higher cost: RU charges in every write region

Recommendation: Use multi-write for global apps where latency matters and eventual/session consistency is acceptable. Use single-write + multi-read for apps needing stronger consistency.

Storage HA

Redundancy	Within-Region HA	Cross-Region DR	Read from Secondary
LRS	No (single DC)	No	No
ZRS	Yes (3 zones)	No	No
GRS	No (single DC)	Yes (paired region)	No (failover required)
GZRS	Yes (3 zones)	Yes (paired region)	No (failover required)
RA-GRS	No (single DC)	Yes (paired region)	Yes (read-only secondary)
RA-GZRS	Yes (3 zones)	Yes (paired region)	Yes (read-only secondary)
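
The table reduces to three yes/no requirements. The Python sketch below maps them onto a redundancy option; the function name and parameters are hypothetical and exist only to illustrate how the rows relate to each other.

```python
def choose_redundancy(zone_ha: bool, geo_dr: bool, read_secondary: bool = False) -> str:
    """Pick the storage redundancy option that matches the stated requirements."""
    if read_secondary and not geo_dr:
        # Only geo-redundant options have a secondary region to read from.
        raise ValueError("read access to a secondary requires geo-redundancy")
    if geo_dr:
        option = "GZRS" if zone_ha else "GRS"
        return f"RA-{option}" if read_secondary else option
    return "ZRS" if zone_ha else "LRS"


print(choose_redundancy(zone_ha=True, geo_dr=True, read_secondary=True))  # RA-GZRS
```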

🏗️ Priya’s storage HA: GlobalTech uses RA-GZRS for critical data:

Zone redundancy in primary region — survives data centre failure
Geo-redundancy — data replicated to paired region
Read-access secondary — applications can read from secondary endpoint for resilience
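
With read access enabled, the secondary is reachable at the account name with a `-secondary` suffix. The Python sketch below shows one way an application might fall back to it for reads. The `fetch` callable, the account name, and the blob path are hypothetical, and the fallback logic is hand-rolled for illustration rather than taken from a storage client library.

```python
from typing import Callable


def blob_endpoints(account: str) -> tuple[str, str]:
    """Return the primary and read-only secondary blob endpoints."""
    primary = f"https://{account}.blob.core.windows.net"
    secondary = f"https://{account}-secondary.blob.core.windows.net"
    return primary, secondary


def read_with_fallback(account: str, blob_path: str, fetch: Callable[[str], bytes]) -> bytes:
    """Try the primary first; on a connection-level error, read from the secondary."""
    primary, secondary = blob_endpoints(account)
    try:
        return fetch(f"{primary}/{blob_path}")
    except OSError:
        # Geo-replication is asynchronous, so the secondary copy may lag the primary.
        return fetch(f"{secondary}/{blob_path}")


def fake_fetch(url: str) -> bytes:
    """Stand-in for an HTTP client: the primary is treated as unreachable."""
    if "-secondary" not in url:
        raise ConnectionError("primary unavailable")
    return b"report contents"


print(read_with_fallback("globaltechdata", "reports/q1.csv", fake_fetch))
```

The secondary accepts reads only, so writes still depend on the primary region or on a failover.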

Knowledge check

Question

What's the advantage of SQL auto-failover groups over active geo-replication?

Answer

Three advantages: (1) Automatic failover detection and promotion, (2) Endpoint redirection — applications use a single connection string that always points to the active primary, (3) Group multiple databases — failover all databases together as a unit.

Question

What consistency levels are available with Cosmos DB multi-region writes?

Answer

Session, Consistent Prefix, and Eventual are the practical choices. Strong is NOT available with multi-write because synchronous cross-region consistency would negate the low-latency benefit of local writes. Bounded Staleness can be selected but is discouraged in multi-write accounts.

Question

What does RA-GZRS provide that GZRS doesn't?

Answer

Read access to the secondary region endpoint WITHOUT requiring a failover. RA-GZRS = Zone-redundant (primary) + Geo-redundant (secondary) + Read Access (secondary). Applications can read from the secondary for resilience even when the primary is healthy.

Knowledge Check

🏦 Elena needs Azure SQL to automatically fail over to a secondary region if the primary region has an outage. The application should not need connection string changes. Multiple databases must fail over together. Which feature should she recommend?

🚀 NovaSaaS needs Cosmos DB to be writable from 3 regions simultaneously for low-latency global access. They can accept session-level consistency. What configuration should Marcus recommend?

Domain 3 complete! You’ve designed recovery objectives, backup strategies, and high availability for compute and data.
