Recovery Objectives: RPO, RTO & SLA
Before designing backup or DR, you need to know the targets. RPO, RTO, and SLA are the numbers that drive every business continuity architecture decision.
Understanding recovery objectives
Imagine your house floods. Two questions matter: how much stuff can you afford to lose (RPO, Recovery Point Objective), and how quickly do you need to be back in a liveable house (RTO, Recovery Time Objective)?
RPO = how much data loss is acceptable. RPO of 1 hour means you can lose up to 1 hour of data. RPO of 0 means zero data loss.
RTO = how long downtime is acceptable. RTO of 4 hours means the service must be back within 4 hours. RTO of 0 means no downtime.
SLA = the availability percentage Azure commits to, backed by service credits if it is missed. 99.99% allows about 4.38 minutes of downtime per month.
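The conversion from an availability percentage to a downtime budget is simple arithmetic. The sketch below is illustrative only: the function name is hypothetical, and it assumes an average month of 730 hours (43,800 minutes), which is the basis that yields the 4.38-minute figure.

```python
# Assumption: an "average" month of 730 hours (43,800 minutes).
MINUTES_PER_MONTH = 730 * 60


def allowed_downtime_minutes(sla_percent: float,
                             period_minutes: float = MINUTES_PER_MONTH) -> float:
    """Return the downtime budget implied by an availability percentage."""
    return (1 - sla_percent / 100) * period_minutes


if __name__ == "__main__":
    # Print the monthly downtime budget for a few common SLA levels.
    for sla in (99.9, 99.95, 99.99, 99.999):
        print(f"{sla}% -> {allowed_downtime_minutes(sla):.2f} min/month")
```

Each extra "nine" cuts the downtime budget by a factor of ten, which is why the step from 99.9% to 99.99% is so demanding.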
RPO and RTO in practice
| RPO Target | What It Means | Technology Required | Cost |
|---|---|---|---|
| 0 (zero data loss) | Every transaction must survive a failure | Synchronous replication (same-region AZ) | Highest |
| < 5 minutes | Near-real-time replication | Asynchronous replication (geo-replication) | High |
| < 1 hour | Recent state preserved | Frequent automated backups | Medium |
| < 24 hours | Yesterday’s data preserved | Daily backups | Low |
| < 7 days | Weekly checkpoint | Weekly backups with retention | Lowest |
| RTO Target | What It Means | Technology Required | Cost |
|---|---|---|---|
| 0 (no downtime) | Instant failover, always active | Active-active deployment, multi-region | Highest |
| < 15 minutes | Automated failover | Hot standby, auto-failover groups | High |
| < 4 hours | Manual failover with prepared runbook | Warm standby, Azure Site Recovery | Medium |
| < 24 hours | Restore from backup | Backup + restore procedures, cold standby | Low |
| < 72 hours | Extended recovery acceptable | Archive restore, rebuild from scratch | Lowest |
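Reading these tables amounts to finding the least strict (and therefore cheapest) row that still meets the target. A minimal sketch using the RPO rows above; the function and constant names are hypothetical, and treating each "<" bound as inclusive is a simplifying assumption.

```python
from datetime import timedelta

# Rows copied from the RPO table, ordered strictest (most expensive) first.
RPO_OPTIONS = [
    (timedelta(0), "Synchronous replication (same-region AZ)"),
    (timedelta(minutes=5), "Asynchronous replication (geo-replication)"),
    (timedelta(hours=1), "Frequent automated backups"),
    (timedelta(hours=24), "Daily backups"),
    (timedelta(days=7), "Weekly backups with retention"),
]


def cheapest_option(target: timedelta, options) -> str:
    """Pick the least strict row whose bound is still within the target."""
    chosen = options[0][1]  # fall back to the strictest row
    for bound, technology in options:
        if bound <= target:
            chosen = technology  # a later match is looser, so cheaper
    return chosen


if __name__ == "__main__":
    print(cheapest_option(timedelta(minutes=30), RPO_OPTIONS))
```

The same function works for the RTO table if its rows are supplied in the same strictest-first order.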
🏦 Elena’s recovery tiers: FinSecure Bank classifies workloads into tiers:
| Tier | Workload | RPO | RTO | Solution |
|---|---|---|---|---|
| Tier 1 (Critical) | Trading platform | 0 | 0 | Active-active, synchronous replication |
| Tier 2 (Important) | Customer portal | 5 min | 15 min | Geo-replication, auto-failover |
| Tier 3 (Standard) | Internal reporting | 1 hour | 4 hours | Azure Backup, ASR warm standby |
| Tier 4 (Non-critical) | Dev/test environments | 24 hours | 24 hours | Daily backup, rebuild from IaC |
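Tier assignment can be expressed as a small rule: choose the cheapest tier whose RPO and RTO both fit within what the workload requires. The sketch below encodes the tier table; the `Tier` class and `assign_tier` function are hypothetical names used for illustration, not part of any Azure tooling.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Tier:
    name: str
    rpo: timedelta  # maximum data loss the tier delivers
    rto: timedelta  # maximum downtime the tier delivers


# Values from the FinSecure tier table, listed cheapest first.
TIERS = [
    Tier("Tier 4 (Non-critical)", timedelta(hours=24), timedelta(hours=24)),
    Tier("Tier 3 (Standard)", timedelta(hours=1), timedelta(hours=4)),
    Tier("Tier 2 (Important)", timedelta(minutes=5), timedelta(minutes=15)),
    Tier("Tier 1 (Critical)", timedelta(0), timedelta(0)),
]


def assign_tier(required_rpo: timedelta, required_rto: timedelta) -> Tier:
    """Return the cheapest tier that meets both the RPO and RTO requirement."""
    for tier in TIERS:
        if tier.rpo <= required_rpo and tier.rto <= required_rto:
            return tier
    return TIERS[-1]  # Tier 1 (0/0) satisfies any non-negative requirement


if __name__ == "__main__":
    print(assign_tier(timedelta(hours=1), timedelta(hours=4)).name)
```

Searching from the cheapest tier upward builds in the cost discipline: a workload only lands in Tier 1 when nothing cheaper satisfies it.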
Exam tip: Not everything needs Tier 1 protection
A common exam trap: designing maximum protection for everything. The correct answer weighs business impact against cost. If the scenario says “internal reporting dashboard”, it doesn’t need active-active multi-region deployment. Match the investment to the impact of downtime.
Composite SLA calculation
When a request depends on several Azure services and every one of them must be up, the composite SLA is the product of the individual SLAs:
| Service | Individual SLA |
|---|---|
| Azure App Service | 99.95% |
| Azure SQL Database | 99.99% |
| Azure Storage (RA-GRS, read requests) | 99.99% |
Composite SLA = 0.9995 × 0.9999 × 0.9999 ≈ 0.9993, or 99.93% (about 31 minutes of downtime per month)
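The multiplication can be checked in a few lines. This is a sketch under stated assumptions: the SLA figures are the ones quoted in the table, the services are treated as a serial chain with independent failures, and the 730-hour month is an assumed basis for the downtime figure.

```python
from math import prod

# SLA figures from the table, expressed as fractions.
slas = {
    "Azure App Service": 0.9995,
    "Azure SQL Database": 0.9999,
    "Azure Storage": 0.9999,
}


def composite_sla(values) -> float:
    """Multiply the availabilities of services that must all be up."""
    return prod(values)


composite = composite_sla(slas.values())
downtime = (1 - composite) * 730 * 60  # assumption: 730-hour month

if __name__ == "__main__":
    print(f"Composite SLA: {composite:.4%}")      # 99.9300%
    print(f"Downtime: {downtime:.1f} min/month")  # 30.7
```

Because every factor is below 1, adding another dependency to the chain can only lower the composite figure.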
Improving composite SLA
| Technique | How It Helps |
|---|---|
| Availability Zones | Survives data centre failure — increases individual service SLA |
| Multi-region deployment | Survives regional failure — can achieve 99.999%+ |
| Redundant paths | If component A fails, component B handles traffic |
| Queue-based decoupling | Services communicate asynchronously — one failure doesn’t cascade |
🏗️ Priya’s SLA design: GlobalTech needs 99.99% for their customer portal. A single-region deployment gives 99.93%. Priya added:
- Multi-region App Service with Traffic Manager failover
- SQL auto-failover group (secondary in paired region)
Result: individual failures don’t cause customer-visible downtime.
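Redundancy improves availability because the service is only down when every copy is down at once. The sketch below applies that idea to two regions, each at the 99.93% single-region figure. It assumes region failures are independent and that failover is instant and never fails; in practice the traffic router's own SLA multiplies in as another serial factor, and failover time counts as downtime. The function name is hypothetical.

```python
def redundant_availability(single: float, copies: int) -> float:
    """Availability when any one of `copies` independent deployments suffices."""
    return 1 - (1 - single) ** copies


single_region = 0.9993  # single-region composite SLA from the text
two_regions = redundant_availability(single_region, 2)

if __name__ == "__main__":
    print(f"Two regions: {two_regions:.6%}")
```

The idealised two-region figure is far above 99.99%, which leaves headroom for the routing layer and for real-world failover delays.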
Well-Architected Framework connection
The Reliability pillar centres on meeting recovery objectives:
- Design for failure at every layer
- Quantify RPO/RTO/SLA before choosing technologies
- Test recovery procedures regularly (failover drills, chaos engineering)
- Balance availability investment with business value
Cost Optimisation: Every additional “nine” of availability costs significantly more. 99.9% → 99.99% might double your infrastructure cost. The architect must justify each level.
Backup vs DR vs HA — the three pillars
| Concept | Purpose | Scope | Example |
|---|---|---|---|
| Backup | Recover from data loss or corruption | Data recovery (RPO) | Azure Backup restoring a VM from yesterday |
| Disaster Recovery (DR) | Resume operations after regional/site failure | Service recovery (RPO + RTO) | Azure Site Recovery failing over to secondary region |
| High Availability (HA) | Prevent downtime from component failures | Continuous operation (uptime) | Availability Zones — app survives data centre failure |
Critical distinction: Backup is NOT DR. DR is NOT HA. A backup protects your data but doesn’t keep the service running. DR gets you running again after a failure. HA prevents the failure from being visible to users. You need all three — designed to match your recovery objectives.
Knowledge check
🏦 FinSecure Bank's trading platform requires zero data loss and zero downtime. Their internal reporting dashboard can tolerate up to 4 hours of downtime and 1 hour of data loss. What recovery tiers should Elena assign?
🏗️ Priya's customer portal uses App Service (99.95%), Azure SQL Database (99.99%), and Blob Storage (99.99%). The business requires 99.99% availability. The current composite SLA is 99.93%. What should Priya add to meet the target?
Next up: recovery targets are set, so now let’s design the backup solution in Backup & Recovery for Compute.