Safe Rollouts: Slots, Dependencies & Hotfix Paths
Ensure reliable deployments with dependency ordering, deployment slot swaps, hotfix planning, and resiliency strategies. Minimise downtime with load balancing and rolling updates.
Why Safe Rollouts Require Planning
Think of moving house.
You cannot set up the TV before the power is connected. You cannot unpack kitchen boxes before the shelves are assembled. There is a natural order: electricity first, then furniture, then electronics. If you do it out of order, things break or you waste time redoing work.
Safe rollouts follow the same principle. Deploy the database changes before the API that needs them. Deploy the API before the frontend that calls it. Get the order wrong, and users see errors. Get it right, and nobody notices you shipped anything at all.
Dependency Deployment Ordering
When your application has multiple tiers (database, API, frontend, background workers), deployment order matters. The golden rule: deploy bottom-up, with infrastructure and data layers first and presentation layers last.
The Deployment Order
1. Database schema changes (expand phase)
2. Background services / workers
3. Backend APIs
4. API gateways / BFF layers
5. Frontend applications
6. Database cleanup (contract phase, after old code is fully retired)
The Expand-Contract Pattern
The expand-contract pattern (also called parallel change) ensures backward compatibility during multi-service deployments:
Expand phase:
- Add the new database column (nullable or with default value)
- Deploy new API version that writes to BOTH old and new columns
- Old and new API versions coexist safely
Contract phase (after all consumers updated):
- Migrate remaining data from old column to new column
- Remove old column
- Remove backward-compatibility code
This eliminates the "deploy database and API at the exact same millisecond" problem. Both versions work throughout the transition.
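As a sketch, the two phases can be modelled as separate pipeline runs, so the contract changes can never ship in the same deployment as the expand changes. Stage and job names here are illustrative, not part of the scenario:

```yaml
# Run 1 – expand: additive, backward-compatible changes only
stages:
- stage: ExpandSchema
  jobs:
  - job: AddNullableColumn   # e.g. add the new column as NULLable or with a default
- stage: DeployDualWriteAPI
  dependsOn: ExpandSchema    # new API version writes to BOTH old and new columns

# Run 2 – contract: triggered manually, only after all consumers are updated
# - stage: MigrateRemainingData
# - stage: DropOldColumn     # also remove the dual-write compatibility code
```

Keeping the contract stages in a separate, manually triggered run makes "after all consumers updated" an explicit human decision rather than an implicit timing assumption.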
Scenario: Nadia orders Meridian's deployment
Nadia manages a claims processing system with four tiers: SQL Database, Claims API, Notification Service, and the Claims Portal (SPA).
Her YAML pipeline uses dependsOn to enforce the order:
```yaml
stages:
- stage: Database
  jobs:
  - job: MigrateSchema
- stage: NotificationService
  dependsOn: Database
- stage: ClaimsAPI
  dependsOn: Database
- stage: Portal
  dependsOn:
  - ClaimsAPI
  - NotificationService
```

The Portal stage waits for BOTH ClaimsAPI and NotificationService to complete before deploying. If either fails, the Portal never deploys, preventing users from hitting a broken frontend.
Nadia also adds health check gates between stages. The ClaimsAPI stage does not complete until the deployed API passes a /health endpoint check. This prevents the Portal from deploying against an API that deployed but is not actually healthy.
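A minimal version of that health check gate is a script job that polls the endpoint after the deploy job and fails the stage if the API never turns healthy. The URL, service connection name, and retry counts below are illustrative:

```yaml
- stage: ClaimsAPI
  dependsOn: Database
  jobs:
  - job: Deploy
    steps:
    - task: AzureWebApp@1
      inputs:
        azureSubscription: 'claims-svc-connection'   # illustrative service connection
        appName: 'claims-api'
  - job: HealthGate
    dependsOn: Deploy
    steps:
    - bash: |
        # Poll /health up to 10 times; anything other than 200 after all retries fails the stage
        for i in $(seq 1 10); do
          code=$(curl -s -o /dev/null -w '%{http_code}' https://claims-api.azurewebsites.net/health)
          [ "$code" = "200" ] && exit 0
          sleep 15
        done
        exit 1
      displayName: Wait for /health to return 200
```

Because the Portal stage depends on ClaimsAPI, a failed health gate blocks the frontend deployment automatically.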
Minimising Downtime
Load Balancing Strategies
| Strategy | How It Works | Downtime | Use When |
|---|---|---|---|
| Deployment slots | Deploy to staging, swap to production | Zero | Azure App Service |
| Rolling update | Update pods/VMs one at a time behind LB | Zero (if enough replicas) | Kubernetes, VM Scale Sets |
| Blue-green via Traffic Manager | Switch DNS-level traffic between regions | Near-zero (DNS TTL) | Multi-region apps |
| Weighted routing | Send percentage of traffic to new deployment | Zero | Azure Front Door, Traffic Manager |
| Connection draining | Finish in-flight requests before removing instance | Zero | All LB-based strategies |
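For the rolling update row, Kubernetes expresses "update one at a time behind the load balancer" declaratively. A sketch with illustrative names, image, and counts:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: claims-api
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # at most one pod down at a time – zero downtime with 4 replicas
      maxSurge: 1         # allow one extra pod to start during the update
  selector:
    matchLabels:
      app: claims-api
  template:
    metadata:
      labels:
        app: claims-api
    spec:
      containers:
      - name: api
        image: registry.example.com/claims-api:2.4.1   # illustrative image
```

With 4 replicas and maxUnavailable: 1, at least 3 pods serve traffic at every point in the rollout, which is the "zero (if enough replicas)" condition in the table.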
Health Checks and Readiness Probes
Health checks ensure traffic only routes to healthy instances:
- Liveness probe: is the process alive? Restart if not.
- Readiness probe: can the instance serve traffic? Remove from the LB if not.
- Startup probe: is the app still starting up? Do not check liveness until startup completes.
In Azure App Service, configure the Health Check feature at /health; the platform automatically removes unhealthy instances from the load balancer rotation.
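In Kubernetes, the three probes map directly onto container spec fields. Paths, port, and timings below are illustrative:

```yaml
containers:
- name: api
  image: registry.example.com/claims-api:2.4.1
  startupProbe:             # liveness checks are suspended until this succeeds
    httpGet:
      path: /health
      port: 8080
    failureThreshold: 30    # allow up to 30 x 5s = 150s for slow startup
    periodSeconds: 5
  livenessProbe:            # restart the container if the process hangs
    httpGet:
      path: /health
      port: 8080
    periodSeconds: 10
  readinessProbe:           # remove the pod from Service endpoints while failing
    httpGet:
      path: /ready
      port: 8080
    periodSeconds: 5
```

Note the division of labour: a failing readiness probe only stops traffic; a failing liveness probe triggers a restart.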
Hotfix Path Planning
A hotfix path is a pre-planned, expedited route from code fix to production that bypasses the normal release cadence. Every team needs one BEFORE the first emergency.
Standard Flow vs Hotfix Flow
| Aspect | Standard Release | Hotfix Path |
|---|---|---|
| Trigger | Sprint end / release cadence | Critical production bug (P0/P1) |
| Branch source | Feature branch from main/develop | Hotfix branch from release tag or main |
| Testing | Full regression, UAT, performance | Targeted fix validation + smoke tests |
| Approval | Normal approval gates | Expedited approval (on-call lead + 1 reviewer) |
| Environments | Dev to Staging to Production | Hotfix env to Production (skip lower envs) |
| Deployment | Scheduled maintenance window | Immediate (ASAP) |
| Post-deploy | Standard monitoring | Enhanced monitoring + incident bridge open |
| Merge back | N/A (already in main) | Cherry-pick or merge hotfix branch back to main AND develop |
Hotfix Branching Approaches
Git Flow hotfix: Create hotfix/critical-fix from the main (or release) branch. Fix, test, deploy. Merge back into BOTH main and develop to prevent regression.
Trunk-based hotfix: Cherry-pick the fix commit from a feature branch (or commit directly to main if CI is fast enough). Deploy from main. The fix is already in the trunk.
Release branch hotfix: If you maintain release branches (release/2.4), apply the fix to the release branch, deploy, then cherry-pick to main for the next release.
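A dedicated hotfix pipeline can encode the expedited path: trigger on the branch naming convention, run only the targeted tests, skip the lower environments, and keep one approval gate on the production environment. Branch pattern, stage names, and environment name here are illustrative:

```yaml
trigger:
  branches:
    include:
    - hotfix/*              # any hotfix/* branch starts the expedited path

stages:
- stage: TargetedTests      # only the test subset relevant to the fix, plus smoke tests
  jobs:
  - job: SmokeTests
- stage: HotfixEnv          # production-like config; Dev and Staging are skipped
  dependsOn: TargetedTests
- stage: Production
  dependsOn: HotfixEnv
  jobs:
  - deployment: Deploy
    environment: production # the environment carries the expedited approval check
```

Attaching the approval to the environment (rather than the pipeline) means the "on-call lead + 1 reviewer" gate applies no matter which pipeline targets production.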
Exam tip: Hotfix path questions
The exam often presents a scenario: "Production is down. The team has a fix ready. What is the FASTEST safe path to production?"
Key principles:
- A hotfix path MUST still have at least one approval gate (no rogue deploys)
- Automated tests must run, but only the subset relevant to the fix
- The fix MUST be merged back to the main development branch after deployment
- Skip lower environments only if you have a dedicated hotfix environment with production-like config
- Document the expedited process BEFORE you need it; decisions made during incidents are worse than decisions made calmly
Resiliency Strategies for Deployment
Resiliency is not just about the application β your deployment pipeline itself must be resilient.
Application Resiliency Patterns
| Pattern | What It Does | When to Use |
|---|---|---|
| Retry with backoff | Retry failed requests with increasing delays | Transient failures (network blips, throttling) |
| Circuit breaker | Stop calling a failing service, return fallback | Downstream service is consistently failing |
| Bulkhead | Isolate resources per consumer/feature | Prevent one failing feature from taking down everything |
| Graceful degradation | Disable non-critical features during partial outages | Maintain core functionality when dependencies fail |
| Immutable infrastructure | Never patch in place; replace with new instances | Eliminate configuration drift, ensure consistency |
Pipeline Resiliency Patterns
- Automatic rollback: if post-deployment health checks fail, automatically redeploy the previous version
- Deployment gates: automated quality gates between stages (Azure Monitor alerts, SonarQube quality gate, custom API checks)
- Approval timeouts: approvals expire after a window to prevent stale deployments sitting in the pipeline
- Retry on transient failure: configure pipeline tasks to retry on infrastructure errors (network timeout, agent unavailable)
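The last point has direct YAML support: Azure Pipelines tasks accept a retryCountOnTaskFailure property. A minimal sketch (service connection and app name are illustrative):

```yaml
steps:
- task: AzureWebApp@1
  retryCountOnTaskFailure: 3   # re-run the task up to 3 more times if it fails
  inputs:
    azureSubscription: 'claims-svc-connection'  # illustrative service connection
    appName: 'claims-api'
```

Retries help with transient infrastructure errors; they do not mask genuine deployment failures, which still fail the job after the retries are exhausted.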
Automatic Rollback Configuration
In Azure Pipelines, configure automatic rollback using the on: failure hook:
```yaml
stages:
- stage: Production
  jobs:
  - deployment: Deploy
    strategy:
      runOnce:
        deploy:
          steps:
          - task: AzureWebApp@1
            inputs:
              azureSubscription: 'claims-svc-connection'  # illustrative service connection
              appName: 'claims-api'
        on:
          failure:
            steps:
            # Swap the last-known-good build in the staging slot back into production
            # (assumes the staging slot still holds the previous version)
            - task: AzureAppServiceManage@0
              inputs:
                azureSubscription: 'claims-svc-connection'
                Action: 'Swap Slots'
                WebAppName: 'claims-api'
                ResourceGroupName: 'claims-rg'            # illustrative resource group
                SourceSlot: 'staging'
                SwapWithProduction: true
```
In GitHub Actions, use a separate rollback job that runs if: failure() and references the previous stable deployment.
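A GitHub Actions sketch of that pattern follows. The deploy script and the LAST_GOOD_SHA repository variable are illustrative stand-ins for however your pipeline records its previous stable deployment:

```yaml
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - name: Deploy claims-api
      run: ./deploy.sh ${{ github.sha }}            # illustrative deploy script

  rollback:
    needs: deploy
    if: failure()             # runs only when the deploy job failed
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - name: Redeploy previous stable version
      run: ./deploy.sh ${{ vars.LAST_GOOD_SHA }}    # illustrative repository variable
```

The key mechanic is if: failure() on a job that needs the deploy job: the rollback job is skipped on success and activated automatically on failure.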
Check Your Understanding
Nadia's team deploys a multi-tier application: SQL Database, Claims API, Notification Service, and Portal SPA. The Portal calls the Claims API, which calls the Database. What is the correct deployment order?
Production is down due to a critical bug. The team has a fix ready and tested locally. The normal release process takes 4 hours with full regression testing. What should the team do?
Jordan configures a Kubernetes deployment with both liveness and readiness probes. During a rolling update, a new pod starts but its readiness probe fails for 30 seconds while caches warm up. What happens?
Next up: Deployment Implementations: Containers, Scripts and Databases