Safe Rollouts: Slots, Dependencies & Hotfix Paths
Ensure reliable deployments with dependency ordering, deployment slot swaps, hotfix planning, and resiliency strategies. Minimise downtime with load balancing and rolling updates.
Why Safe Rollouts Require Planning
Think of moving house.
You cannot set up the TV before the power is connected. You cannot unpack kitchen boxes before the shelves are assembled. There is a natural order: electricity first, then furniture, then electronics. If you do it out of order, things break or you waste time redoing work.
Safe rollouts follow the same principle. Deploy the database changes before the API that needs them. Deploy the API before the frontend that calls it. Get the order wrong, and users see errors. Get it right, and nobody notices you shipped anything at all.
Dependency Deployment Ordering
When your application has multiple tiers (database, API, frontend, background workers), deployment order matters. The golden rule: deploy bottom-up, with infrastructure and data layers first and presentation layers last.
The Deployment Order
1. Database schema changes (expand phase)
2. Background services / workers
3. Backend APIs
4. API gateways / BFF layers
5. Frontend applications
6. Database cleanup (contract phase, after old code is fully retired)
The Expand-Contract Pattern
The expand-contract pattern (also called parallel change) ensures backward compatibility during multi-service deployments:
Expand phase:
- Add the new database column (nullable or with default value)
- Deploy new API version that writes to BOTH old and new columns
- Old and new API versions coexist safely
Contract phase (after all consumers updated):
- Migrate remaining data from old column to new column
- Remove old column
- Remove backward-compatibility code
This eliminates the "deploy database and API at the exact same millisecond" problem. Both versions work throughout the transition.
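As a sketch, the two phases can be modelled as separate pipeline runs, so the contract changes can never ship in the same deployment as the expand changes. Stage and job names here are illustrative, not part of the scenario:

```yaml
# Run 1 – expand: additive, backward-compatible changes only
stages:
- stage: ExpandSchema
  jobs:
  - job: AddNullableColumn   # e.g. add the new column as NULLable or with a default
- stage: DeployDualWriteAPI
  dependsOn: ExpandSchema    # new API version writes to BOTH old and new columns

# Run 2 – contract: triggered manually, only after all consumers are updated
# - stage: MigrateRemainingData
# - stage: DropOldColumn     # also remove the dual-write compatibility code
```

Keeping the contract stages in a separate, manually triggered run makes "after all consumers updated" an explicit human decision rather than an implicit timing assumption.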
Scenario: Nadia orders Meridian's deployment
Nadia manages a claims processing system with four tiers: SQL Database, Claims API, Notification Service, and the Claims Portal (SPA).
Her YAML pipeline uses dependsOn to enforce the order:
```yaml
stages:
- stage: Database
  jobs:
  - job: MigrateSchema
- stage: NotificationService
  dependsOn: Database
- stage: ClaimsAPI
  dependsOn: Database
- stage: Portal
  dependsOn:
  - ClaimsAPI
  - NotificationService
```

The Portal stage waits for BOTH ClaimsAPI and NotificationService to complete before deploying. If either fails, the Portal never deploys, preventing users from hitting a broken frontend.
Nadia also adds health check gates between stages. The ClaimsAPI stage does not complete until the deployed API passes a /health endpoint check. This prevents the Portal from deploying against an API that deployed but is not actually healthy.
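A minimal version of that health check gate is a script job that polls the endpoint after the deploy job and fails the stage if the API never turns healthy. The URL, service connection name, and retry counts below are illustrative:

```yaml
- stage: ClaimsAPI
  dependsOn: Database
  jobs:
  - job: Deploy
    steps:
    - task: AzureWebApp@1
      inputs:
        azureSubscription: 'claims-svc-connection'   # illustrative service connection
        appName: 'claims-api'
  - job: HealthGate
    dependsOn: Deploy
    steps:
    - bash: |
        # Poll /health up to 10 times; anything other than 200 after all retries fails the stage
        for i in $(seq 1 10); do
          code=$(curl -s -o /dev/null -w '%{http_code}' https://claims-api.azurewebsites.net/health)
          [ "$code" = "200" ] && exit 0
          sleep 15
        done
        exit 1
      displayName: Wait for /health to return 200
```

Because the Portal stage depends on ClaimsAPI, a failed health gate blocks the frontend deployment automatically.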
Minimising Downtime
Load Balancing Strategies
| Strategy | How It Works | Downtime | Use When |
|---|---|---|---|
| Deployment slots | Deploy to staging, swap to production | Zero | Azure App Service |
| Rolling update | Update pods/VMs one at a time behind LB | Zero (if enough replicas) | Kubernetes, VM Scale Sets |
| Blue-green via Traffic Manager | Switch DNS-level traffic between regions | Near-zero (DNS TTL) | Multi-region apps |
| Weighted routing | Send percentage of traffic to new deployment | Zero | Azure Front Door, Traffic Manager |
| Connection draining | Finish in-flight requests before removing instance | Zero | All LB-based strategies |
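For the rolling update row, Kubernetes expresses "update one at a time behind the load balancer" declaratively. A sketch with illustrative names, image, and counts:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: claims-api
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # at most one pod down at a time – zero downtime with 4 replicas
      maxSurge: 1         # allow one extra pod to start during the update
  selector:
    matchLabels:
      app: claims-api
  template:
    metadata:
      labels:
        app: claims-api
    spec:
      containers:
      - name: api
        image: registry.example.com/claims-api:2.4.1   # illustrative image
```

With 4 replicas and maxUnavailable: 1, at least 3 pods serve traffic at every point in the rollout, which is the "zero (if enough replicas)" condition in the table.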
Health Checks and Readiness Probes
Health checks ensure traffic only routes to healthy instances:
- Liveness probe: is the process alive? Restart if not.
- Readiness probe: can the instance serve traffic? Remove from the LB if not.
- Startup probe: is the app still starting up? Do not check liveness until startup completes.
In Azure App Service, configure the Health Check feature at /health; the platform automatically removes unhealthy instances from the load balancer rotation.
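In Kubernetes, the three probes map directly onto container spec fields. Paths, port, and timings below are illustrative:

```yaml
containers:
- name: api
  image: registry.example.com/claims-api:2.4.1
  startupProbe:             # liveness checks are suspended until this succeeds
    httpGet:
      path: /health
      port: 8080
    failureThreshold: 30    # allow up to 30 x 5s = 150s for slow startup
    periodSeconds: 5
  livenessProbe:            # restart the container if the process hangs
    httpGet:
      path: /health
      port: 8080
    periodSeconds: 10
  readinessProbe:           # remove the pod from Service endpoints while failing
    httpGet:
      path: /ready
      port: 8080
    periodSeconds: 5
```

Note the division of labour: a failing readiness probe only stops traffic; a failing liveness probe triggers a restart.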
Hotfix Path Planning
A hotfix path is a pre-planned, expedited route from code fix to production that bypasses the normal release cadence. Every team needs one BEFORE the first emergency.
Standard Flow vs Hotfix Flow
| Aspect | Standard Release | Hotfix Path |
|---|---|---|
| Trigger | Sprint end / release cadence | Critical production bug (P0/P1) |
| Branch source | Feature branch from main/develop | Hotfix branch from release tag or main |
| Testing | Full regression, UAT, performance | Targeted fix validation + smoke tests |
| Approval | Normal approval gates | Expedited approval (on-call lead + 1 reviewer) |
| Environments | Dev to Staging to Production | Hotfix env to Production (skip lower envs) |
| Deployment | Scheduled maintenance window | Immediate (ASAP) |
| Post-deploy | Standard monitoring | Enhanced monitoring + incident bridge open |
| Merge back | N/A (already in main) | Cherry-pick or merge hotfix branch back to main AND develop |
Hotfix Branching Approaches
Git Flow hotfix: Create hotfix/critical-fix from the main (or release) branch. Fix, test, deploy. Merge back into BOTH main and develop to prevent regression.
Trunk-based hotfix: Cherry-pick the fix commit from a feature branch (or commit directly to main if CI is fast enough). Deploy from main. The fix is already in the trunk.
Release branch hotfix: If you maintain release branches (release/2.4), apply the fix to the release branch, deploy, then cherry-pick to main for the next release.
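A dedicated hotfix pipeline can encode the expedited path: trigger on the branch naming convention, run only the targeted tests, skip the lower environments, and keep one approval gate on the production environment. Branch pattern, stage names, and environment name here are illustrative:

```yaml
trigger:
  branches:
    include:
    - hotfix/*              # any hotfix/* branch starts the expedited path

stages:
- stage: TargetedTests      # only the test subset relevant to the fix, plus smoke tests
  jobs:
  - job: SmokeTests
- stage: HotfixEnv          # production-like config; Dev and Staging are skipped
  dependsOn: TargetedTests
- stage: Production
  dependsOn: HotfixEnv
  jobs:
  - deployment: Deploy
    environment: production # the environment carries the expedited approval check
```

Attaching the approval to the environment (rather than the pipeline) means the "on-call lead + 1 reviewer" gate applies no matter which pipeline targets production.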
Exam tip: Hotfix path questions
The exam often presents a scenario: "Production is down. The team has a fix ready. What is the FASTEST safe path to production?"
Key principles:
- A hotfix path MUST still have at least one approval gate (no rogue deploys)
- Automated tests must run, but only the subset relevant to the fix
- The fix MUST be merged back to the main development branch after deployment
- Skip lower environments only if you have a dedicated hotfix environment with production-like config
- Document the expedited process BEFORE you need it; decisions made during incidents are worse than decisions made calmly
Resiliency Strategies for Deployment
Resiliency is not just about the application β your deployment pipeline itself must be resilient.
Application Resiliency Patterns
| Pattern | What It Does | When to Use |
|---|---|---|
| Retry with backoff | Retry failed requests with increasing delays | Transient failures (network blips, throttling) |
| Circuit breaker | Stop calling a failing service, return fallback | Downstream service is consistently failing |
| Bulkhead | Isolate resources per consumer/feature | Prevent one failing feature from taking down everything |
| Graceful degradation | Disable non-critical features during partial outages | Maintain core functionality when dependencies fail |
| Immutable infrastructure | Never patch in place; replace with new instances | Eliminate configuration drift, ensure consistency |
Pipeline Resiliency Patterns
- Automatic rollback: if post-deployment health checks fail, automatically redeploy the previous version
- Deployment gates: automated quality gates between stages (Azure Monitor alerts, SonarQube quality gate, custom API checks)
- Approval timeouts: approvals expire after a window to prevent stale deployments sitting in the pipeline
- Retry on transient failure: configure pipeline tasks to retry on infrastructure errors (network timeout, agent unavailable)
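The last point has direct YAML support: Azure Pipelines tasks accept a retryCountOnTaskFailure property. A minimal sketch (service connection and app name are illustrative):

```yaml
steps:
- task: AzureWebApp@1
  retryCountOnTaskFailure: 3   # re-run the task up to 3 more times if it fails
  inputs:
    azureSubscription: 'claims-svc-connection'  # illustrative service connection
    appName: 'claims-api'
```

Retries help with transient infrastructure errors; they do not mask genuine deployment failures, which still fail the job after the retries are exhausted.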
Automatic Rollback Configuration
In Azure Pipelines, configure automatic rollback using the on: failure hook:
```yaml
stages:
- stage: Production
  jobs:
  - deployment: Deploy
    strategy:
      runOnce:
        deploy:
          steps:
          - task: AzureWebApp@1
            inputs:
              azureSubscription: 'claims-svc-connection'  # illustrative service connection
              appName: 'claims-api'
        on:
          failure:
            steps:
            # Swap the last-known-good build in the staging slot back into production
            # (assumes the staging slot still holds the previous version)
            - task: AzureAppServiceManage@0
              inputs:
                azureSubscription: 'claims-svc-connection'
                Action: 'Swap Slots'
                WebAppName: 'claims-api'
                ResourceGroupName: 'claims-rg'            # illustrative resource group
                SourceSlot: 'staging'
                SwapWithProduction: true
```
In GitHub Actions, use a separate rollback job that runs if: failure() and references the previous stable deployment.
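A GitHub Actions sketch of that pattern follows. The deploy script and the LAST_GOOD_SHA repository variable are illustrative stand-ins for however your pipeline records its previous stable deployment:

```yaml
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - name: Deploy claims-api
      run: ./deploy.sh ${{ github.sha }}            # illustrative deploy script

  rollback:
    needs: deploy
    if: failure()             # runs only when the deploy job failed
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - name: Redeploy previous stable version
      run: ./deploy.sh ${{ vars.LAST_GOOD_SHA }}    # illustrative repository variable
```

The key mechanic is if: failure() on a job that needs the deploy job: the rollback job is skipped on success and activated automatically on failure.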
Check Your Understanding
Nadia's team deploys a multi-tier application: SQL Database, Claims API, Notification Service, and Portal SPA. The Portal calls the Claims API, which calls the Database. What is the correct deployment order?
Production is down due to a critical bug. The team has a fix ready and tested locally. The normal release process takes 4 hours with full regression testing. What should the team do?
Jordan configures a Kubernetes deployment with both liveness and readiness probes. During a rolling update, a new pod starts but its readiness probe fails for 30 seconds while caches warm up. What happens?
Next up: Deployment Implementations: Containers, Scripts and Databases