Pipeline Maintenance: Health, Migration & Retention
Monitor pipeline health with failure rates and flaky tests. Optimise performance with caching and parallel jobs. Migrate classic pipelines to YAML and design retention strategies.
Why Pipeline Maintenance Matters
Think of maintaining a car.
You would not drive 100,000 km without an oil change, tyre rotation, or brake check. The car still runs… until it does not. Pipelines are the same — they work fine until one day they are slow, flaky, or fail at the worst possible moment.
Pipeline maintenance means monitoring health (are builds failing too often?), optimising speed (why does this build take 45 minutes?), managing storage (do we really need build artifacts from 2 years ago?), and modernising (migrating from the old Classic editor to YAML).
Pipeline Health Monitoring
Key Metrics
| Metric | What It Measures | Healthy Target | Red Flag |
|---|---|---|---|
| Failure rate | Percentage of pipeline runs that fail | Below 10% | Above 25% consistently |
| Mean duration | Average time from trigger to completion | Stable or decreasing | Increasing week-over-week |
| Queue time | Time a run waits for an available agent | Under 2 minutes | Over 10 minutes regularly |
| Flaky test rate | Tests that pass/fail non-deterministically | Under 2% of test suite | Over 5% — undermines CI trust |
| MTTR (Mean Time to Recovery) | How fast the team fixes a broken pipeline | Under 1 hour | Over 1 day |
Azure Pipelines Analytics
Azure DevOps provides built-in analytics for pipeline health:
- Pipeline pass rate — trend over time, filterable by branch and stage
- Test analytics — identify flaky tests by tracking tests that pass and fail on the same code
- Duration trends — spot regressions in build time
- Pipeline runs dashboard widget — add to team dashboards for visibility
Access via: Pipelines > [select a pipeline] > Analytics tab, or add the Pipeline runs widget to team dashboards.
Flaky Tests
Flaky tests are tests that produce different results (pass/fail) for the same code without any changes. They destroy CI trust — developers start ignoring failures because “it is probably just flaky.”
Common causes:
- Timing dependencies (sleep, race conditions)
- Shared test state (tests depend on execution order)
- External service dependencies (network calls in unit tests)
- Timezone and locale differences between agents
Azure DevOps flaky test detection: Azure Pipelines automatically flags tests as flaky when the same test passes and fails on the same code within a window. You can configure the system to not fail the build for known flaky tests while still tracking them for resolution.
Pipeline Optimisation
Caching
Pipeline caching stores dependencies between runs to avoid redundant downloads.
Azure Pipelines — Cache@2 task:
```yaml
- task: Cache@2
  inputs:
    key: 'npm | "$(Agent.OS)" | package-lock.json'
    path: '$(Pipeline.Workspace)/.npm'
    restoreKeys: |
      npm | "$(Agent.OS)"
  displayName: 'Cache npm packages'
```
GitHub Actions — actions/cache:
```yaml
- uses: actions/cache@v4
  with:
    path: ~/.npm
    key: npm-${{ runner.os }}-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      npm-${{ runner.os }}-
```
Cache key design: Use the lock file hash as part of the key. When dependencies change, the lock file changes, invalidating the cache and forcing a fresh install.
Parallel Jobs and Test Sharding
Parallel jobs run independent tasks simultaneously rather than sequentially:
```yaml
# Azure Pipelines — matrix strategy
jobs:
- job: Build
  strategy:
    matrix:
      linux:
        vmImage: 'ubuntu-latest'
      windows:
        vmImage: 'windows-latest'
      mac:
        vmImage: 'macOS-latest'
  pool:
    vmImage: $(vmImage)  # each matrix leg runs on the image its variables define
```
Test sharding splits a large test suite across parallel agents:
```yaml
strategy:
  parallel: 4  # split tests across 4 agents
```
Each agent runs a slice of the test suite. Wall-clock test time drops to roughly one quarter of the full-suite duration, at the cost of occupying four parallel jobs at once.
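When `parallel` is set, Azure Pipelines exposes the predefined variables `System.TotalJobsInPhase` and `System.JobPositionInPhase` so each agent can select its own slice. A minimal sketch, assuming a Jest test suite (the slicing script itself is illustrative, not a built-in feature):

```yaml
jobs:
- job: Tests
  strategy:
    parallel: 4
  steps:
  - script: |
      TOTAL=$(System.TotalJobsInPhase)
      INDEX=$(System.JobPositionInPhase)
      # Hypothetical slicing: each job runs the test files whose ordinal
      # position matches its own index modulo the total job count.
      npx jest --listTests | awk -v t="$TOTAL" -v i="$INDEX" 'NR % t == i - 1' | xargs npx jest
    displayName: 'Run test slice $(System.JobPositionInPhase) of $(System.TotalJobsInPhase)'
```

Many test runners (Jest, Playwright, NUnit via the VSTest task) have built-in sharding options that replace the hand-rolled `awk` split.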
Other Optimisation Techniques
- Incremental builds — only rebuild changed modules (supported by MSBuild, Gradle, Bazel)
- Docker layer caching — reuse unchanged layers in multi-stage builds
- Pipeline triggers — use path filters to skip pipelines when only docs change
- Cancel-in-progress — cancel running pipelines when a newer commit arrives on the same branch
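The last two techniques are expressed directly in pipeline YAML. A sketch of path filtering in Azure Pipelines (the excluded paths are illustrative):

```yaml
# Azure Pipelines: skip CI when only documentation changes
trigger:
  branches:
    include:
    - main
  paths:
    exclude:
    - docs/*
    - '*.md'
```

Cancel-in-progress is native to GitHub Actions via `concurrency`; in Azure Pipelines the closest built-in is `batch: true` on the trigger, which coalesces queued runs rather than cancelling running ones:

```yaml
# GitHub Actions: cancel superseded runs on the same branch
concurrency:
  group: ci-${{ github.ref }}
  cancel-in-progress: true
```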
Cost and Concurrency Optimisation
Concurrency Management
| Approach | How It Works | Impact |
|---|---|---|
| Parallel job licenses | Set the max concurrent pipelines (Microsoft-hosted) | More licenses = faster throughput but higher cost |
| Cancel-in-progress | Cancel older runs on same branch | Saves agent time, reduces cost |
| Path filters | Only trigger pipelines for relevant file changes | Prevents unnecessary runs |
| Scheduled pipelines | Run nightly instead of on every push for heavy tests | Reduces peak concurrency demand |
| Self-hosted agents | Use your own VMs or containers | Fixed cost per agent, no per-minute billing |
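The scheduled-pipelines row maps onto the `schedules` block in Azure Pipelines YAML. A minimal sketch for a nightly heavy-test run (cron time and display name are illustrative):

```yaml
schedules:
- cron: '0 2 * * *'        # 02:00 UTC every night
  displayName: Nightly heavy test run
  branches:
    include:
    - main
  always: false            # skip the run if nothing changed since the last one
```

Setting `always: false` avoids burning agent time on nights with no new commits.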
Exam tip: Cost vs speed trade-offs
The exam presents scenarios asking you to optimise for cost OR time. Key trade-offs:
- More parallel jobs = faster builds but higher licensing cost
- Self-hosted agents = lower per-minute cost but you pay for VM infrastructure and maintenance
- Caching = faster builds at near-zero cost (just storage) — almost always worth implementing
- Cancel-in-progress = saves money AND time — no trade-off, enable it by default
- Test sharding = faster test runs but uses more parallel agent capacity
When the exam says “minimise cost”, look for caching, cancel-in-progress, and path filters. When the exam says “minimise time”, look for parallelism, sharding, and more agents.
Retention Strategies
Retention policies determine how long pipeline artifacts, test results, and run history are kept. Too short and you lose audit trails; too long and storage costs explode.
| Artifact Type | Recommended Retention | Rationale |
|---|---|---|
| Build artifacts (binaries, packages) | 30-90 days for development branches, 1 year for release branches | Release artifacts may need redeployment; dev artifacts are disposable |
| Pipeline run history | 1-2 years | Audit and trend analysis |
| Test results | 90 days minimum | Flaky test detection needs history; compliance may require longer |
| Container images in ACR | Keep tagged releases indefinitely, purge untagged after 30 days | Use ACR retention policies and scheduled purge tasks |
| NuGet/npm packages | Keep published versions indefinitely, purge pre-release after 90 days | Downstream projects may pin specific versions |
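One way to implement the ACR row's “purge untagged after 30 days” is a scheduled pipeline that invokes the `acr purge` container command via `az acr run`. A hedged sketch — the registry name, repository filter, and service connection are placeholders:

```yaml
schedules:
- cron: '0 3 * * 0'   # weekly, Sunday 03:00 UTC
  displayName: Weekly ACR purge
  branches:
    include:
    - main
  always: true

steps:
- task: AzureCLI@2
  inputs:
    azureSubscription: 'my-service-connection'   # placeholder service connection
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: |
      # Delete untagged manifests older than 30 days in repos matching the filter.
      # Add --dry-run to the purge command to preview deletions first.
      az acr run --registry myregistry \
        --cmd "acr purge --filter 'app/.*:.*' --untagged --ago 30d" /dev/null
```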
Configuring Retention in Azure Pipelines
- Project-level settings: default retention for all pipelines (Settings > Pipelines > Retention)
- Pipeline-level override: individual pipelines can set longer retention for critical builds
- Retention leases: protect specific runs from automatic cleanup (e.g., a production release)
- Azure Artifacts retention: separate from pipeline retention — configured per feed
Migrating Classic to YAML
Classic pipelines use the visual editor (GUI) to define build and release pipelines. YAML pipelines define everything as code in an azure-pipelines.yml file checked into the repository.
| Capability | Classic (Visual Editor) | YAML (Pipeline as Code) |
|---|---|---|
| Definition | GUI-based, stored in Azure DevOps | Code in repository (azure-pipelines.yml) |
| Version control | Limited — no native Git versioning of pipeline definition | Full Git history — branch, diff, review, revert |
| Code review | No PR-based review of pipeline changes | Pipeline changes go through PR review like application code |
| Branching | One pipeline definition shared across branches | Pipeline definition can differ per branch |
| Templates | Task groups (limited reuse) | Templates with parameters (powerful composition) |
| Multi-stage | Separate Build and Release definitions | Unified stages in a single YAML file |
| Environments | Deployment groups | Environments with approval gates and checks |
| Future direction | No new features being added | All investment going into YAML |
The 6-Step Migration Process
Scenario: Nadia migrates Meridian from Classic to YAML
Nadia leads the migration of 47 classic pipelines at Meridian Insurance. Her 6-step process:
Step 1 — Inventory and prioritise: Export all classic pipeline definitions. Categorise by complexity (simple CI, multi-stage, release pipelines with gates). Start with the simplest.
Step 2 — Use the “View YAML” feature: Azure DevOps lets you view the YAML equivalent of a classic pipeline. This generates a starting point (though it often needs cleanup).
Step 3 — Convert build pipelines first: Build pipelines are simpler than release pipelines. Convert them to YAML, test on a branch, and validate that outputs match the classic version.
Step 4 — Convert release pipelines to multi-stage YAML: Classic release pipelines with stages, gates, and approvals become YAML stages with environment approvals and deployment jobs.
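The shape Step 4 produces can be sketched as a minimal multi-stage pipeline. Stage and environment names here are illustrative; the approvals and checks themselves are configured on the `production` environment in the Azure DevOps UI, not in the YAML:

```yaml
stages:
- stage: Build
  jobs:
  - job: BuildApp
    steps:
    - script: echo "compile, test, publish artifacts"

- stage: DeployProduction
  dependsOn: Build
  jobs:
  - deployment: Deploy
    environment: production   # approval gates and checks live on this environment
    strategy:
      runOnce:
        deploy:
          steps:
          - script: echo "deploy the build artifacts"
```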
Step 5 — Add YAML-only features: Leverage features that Classic does not support: templates for reuse across pipelines, conditional insertions, matrix strategies, and pipeline-as-code review through PRs.
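Templates with parameters are the biggest reuse win of Step 5. A hedged sketch — the file name, parameter name, and steps are hypothetical:

```yaml
# templates/build-test.yml — reusable steps template (hypothetical file)
parameters:
- name: nodeVersion
  type: string
  default: '20.x'

steps:
- task: NodeTool@0
  inputs:
    versionSpec: '${{ parameters.nodeVersion }}'
- script: npm ci
  displayName: 'Install dependencies'
- script: npm test
  displayName: 'Run tests'
```

A consuming pipeline then composes it in one line, overriding parameters as needed:

```yaml
steps:
- template: templates/build-test.yml
  parameters:
    nodeVersion: '18.x'
```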
Step 6 — Decommission classic pipelines: Run both classic and YAML in parallel for one sprint. Once the YAML pipeline is validated, disable (do not delete) the classic pipeline for rollback safety. Delete after 30 days of successful YAML runs.
Nadia estimates 3 months for the full migration. She tracks progress on a shared dashboard, converting 3-4 pipelines per sprint. Dmitri (VP Eng) approves the plan because YAML pipelines can be code-reviewed — a requirement Elena (compliance) has been requesting for audit purposes.
Knowledge Check
1. A team's CI pipeline takes 35 minutes. Most of the time is spent downloading npm packages (8 minutes) and running tests (22 minutes across 500 tests). How should they optimise?
2. Nadia's team wants to ensure that the build artifacts for every production release are kept indefinitely, even though the default project retention is 30 days. What should she configure?
3. Which capability is available in YAML pipelines but NOT in Classic pipelines?
🎬 Video coming soon
Pipeline Health, Migration and Retention
Next up: Testing Strategy: Shift-Left and Continuous Testing (Domain 3 continues)