Pipeline Maintenance: Health, Migration & Retention
Monitor pipeline health with failure rates and flaky tests. Optimise performance with caching and parallel jobs. Migrate classic pipelines to YAML and design retention strategies.
Why Pipeline Maintenance Matters
Think of maintaining a car.
You would not drive 100,000 km without an oil change, tyre rotation, or brake check. The car still runs… until it does not. Pipelines are the same — they work fine until one day they are slow, flaky, or fail at the worst possible moment.
Pipeline maintenance means monitoring health (are builds failing too often?), optimising speed (why does this build take 45 minutes?), managing storage (do we really need build artifacts from 2 years ago?), and modernising (migrating from the old Classic editor to YAML).
Pipeline Health Monitoring
Key Metrics
| Metric | What It Measures | Healthy Target | Red Flag |
|---|---|---|---|
| Failure rate | Percentage of pipeline runs that fail | Below 10% | Above 25% consistently |
| Mean duration | Average time from trigger to completion | Stable or decreasing | Increasing week-over-week |
| Queue time | Time a run waits for an available agent | Under 2 minutes | Over 10 minutes regularly |
| Flaky test rate | Tests that pass/fail non-deterministically | Under 2% of test suite | Over 5% — undermines CI trust |
| MTTR (Mean Time to Recovery) | How fast the team fixes a broken pipeline | Under 1 hour | Over 1 day |
Azure Pipelines Analytics
Azure DevOps provides built-in analytics for pipeline health:
- Pipeline pass rate — trend over time, filterable by branch and stage
- Test analytics — identify flaky tests by tracking tests that pass and fail on the same code
- Duration trends — spot regressions in build time
- Pipeline runs dashboard widget — add to team dashboards for visibility
Access via: Pipelines > [select a pipeline] > Analytics tab, or add the Pipeline runs widget to team dashboards.
Flaky Tests
Flaky tests are tests that produce different results (pass/fail) for the same code without any changes. They destroy CI trust — developers start ignoring failures because “it is probably just flaky.”
Common causes:
- Timing dependencies (sleep, race conditions)
- Shared test state (tests depend on execution order)
- External service dependencies (network calls in unit tests)
- Timezone and locale differences between agents
Azure DevOps flaky test detection: Azure Pipelines automatically flags tests as flaky when the same test passes and fails on the same code within a window. You can configure the system to not fail the build for known flaky tests while still tracking them for resolution.
Pipeline Optimisation
Caching
Pipeline caching stores dependencies between runs to avoid redundant downloads.
Azure Pipelines — Cache@2 task:
```yaml
- task: Cache@2
  inputs:
    key: 'npm | "$(Agent.OS)" | package-lock.json'
    path: '$(Pipeline.Workspace)/.npm'
    restoreKeys: |
      npm | "$(Agent.OS)"
  displayName: 'Cache npm packages'
```
GitHub Actions — actions/cache:
```yaml
- uses: actions/cache@v4
  with:
    path: ~/.npm
    key: npm-${{ runner.os }}-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      npm-${{ runner.os }}-
```
Cache key design: Use the lock file hash as part of the key. When dependencies change, the lock file changes, invalidating the cache and forcing a fresh install.
Parallel Jobs and Test Sharding
Parallel jobs run independent tasks simultaneously rather than sequentially:
```yaml
# Azure Pipelines — matrix strategy
jobs:
- job: Build
  strategy:
    matrix:
      linux:
        vmImage: 'ubuntu-latest'
      windows:
        vmImage: 'windows-latest'
      mac:
        vmImage: 'macOS-latest'
  pool:
    vmImage: $(vmImage)  # each matrix leg runs on the image its variables define
```
Test sharding splits a large test suite across parallel agents:
```yaml
strategy:
  parallel: 4  # split tests across 4 agents
```
Each agent runs a slice of the test suite. Wall-clock test time drops to roughly one quarter of the full-suite duration, at the cost of occupying four parallel jobs at once.
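When `parallel` is set, Azure Pipelines exposes the predefined variables `System.TotalJobsInPhase` and `System.JobPositionInPhase` so each agent can select its own slice. A minimal sketch, assuming a Jest test suite (the slicing script itself is illustrative, not a built-in feature):

```yaml
jobs:
- job: Tests
  strategy:
    parallel: 4
  steps:
  - script: |
      TOTAL=$(System.TotalJobsInPhase)
      INDEX=$(System.JobPositionInPhase)
      # Hypothetical slicing: each job runs the test files whose ordinal
      # position matches its own index modulo the total job count.
      npx jest --listTests | awk -v t="$TOTAL" -v i="$INDEX" 'NR % t == i - 1' | xargs npx jest
    displayName: 'Run test slice $(System.JobPositionInPhase) of $(System.TotalJobsInPhase)'
```

Many test runners (Jest, Playwright, NUnit via the VSTest task) have built-in sharding options that replace the hand-rolled `awk` split.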
Other Optimisation Techniques
- Incremental builds — only rebuild changed modules (supported by MSBuild, Gradle, Bazel)
- Docker layer caching — reuse unchanged layers in multi-stage builds
- Pipeline triggers — use path filters to skip pipelines when only docs change
- Cancel-in-progress — cancel running pipelines when a newer commit arrives on the same branch
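The last two techniques are expressed directly in pipeline YAML. A sketch of path filtering in Azure Pipelines (the excluded paths are illustrative):

```yaml
# Azure Pipelines: skip CI when only documentation changes
trigger:
  branches:
    include:
    - main
  paths:
    exclude:
    - docs/*
    - '*.md'
```

Cancel-in-progress is native to GitHub Actions via `concurrency`; in Azure Pipelines the closest built-in is `batch: true` on the trigger, which coalesces queued runs rather than cancelling running ones:

```yaml
# GitHub Actions: cancel superseded runs on the same branch
concurrency:
  group: ci-${{ github.ref }}
  cancel-in-progress: true
```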
Cost and Concurrency Optimisation
Concurrency Management
| Approach | How It Works | Impact |
|---|---|---|
| Parallel job licenses | Set the max concurrent pipelines (Microsoft-hosted) | More licenses = faster throughput but higher cost |
| Cancel-in-progress | Cancel older runs on same branch | Saves agent time, reduces cost |
| Path filters | Only trigger pipelines for relevant file changes | Prevents unnecessary runs |
| Scheduled pipelines | Run nightly instead of on every push for heavy tests | Reduces peak concurrency demand |
| Self-hosted agents | Use your own VMs or containers | Fixed cost per agent, no per-minute billing |
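The scheduled-pipelines row maps onto the `schedules` block in Azure Pipelines YAML. A minimal sketch for a nightly heavy-test run (cron time and display name are illustrative):

```yaml
schedules:
- cron: '0 2 * * *'        # 02:00 UTC every night
  displayName: Nightly heavy test run
  branches:
    include:
    - main
  always: false            # skip the run if nothing changed since the last one
```

Setting `always: false` avoids burning agent time on nights with no new commits.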
Exam tip: Cost vs speed trade-offs
The exam presents scenarios asking you to optimise for cost OR time. Key trade-offs:
- More parallel jobs = faster builds but higher licensing cost
- Self-hosted agents = lower per-minute cost but you pay for VM infrastructure and maintenance
- Caching = faster builds at near-zero cost (just storage) — almost always worth implementing
- Cancel-in-progress = saves money AND time — no trade-off, enable it by default
- Test sharding = faster test runs but uses more parallel agent capacity
When the exam says “minimise cost”, look for caching, cancel-in-progress, and path filters. When the exam says “minimise time”, look for parallelism, sharding, and more agents.
Retention Strategies
Retention policies determine how long pipeline artifacts, test results, and run history are kept. Too short and you lose audit trails; too long and storage costs explode.
| Artifact Type | Recommended Retention | Rationale |
|---|---|---|
| Build artifacts (binaries, packages) | 30-90 days for development branches, 1 year for release branches | Release artifacts may need redeployment; dev artifacts are disposable |
| Pipeline run history | 1-2 years | Audit and trend analysis |
| Test results | 90 days minimum | Flaky test detection needs history; compliance may require longer |
| Container images in ACR | Keep tagged releases indefinitely, purge untagged after 30 days | Use ACR retention policies and scheduled purge tasks |
| NuGet/npm packages | Keep published versions indefinitely, purge pre-release after 90 days | Downstream projects may pin specific versions |
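One way to implement the ACR row's “purge untagged after 30 days” is a scheduled pipeline that invokes the `acr purge` container command via `az acr run`. A hedged sketch — the registry name, repository filter, and service connection are placeholders:

```yaml
schedules:
- cron: '0 3 * * 0'   # weekly, Sunday 03:00 UTC
  displayName: Weekly ACR purge
  branches:
    include:
    - main
  always: true

steps:
- task: AzureCLI@2
  inputs:
    azureSubscription: 'my-service-connection'   # placeholder service connection
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: |
      # Delete untagged manifests older than 30 days in repos matching the filter.
      # Add --dry-run to the purge command to preview deletions first.
      az acr run --registry myregistry \
        --cmd "acr purge --filter 'app/.*:.*' --untagged --ago 30d" /dev/null
```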
Configuring Retention in Azure Pipelines
- Project-level settings: default retention for all pipelines (Settings > Pipelines > Retention)
- Pipeline-level override: individual pipelines can set longer retention for critical builds
- Retention leases: protect specific runs from automatic cleanup (e.g., a production release)
- Azure Artifacts retention: separate from pipeline retention — configured per feed
Migrating Classic to YAML
Classic pipelines use the visual editor (GUI) to define build and release pipelines. YAML pipelines define everything as code in an azure-pipelines.yml file checked into the repository.
| Capability | Classic (Visual Editor) | YAML (Pipeline as Code) |
|---|---|---|
| Definition | GUI-based, stored in Azure DevOps | Code in repository (azure-pipelines.yml) |
| Version control | Limited — no native Git versioning of pipeline definition | Full Git history — branch, diff, review, revert |
| Code review | No PR-based review of pipeline changes | Pipeline changes go through PR review like application code |
| Branching | One pipeline definition shared across branches | Pipeline definition can differ per branch |
| Templates | Task groups (limited reuse) | Templates with parameters (powerful composition) |
| Multi-stage | Separate Build and Release definitions | Unified stages in a single YAML file |
| Environments | Deployment groups | Environments with approval gates and checks |
| Future direction | No new features being added | All investment going into YAML |
The 6-Step Migration Process
Scenario: Nadia migrates Meridian from Classic to YAML
Nadia leads the migration of 47 classic pipelines at Meridian Insurance. Her 6-step process:
Step 1 — Inventory and prioritise: Export all classic pipeline definitions. Categorise by complexity (simple CI, multi-stage, release pipelines with gates). Start with the simplest.
Step 2 — Use the “View YAML” feature: Azure DevOps lets you view the YAML equivalent of a classic pipeline. This generates a starting point (though it often needs cleanup).
Step 3 — Convert build pipelines first: Build pipelines are simpler than release pipelines. Convert them to YAML, test on a branch, and validate that outputs match the classic version.
Step 4 — Convert release pipelines to multi-stage YAML: Classic release pipelines with stages, gates, and approvals become YAML stages with environment approvals and deployment jobs.
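The shape Step 4 produces can be sketched as a minimal multi-stage pipeline. Stage and environment names here are illustrative; the approvals and checks themselves are configured on the `production` environment in the Azure DevOps UI, not in the YAML:

```yaml
stages:
- stage: Build
  jobs:
  - job: BuildApp
    steps:
    - script: echo "compile, test, publish artifacts"

- stage: DeployProduction
  dependsOn: Build
  jobs:
  - deployment: Deploy
    environment: production   # approval gates and checks live on this environment
    strategy:
      runOnce:
        deploy:
          steps:
          - script: echo "deploy the build artifacts"
```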
Step 5 — Add YAML-only features: Leverage features that Classic does not support: templates for reuse across pipelines, conditional insertions, matrix strategies, and pipeline-as-code review through PRs.
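Templates with parameters are the biggest reuse win of Step 5. A hedged sketch — the file name, parameter name, and steps are hypothetical:

```yaml
# templates/build-test.yml — reusable steps template (hypothetical file)
parameters:
- name: nodeVersion
  type: string
  default: '20.x'

steps:
- task: NodeTool@0
  inputs:
    versionSpec: '${{ parameters.nodeVersion }}'
- script: npm ci
  displayName: 'Install dependencies'
- script: npm test
  displayName: 'Run tests'
```

A consuming pipeline then composes it in one line, overriding parameters as needed:

```yaml
steps:
- template: templates/build-test.yml
  parameters:
    nodeVersion: '18.x'
```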
Step 6 — Decommission classic pipelines: Run both classic and YAML in parallel for one sprint. Once the YAML pipeline is validated, disable (do not delete) the classic pipeline for rollback safety. Delete after 30 days of successful YAML runs.
Nadia estimates 3 months for the full migration. She tracks progress on a shared dashboard, converting 3-4 pipelines per sprint. Dmitri (VP Eng) approves the plan because YAML pipelines can be code-reviewed — a requirement Elena (compliance) has been requesting for audit purposes.
Knowledge Check
1. A team's CI pipeline takes 35 minutes. Most of the time is spent downloading npm packages (8 minutes) and running tests (22 minutes across 500 tests). How should they optimise?
2. Nadia's team wants to ensure that the build artifacts for every production release are kept indefinitely, even though the default project retention is 30 days. What should she configure?
3. Which capability is available in YAML pipelines but NOT in Classic pipelines?
🎬 Video coming soon
Pipeline Health, Migration and Retention
Next up: Testing Strategy: Shift-Left and Continuous Testing (Domain 3 continues)