Training Pipelines: Automate Everything
Stop running scripts manually. Build Azure ML pipelines that chain data prep, training, evaluation, and registration into reproducible, automated workflows.
Why pipelines?
Imagine a car assembly line vs building a car by hand.
Building by hand: one person does everything – welding, painting, engine, interior. If they're sick, nothing happens. If a step fails, you start over.
Assembly line: each station does one job. Raw metal goes in, a finished car comes out. If the painting station fails, you fix just that station. You can run the line 24/7.
ML pipelines are the assembly line for model training. Data prep → feature engineering → training → evaluation → registration. Each step is a reusable component. The whole pipeline runs automatically, logs everything, and can be triggered by GitHub Actions.
Notebooks vs scripts vs pipelines
| Feature | Reproducible | Automatable | Production-Ready | Best For |
|---|---|---|---|---|
| Notebooks (.ipynb) | Low – cell order matters | Hard – requires conversion | No | Exploration, EDA, prototyping |
| Scripts (.py) | Medium – deterministic | Yes – CLI/SDK submission | Partial | Single training jobs, simple workflows |
| Pipelines | High – defined DAG | Yes – CI/CD triggers | Yes | Production training, multi-step workflows |
Exam tip: Notebooks in production
The exam recognises notebooks for exploration and experimentation but NOT for production training. If a question asks "what should a team use for production model training," the answer is pipelines (or scripts submitted as jobs), never notebooks.
Notebooks are great for:
- Exploratory data analysis (EDA)
- Rapid prototyping
- Sharing results with stakeholders (visual outputs)
But they fail in production because:
- Cell execution order is fragile
- Hard to parameterise for different datasets
- Difficult to test and version reliably
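The parameterisation point is the key difference: a script exposes its knobs on the command line, so the same code runs unchanged against any dataset. A minimal sketch (the argument names here are illustrative, not part of any Azure ML API):

```python
# train.py – a parameterised training script (illustrative sketch)
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Train the churn model")
    parser.add_argument("--training-data", required=True, help="Path to the input data")
    parser.add_argument("--target-column", default="churned", help="Label column name")
    parser.add_argument("--learning-rate", type=float, default=0.01)
    return parser.parse_args(argv)

if __name__ == "__main__":
    args = parse_args()
    # Real training code would go here; the point is every run is
    # fully specified by its arguments, not by hidden notebook state.
    print(f"Training on {args.training_data} (target={args.target_column})")
```

The same script can then be submitted as an Azure ML command job with different datasets, which is exactly what a pipeline step automates.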
Building a pipeline with Python SDK v2
```python
from azure.ai.ml import load_component, Input
from azure.ai.ml.dsl import pipeline

# Load reusable components from YAML definitions
prepare_data = load_component(source="components/prepare/component.yaml")
train_model = load_component(source="components/train/component.yaml")
evaluate_model = load_component(source="components/evaluate/component.yaml")

@pipeline(
    display_name="churn-training-pipeline",
    compute="gpu-training-cluster",
    experiment_name="churn-pipeline-runs"
)
def churn_pipeline(raw_data: Input, target_metric: float = 0.90):
    # Step 1: Data preparation
    prep_step = prepare_data(input_data=raw_data)

    # Step 2: Training (uses output from step 1)
    train_step = train_model(
        training_data=prep_step.outputs.cleaned_data,
        target_column="churned"
    )

    # Step 3: Evaluation (uses output from step 2)
    eval_step = evaluate_model(
        model=train_step.outputs.trained_model,
        test_data=prep_step.outputs.test_data,
        threshold=target_metric
    )
    return eval_step.outputs

# Create and submit the pipeline
# (ml_client is an authenticated MLClient for your workspace)
pipeline_job = churn_pipeline(
    raw_data=Input(type="uri_folder", path="azureml:churn-data:2")
)
returned_job = ml_client.jobs.create_or_update(pipeline_job)
```
What's happening:
- load_component turns each YAML definition into a reusable building block
- The @pipeline decorator defines the workflow metadata (display name, compute target, experiment)
- The pipeline function accepts inputs, so the same pipeline is parameterised for different datasets and thresholds
- Steps are chained by wiring outputs to inputs – Azure ML infers the execution order (the DAG) from these connections, with no explicit ordering needed
- A single create_or_update call submits the entire pipeline to the cloud
Pipeline YAML definition (alternative)
You can also define pipelines in YAML (often preferred for CI/CD):
```yaml
# pipelines/training-pipeline.yaml
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: churn-training-pipeline
experiment_name: churn-pipeline-runs
compute: azureml:gpu-training-cluster

inputs:
  raw_data:
    type: uri_folder
    path: azureml:churn-data:2
  target_metric: 0.90

jobs:
  prepare:
    type: command
    component: file:components/prepare/component.yaml
    inputs:
      input_data: ${{parent.inputs.raw_data}}

  train:
    type: command
    component: file:components/train/component.yaml
    inputs:
      training_data: ${{parent.jobs.prepare.outputs.cleaned_data}}
      target_column: churned

  evaluate:
    type: command
    component: file:components/evaluate/component.yaml
    inputs:
      model: ${{parent.jobs.train.outputs.trained_model}}
      test_data: ${{parent.jobs.prepare.outputs.test_data}}
      threshold: ${{parent.inputs.target_metric}}
```
What's happening:
- The prepare job references the pipeline-level input via ${{parent.inputs.raw_data}}
- The train job references prepare's output – creating the dependency chain
- The evaluate job consumes outputs from both earlier jobs, plus the pipeline input target_metric as its threshold
Exam tip: Python DSL vs YAML pipelines
Both approaches create identical pipelines. The exam may test when to use each:
- YAML pipelines: better for CI/CD (GitHub Actions can submit them directly), version-controlled, easy to review in PRs
- Python DSL (@pipeline): better for complex logic, conditional steps, dynamic parameterisation

Most production MLOps teams use YAML for CI/CD pipelines and the Python DSL for experimentation.
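For illustration, a GitHub Actions workflow that submits the YAML pipeline with the Azure ML CLI might look like the sketch below – the workflow name, schedule, secret name, and resource-group/workspace values are all placeholder assumptions:

```yaml
# .github/workflows/train.yml – illustrative sketch
name: submit-training-pipeline
on:
  workflow_dispatch:        # manual trigger
  schedule:
    - cron: "0 6 1 * *"     # 06:00 UTC on the 1st of each month

jobs:
  submit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}   # service principal credentials
      - name: Submit pipeline job
        run: |
          az extension add -n ml
          az ml job create --file pipelines/training-pipeline.yaml \
            --resource-group my-rg --workspace-name my-workspace
```

Because the pipeline definition lives in the repo, the same PR review that changes a component also reviews how it will run.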
Scenario: Kai's automated retraining pipeline
NeuralSpark's churn model needs monthly retraining on fresh data. Kai builds a pipeline triggered by GitHub Actions on the 1st of each month:
- Data prep – pulls latest customer data, cleans, splits
- Training – trains on fresh data with the same hyperparameters
- Evaluation – compares new model against production baseline
- Gate – if new model beats baseline by more than 1%, proceed
- Registration – registers the new model in the registry

The pipeline runs unattended. If the new model isn't better, it stops at the gate and alerts the team.
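The gate itself is just a comparison against the production baseline before registration proceeds. A minimal sketch of that logic – the function is hypothetical (not an Azure ML API), and the 1% margin is read as one absolute percentage point of the metric:

```python
def passes_gate(new_metric: float, baseline_metric: float,
                min_improvement: float = 0.01) -> bool:
    """Return True only if the new model beats the baseline by more than the margin."""
    return new_metric > baseline_metric + min_improvement

# In the pipeline, the gate step runs this check and deliberately fails
# (alerting the team) instead of registering when it returns False.
print(passes_gate(0.87, 0.85))    # clear improvement -> True
print(passes_gate(0.855, 0.85))   # within the margin -> False
```

Keeping the gate as its own step means a failed check stops the pipeline cleanly, with the evaluation outputs already logged for the team to inspect.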
Step caching
Azure ML caches pipeline step outputs. If a stepβs inputs and code havenβt changed, Azure ML reuses the previous output instead of re-running.
This means:
- Changing only the training script re-runs training and evaluation, but skips data prep
- Changing the dataset re-runs everything from prep onwards
- Changing the evaluation threshold re-runs only evaluation
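Caching can also be controlled explicitly. A component marked non-deterministic is always re-run rather than reused; as a sketch, the relevant field in the component YAML (the name and command here are placeholders) looks like:

```yaml
# components/prepare/component.yaml (excerpt – illustrative)
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: prepare_data
type: command
is_deterministic: false   # opt this step out of output reuse
command: python prepare.py --input ${{inputs.input_data}}
```

A whole pipeline run can also skip the cache via the pipeline-level setting force_rerun: true under settings in the pipeline YAML.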
Knowledge check
NeuralSpark's training pipeline has 3 steps: data prep, training, and evaluation. Kai changes only the training script. What happens when the pipeline re-runs?
Dr. Fatima's compliance team requires that every production model training workflow is fully traceable and can be triggered automatically from CI/CD. What should she use?
🎬 Video coming soon
Next up: Distributed Training – scaling to datasets and models that don't fit on a single machine.