AI-300 Study Guide

Domain 1: Design and Implement an MLOps Infrastructure

  • ML Workspace: Your AI Control Room Free
  • Data, Environments & Components
  • Compute Targets: Choosing the Right Engine
  • Infrastructure as Code: Provisioning at Scale
  • Git & CI/CD for ML Projects

Domain 2: Implement Machine Learning Model Lifecycle and Operations

  • MLflow: Track Every Experiment Free
  • AutoML & Hyperparameter Tuning
  • Training Pipelines: Automate Everything
  • Distributed Training: Scale to Big Data
  • Model Registration & Versioning
  • Model Approval & Responsible AI Gates
  • Deploying Models: Endpoints in Production
  • Drift, Monitoring & Retraining

Domain 3: Design and Implement a GenAIOps Infrastructure

  • Foundry: Hubs, Projects & Platform Setup Free
  • Network Security & IaC for Foundry
  • Deploying Foundation Models
  • Model Versioning & Production Strategies
  • PromptOps: Design, Compare, Version & Ship

Domain 4: Implement Generative AI Quality Assurance and Observability

  • Evaluation: Datasets, Metrics & Quality Gates Free
  • Safety Evaluations & Custom Metrics
  • Monitoring GenAI in Production
  • Cost Tracking, Logging & Debugging

Domain 5: Optimize Generative AI Systems and Model Performance

  • RAG Optimization: Better Retrieval, Better Answers Free
  • Embeddings & Hybrid Search
  • Fine-Tuning: Methods, Data & Production

Domain 2: Implement Machine Learning Model Lifecycle and Operations ⏱ ~12 min read

Distributed Training: Scale to Big Data

When your data or model doesn't fit on one machine, distribute it. Learn data parallelism, model parallelism, and how to configure distributed training in Azure ML.

Why distributed training?

β˜• Simple explanation

Imagine cooking dinner for 1,000 people.

One chef in one kitchen? Impossible. Instead, you split the work: one team preps vegetables (data parallel β€” same recipe, different ingredients). Or for a massive wedding cake, different teams build different layers at the same time (model parallel β€” different parts of the same thing).

Distributed training splits the work across multiple GPUs or machines so training that would take days on one machine takes hours across many.

Distributed training becomes necessary when:

  • Dataset is too large β€” can’t fit in memory of a single GPU/machine
  • Model is too large β€” model parameters exceed single GPU memory
  • Training is too slow β€” need to reduce wall-clock time

Azure ML supports distributed training through two main strategies: data parallelism (same model, split data) and model parallelism (same data, split model). Both work across multiple GPUs on one node or across multiple nodes in a compute cluster.

Data parallelism vs model parallelism

Two distributed training strategies

Data Parallelism
  • How it works: copy the model to each GPU; split the dataset across GPUs. Each GPU processes a mini-batch; gradients are averaged.
  • When to use: the dataset is large but the model fits on one GPU. The most common approach.
  • Scaling limit: gradient synchronization overhead at high node counts.

Model Parallelism
  • How it works: split the model across GPUs so that each GPU holds different layers or parameters.
  • When to use: the model is too large for one GPU (large language models, foundation models).
  • Scaling limit: complex setup; inter-GPU communication is the bottleneck.
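
The "gradients are averaged" step in the data-parallel row can be sketched in plain Python. This is a toy stand-in for the NCCL all-reduce, not a real collective call, and the gradient values are made up for illustration:

```python
# Toy illustration of the all-reduce "average gradients" step in data parallelism.
# Each worker computes gradients on its own mini-batch; averaging them means
# every replica applies the same update and the model copies stay in sync.

def all_reduce_mean(per_worker_grads):
    """Average gradients element-wise across workers (what an all-reduce achieves)."""
    n_workers = len(per_worker_grads)
    n_params = len(per_worker_grads[0])
    return [
        sum(g[i] for g in per_worker_grads) / n_workers
        for i in range(n_params)
    ]

# Gradients for a 3-parameter model computed on 4 different mini-batches
grads = [
    [0.4, -0.2, 0.1],
    [0.2, -0.4, 0.3],
    [0.6,  0.0, 0.1],
    [0.0, -0.2, 0.1],
]
print(all_reduce_mean(grads))  # every worker now steps with ~[0.3, -0.2, 0.15]
```

In real training this averaging runs once per backward pass, which is why gradient synchronization becomes the scaling bottleneck at high node counts.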

Configuring distributed training in Azure ML

Data parallelism with PyTorch

from azure.ai.ml import command
from azure.ai.ml.entities import ResourceConfiguration

# Assumes an authenticated client, e.g.:
#   ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# Define a distributed training job
distributed_job = command(
    code="./src",
    command="python train_distributed.py --epochs 50",
    environment="azureml:pytorch-training:1",
    compute="gpu-training-cluster",
    distribution={
        "type": "PyTorch",
        "process_count_per_instance": 4,  # 4 GPUs per node
    },
    resources=ResourceConfiguration(
        instance_count=2,  # 2 nodes = 8 GPUs total
    ),
)

returned_job = ml_client.jobs.create_or_update(distributed_job)

What’s happening:

  • distribution type "PyTorch": Azure ML sets up the torch.distributed communication backend automatically
  • process_count_per_instance=4: four processes per node, one per GPU
  • instance_count=2: two nodes with four GPUs each, so eight GPUs process data in parallel
  • Azure ML handles node coordination, environment setup, and distributed communication (NCCL backend)

The training script (data parallel)

# train_distributed.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader

# Azure ML sets the environment variables needed for distributed setup
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device(f"cuda:{local_rank}")

# Wrap the model in DDP (MyModel and dataset are defined elsewhere in the project)
model = MyModel().to(device)
model = DDP(model, device_ids=[local_rank])

# DistributedSampler gives each GPU a different, non-overlapping slice of the data
sampler = torch.utils.data.distributed.DistributedSampler(dataset)
dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)

# Training loop: same as single GPU; DDP handles gradient sync
for epoch in range(50):
    sampler.set_epoch(epoch)  # reshuffle shard assignment each epoch
    for batch in dataloader:
        # Forward, backward, optimizer step; DDP synchronizes gradients automatically
        ...

What’s happening:

  • dist.init_process_group initialises distributed communication (Azure ML sets the required environment variables)
  • Each process is assigned its GPU via the LOCAL_RANK environment variable
  • DistributedDataParallel wraps the model and handles gradient averaging across GPUs
  • DistributedSampler ensures each GPU gets a different portion of the dataset

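
The DistributedSampler behaviour in the last bullet can be mimicked in a few lines of plain Python. This is a simplified sketch: the real sampler also shuffles per epoch and pads so every rank sees the same number of samples.

```python
# Simplified sketch of how DistributedSampler assigns dataset indices to ranks:
# each rank takes every world_size-th index, starting at its own rank.

def shard_indices(dataset_len, rank, world_size):
    """Indices this rank will see (unshuffled, no padding)."""
    return list(range(rank, dataset_len, world_size))

world_size = 4  # e.g. 4 GPUs
shards = [shard_indices(10, rank, world_size) for rank in range(world_size)]
for rank, shard in enumerate(shards):
    print(f"rank {rank}: {shard}")
# Shards are disjoint and together cover the whole dataset,
# so no GPU wastes compute on samples another GPU already processed.
```
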
Scenario: Dr. Luca trains on genomics data across 16 GPUs

GenomeVault has a variant-calling model that takes 3 days to train on a single A100 GPU. The dataset is 2TB of genomic sequences.

Dr. Luca’s distributed training setup:

  • 4 nodes, each with 4 A100 GPUs = 16 GPUs total
  • Data parallelism β€” the model fits on one GPU but the dataset is massive
  • DistributedSampler splits the 2TB across all 16 GPUs
  • Training time: 3 days → ~5 hours (a ~14x speedup; not perfectly linear because of communication overhead)
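
The scenario's arithmetic is easy to verify. The efficiency figure below is derived from the stated 14x speedup, not a measured number:

```python
# Back-of-envelope speedup check for the GenomeVault scenario.
single_gpu_hours = 3 * 24  # 3 days on one A100
n_gpus = 16

ideal_hours = single_gpu_hours / n_gpus          # perfect linear scaling: 4.5 h
observed_speedup = 14                            # from the scenario
observed_hours = single_gpu_hours / observed_speedup
efficiency = observed_speedup / n_gpus           # fraction of ideal scaling

print(f"ideal: {ideal_hours:.1f} h, observed: {observed_hours:.1f} h, "
      f"efficiency: {efficiency:.0%}")
# ideal 4.5 h; observed roughly 5 h; scaling efficiency just under 90%
```
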

Prof. Sarah Lin: β€œWe can now iterate on model architectures weekly instead of monthly.”

Distributed training with TensorFlow/Horovod

distributed_job = command(
    code="./src",
    command="python train_tensorflow.py",
    environment="azureml:tensorflow-training:1",
    compute="gpu-training-cluster",
    distribution={
        "type": "TensorFlow",
        "worker_count": 8,  # 8 workers total
    },
    resources=ResourceConfiguration(
        instance_count=2,
    ),
)

For MPI-based frameworks (Horovod):

distribution={
    "type": "Mpi",
    "process_count_per_instance": 4,
}

πŸ’‘ Exam tip: Distribution types

The exam expects you to know which distribution type to use:

  • PyTorch: type: "PyTorch" β€” uses torch.distributed with NCCL backend
  • TensorFlow: type: "TensorFlow" β€” uses tf.distribute.Strategy
  • MPI: type: "Mpi" β€” for Horovod or custom MPI-based frameworks

The most common exam scenario involves PyTorch data parallelism with DistributedDataParallel.
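
For quick revision, the three distribution blocks from this lesson side by side (values copied from the examples above; the worker and process counts are scenario-specific, not defaults):

```python
# The three Azure ML distribution blocks covered in this lesson.
pytorch_dist = {"type": "PyTorch", "process_count_per_instance": 4}  # one process per GPU
tensorflow_dist = {"type": "TensorFlow", "worker_count": 8}          # total workers across nodes
mpi_dist = {"type": "Mpi", "process_count_per_instance": 4}          # Horovod / custom MPI

for dist_block in (pytorch_dist, tensorflow_dist, mpi_dist):
    print(dist_block["type"], dist_block)
```
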

Key considerations for distributed training

Batch size
  • Impact: scale the batch size with GPU count (linear scaling rule)
  • Recommendation: if a single GPU uses batch_size=32, 8 GPUs use a total batch_size=256

Learning rate
  • Impact: scale with batch size, or use warm-up
  • Recommendation: linear scaling rule: 8x batch = 8x learning rate (with warm-up)

Communication
  • Impact: gradient synchronization is the bottleneck
  • Recommendation: use InfiniBand-enabled VMs (ND-series) for multi-node jobs

Checkpointing
  • Impact: saves the model regularly in case of failures
  • Recommendation: essential for multi-hour jobs; checkpoint every N epochs
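
The batch-size and learning-rate rows can be sketched as follows. The linear scaling rule is a common heuristic rather than a guarantee, and the base values and warm-up length here are illustrative:

```python
# Linear-scaling-rule sketch: scale total batch size and learning rate with
# GPU count, and ramp the LR linearly over a warm-up period to avoid early
# divergence at the larger learning rate.

def scaled_hyperparams(base_batch, base_lr, n_gpus):
    """Linear scaling rule: total batch and target LR both scale with n_gpus."""
    return base_batch * n_gpus, base_lr * n_gpus

def warmup_lr(step, warmup_steps, target_lr):
    """Linear warm-up from near zero to target_lr, then constant."""
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps
    return target_lr

batch, lr = scaled_hyperparams(base_batch=32, base_lr=0.01, n_gpus=8)
print(batch, lr)                # 8 GPUs: total batch 256, target LR 0.08
print(warmup_lr(0, 500, lr))    # tiny LR at the first step
print(warmup_lr(499, 500, lr))  # full target LR at the end of warm-up
```
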

Key terms flashcards

Q: Data parallelism vs model parallelism?
A: Data parallelism: copy the model to each GPU and split the dataset. The most common approach; best when the model fits on one GPU. Model parallelism: split the model across GPUs. Required when the model is too large for one GPU.

Q: What is DistributedDataParallel (DDP) in PyTorch?
A: A wrapper that replicates the model on each GPU and automatically synchronizes gradients during backpropagation. The standard way to do data-parallel training in PyTorch.

Q: Why checkpoint during distributed training?
A: Multi-hour or multi-day jobs across multiple nodes are vulnerable to failures (node crashes, pre-emption). Checkpointing saves progress so training can resume from the last checkpoint instead of starting over.

Knowledge check

  • Dr. Luca has a model that fits on a single GPU but a 2TB dataset. Training takes 3 days on one GPU. He has a cluster with 4 nodes, each with 4 GPUs. What strategy should he use?

  • Kai is configuring a PyTorch distributed training job in Azure ML. He needs 2 nodes with 4 GPUs each. What distribution configuration should he use?



Next up: Model Registration & Versioning β€” from experiment to production-ready artifact.



© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.