Distributed Training: Scale to Big Data
When your data or model doesn't fit on one machine, distribute it. Learn data parallelism, model parallelism, and how to configure distributed training in Azure ML.
Why distributed training?
Imagine cooking dinner for 1,000 people.
One chef in one kitchen? Impossible. Instead, you split the work: one team preps vegetables (data parallelism: same recipe, different ingredients). Or for a massive wedding cake, different teams build different layers at the same time (model parallelism: different parts of the same thing).
Distributed training splits the work across multiple GPUs or machines so training that would take days on one machine takes hours across many.
Data parallelism vs model parallelism
| Feature | How It Works | When To Use | Scaling Limit |
|---|---|---|---|
| Data Parallelism | Copy the model to each GPU; split the dataset across GPUs. Each GPU processes a mini-batch; gradients are averaged. | Dataset is large but model fits on one GPU. Most common approach. | Limited by gradient synchronization overhead at high node counts. |
| Model Parallelism | Split the model across GPUs: each GPU holds different layers or parameters. | Model is too large for one GPU (large language models, foundation models). | Complex setup; inter-GPU communication is the bottleneck. |
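The table's contrast can be sketched without any GPU at all: data parallelism runs the same full model on different slices of the batch, while model parallelism splits the layers themselves across devices. A toy sketch in plain Python (the two-layer model and the "GPU" labels are purely illustrative):

```python
def relu(v):
    return max(v, 0.0)

def full_model(x, a=2.0, b=3.0):
    """The whole toy model: z = b * relu(a * x)."""
    return b * relu(a * x)

# Model parallelism: each "GPU" holds one layer; activations flow between them
def gpu0_layer(x, a=2.0):   # "GPU 0" holds layer 1
    return relu(a * x)

def gpu1_layer(h, b=3.0):   # "GPU 1" holds layer 2
    return b * h

# Data parallelism: each "GPU" holds the whole model, different batch slices
batch = [1.0, -2.0, 0.5, 4.0]
data_parallel_out = [full_model(x) for x in batch[:2]] + \
                    [full_model(x) for x in batch[2:]]
model_parallel_out = [gpu1_layer(gpu0_layer(x)) for x in batch]

assert data_parallel_out == model_parallel_out  # same outputs, different splits
```

Either strategy computes the same function; they differ only in what is replicated (the model vs. the data) and therefore in what must be communicated (gradients vs. activations).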
Configuring distributed training in Azure ML
Data parallelism with PyTorch
```python
from azure.ai.ml import command
from azure.ai.ml.entities import ResourceConfiguration

# Define a distributed training job
distributed_job = command(
    code="./src",
    command="python train_distributed.py --epochs 50",
    environment="azureml:pytorch-training:1",
    compute="gpu-training-cluster",
    distribution={
        "type": "PyTorch",
        "process_count_per_instance": 4,  # 4 GPUs per node
    },
    resources=ResourceConfiguration(
        instance_count=2,  # 2 nodes = 8 GPUs total
    ),
)

# ml_client is an authenticated MLClient instance
returned_job = ml_client.jobs.create_or_update(distributed_job)
```
What's happening:
- `"type": "PyTorch"`: Azure ML sets up the communication backend automatically
- `"process_count_per_instance": 4`: four processes per node, one per GPU
- `instance_count=2`: 2 nodes with 4 GPUs each = 8 GPUs total processing data in parallel
- Azure ML handles node coordination, environment setup, and distributed communication (NCCL backend)
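Under the hood, the launcher exposes the standard `torch.distributed` environment variables (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`) to each worker process. A small helper for reading them, with single-process defaults (the helper name is ours, not part of any SDK):

```python
import os

def read_dist_env():
    """Read the standard torch.distributed environment variables.

    Azure ML's PyTorch launcher populates these on each worker process;
    the defaults below make the function usable in a single-process run.
    """
    return {
        "rank": int(os.environ.get("RANK", 0)),              # global process index
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),  # GPU index on this node
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),  # total process count
    }
```

With the job above, `world_size` would be 8 (2 nodes x 4 processes), and `local_rank` ranges 0-3 on each node.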
The training script (data parallel)
```python
# train_distributed.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Azure ML sets the environment variables for the distributed setup
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device(f"cuda:{local_rank}")

# Wrap the model in DDP
model = MyModel().to(device)
model = DDP(model, device_ids=[local_rank])

# DistributedSampler splits the data across GPUs
sampler = DistributedSampler(dataset)
dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)

# Training loop: same as single GPU; DDP handles gradient sync
for epoch in range(50):
    sampler.set_epoch(epoch)  # reshuffle shards each epoch
    for batch in dataloader:
        # Forward, backward, step: DDP synchronizes gradients automatically
        ...
```
What's happening:
- `dist.init_process_group(backend="nccl")` initialises distributed communication (Azure ML sets the required environment variables)
- Each process is assigned its own GPU via `LOCAL_RANK`
- `DistributedDataParallel` wraps the model and handles gradient averaging across GPUs
- `DistributedSampler` ensures each GPU gets a different portion of the dataset
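The gradient averaging that DDP performs can be verified with plain arithmetic: when the batch is split into equal shards, the average of the per-shard gradients equals the full-batch gradient. A minimal sketch using a hypothetical 1-D linear model:

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the 1-D linear model y = w * x."""
    n = len(xs)
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
w = 0.5

# Single-GPU view: gradient over the full batch
full = grad_mse(w, xs, ys)

# Data-parallel view: each "GPU" computes a gradient on its own shard,
# then the gradients are averaged (this is what the all-reduce does)
shard_grads = [grad_mse(w, xs[:2], ys[:2]), grad_mse(w, xs[2:], ys[2:])]
averaged = sum(shard_grads) / len(shard_grads)

assert abs(full - averaged) < 1e-12  # identical to the single-GPU gradient
```

This equality is why the training loop looks the same as single-GPU code: after the averaged gradients are in place, each replica takes the same optimizer step and the model copies stay in sync.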
Scenario: Dr. Luca trains on genomics data across 16 GPUs
GenomeVault has a variant-calling model that takes 3 days to train on a single A100 GPU. The dataset is 2TB of genomic sequences.
Dr. Luca's distributed training setup:
- 4 nodes, each with 4 A100 GPUs = 16 GPUs total
- Data parallelism: the model fits on one GPU but the dataset is massive
- DistributedSampler splits the 2TB across all 16 GPUs
- Training time: 3 days → ~5 hours (a ~14x speedup; not perfectly linear due to communication overhead)
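The quoted speedup is straightforward arithmetic: 3 days is 72 hours, and 72 / 5 ≈ 14.4x on 16 GPUs, i.e. about 90% parallel efficiency. A tiny helper makes the calculation explicit (the function name is illustrative):

```python
def scaling_efficiency(single_gpu_hours, distributed_hours, n_gpus):
    """Speedup and parallel efficiency of a distributed run."""
    speedup = single_gpu_hours / distributed_hours
    efficiency = speedup / n_gpus  # 1.0 would be perfect linear scaling
    return speedup, efficiency

speedup, eff = scaling_efficiency(single_gpu_hours=72,
                                  distributed_hours=5,
                                  n_gpus=16)
# speedup = 14.4, eff = 0.9 -- close to, but below, perfect linear scaling
```

The gap to 16x is the gradient-synchronization overhead noted in the table above, which is why efficiency drops further as node counts grow.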
Prof. Sarah Lin: "We can now iterate on model architectures weekly instead of monthly."
Distributed training with TensorFlow/Horovod
```python
distributed_job = command(
    code="./src",
    command="python train_tensorflow.py",
    environment="azureml:tensorflow-training:1",
    compute="gpu-training-cluster",
    distribution={
        "type": "TensorFlow",
        "worker_count": 8,  # 8 workers total
    },
    resources=ResourceConfiguration(
        instance_count=2,
    ),
)
```
For MPI-based frameworks (Horovod):
```python
distribution={
    "type": "Mpi",
    "process_count_per_instance": 4,
}
```
Exam tip: Distribution types
The exam expects you to know which distribution type to use:
- PyTorch: `"type": "PyTorch"` uses `torch.distributed` with the NCCL backend
- TensorFlow: `"type": "TensorFlow"` uses `tf.distribute.Strategy`
- MPI: `"type": "Mpi"` for Horovod or custom MPI-based frameworks
The most common exam scenario involves PyTorch data parallelism with `DistributedDataParallel`.
Key considerations for distributed training
| Factor | Impact | Recommendation |
|---|---|---|
| Batch size | Scale batch size with GPU count (linear scaling rule) | If single GPU uses batch_size=32, 8 GPUs use batch_size=256 |
| Learning rate | Scale with batch size or use warm-up | Linear scaling rule: 8x batch = 8x learning rate (with warm-up) |
| Communication | Gradient sync is the bottleneck | Use InfiniBand-enabled VMs (ND-series) for multi-node |
| Checkpointing | Save model regularly in case of failures | Essential for multi-hour jobs β checkpoint every N epochs |
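The batch-size and learning-rate rows combine into one common recipe: target a learning rate scaled by the number of GPUs, but ramp up to it linearly during warm-up. A minimal sketch of that schedule (the warm-up length and base rate are illustrative choices, not Azure ML defaults):

```python
def scaled_lr(step, base_lr=0.1, n_gpus=8, warmup_steps=500):
    """Linear scaling rule with warm-up.

    The target LR is base_lr * n_gpus; during the first warmup_steps the
    rate ramps linearly from base_lr up to the target, which avoids the
    instability a sudden 8x learning rate can cause early in training.
    """
    target = base_lr * n_gpus
    if step < warmup_steps:
        return base_lr + (target - base_lr) * step / warmup_steps
    return target

scaled_lr(0)    # starts at the single-GPU rate (0.1)
scaled_lr(500)  # reaches the scaled rate (0.8 = 8 x 0.1)
```

In a real training loop this would be wired into the optimizer once per step; the key point for the exam is the pairing, scale the batch size with GPU count and scale the learning rate with the batch size.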
Knowledge check
Dr. Luca has a model that fits on a single GPU but a 2TB dataset. Training takes 3 days on one GPU. He has a cluster with 4 nodes, each with 4 GPUs. What strategy should he use?
Kai is configuring a PyTorch distributed training job in Azure ML. He needs 2 nodes with 4 GPUs each. What distribution configuration should he use?
🎬 Video coming soon
Next up: Model Registration & Versioning: from experiment to production-ready artifact.