Compute Targets: Choosing the Right Engine
Training needs horsepower. Inference needs reliability. Learn to select, configure, and optimise Azure ML compute targets — from cheap dev instances to GPU clusters.
Why compute matters in MLOps
Think of compute like a car rental.
For daily errands (exploring data, writing code), you rent a small hatchback — cheap, always available. For moving day (training a large model), you rent a truck — expensive, but you only need it for a few hours. For a taxi service (serving predictions to users), you need a reliable sedan that’s always running.
Azure ML gives you different “vehicles” for each job. Picking the wrong one means either wasting money or waiting too long.
Compute types at a glance
| Feature | Scaling | GPU Support | Cost Model | Best For |
|---|---|---|---|---|
| Compute Instance | Single VM (no scaling) | Yes | Pay while running | Dev, notebooks, debugging |
| Compute Cluster | 0 to N nodes (auto-scale) | Yes | Pay per node-minute, scale to 0 | Training jobs, sweeps, pipelines |
| Serverless Compute | On-demand (fully managed) | Yes | Pay per job | Burst training, no infra management |
| Kubernetes (AKS) | Managed by AKS | Yes | AKS pricing | Hybrid, multi-cloud, existing clusters |
| Managed Endpoint | Auto-scale instances | Yes | Pay per instance-hour | Real-time inference |
Compute instances: your personal dev machine
A compute instance is a single managed VM for interactive development. Think of it as your cloud-based workstation.
# Create a compute instance
az ml compute create \
  --name kai-dev-vm \
  --type ComputeInstance \
  --size Standard_DS3_v2 \
  --resource-group rg-ml-dev \
  --workspace-name neuralspark-dev
Key facts:
- One user per instance — assigned to a specific identity
- Doesn’t auto-scale — it’s a single VM
- Schedule auto-shutdown to save costs (critical for exam scenarios)
- Can run Jupyter notebooks, VS Code, and terminal directly
Scenario: Kai's cost-saving compute schedule
NeuralSpark’s cloud bill spiked because data scientists left compute instances running overnight. Kai’s fix:
- Enable auto-shutdown at 7 PM for all dev instances
- Enable idle shutdown after 60 minutes of inactivity
- Use Standard_DS3_v2 (14 GB RAM, no GPU) for most dev work
- Create a separate GPU instance (Standard_NC6s_v3) only for model prototyping — with a 30-minute idle shutdown
Monthly savings: ~60% reduction in dev compute costs.
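The ~60% figure falls out of simple arithmetic. Here is a rough sketch — the hourly rate is a hypothetical placeholder, not real Azure pricing; check the Azure pricing page for actual SKU rates:

```python
# Rough arithmetic behind the ~60% saving. DS3_RATE is a hypothetical
# $/hour figure for illustration only -- real rates vary by region.
DS3_RATE = 0.30  # $/hour, assumed rate for Standard_DS3_v2

def monthly_cost(hours_per_day: float, rate: float, days: int = 30) -> float:
    """Cost of one compute instance running hours_per_day each day."""
    return hours_per_day * rate * days

always_on = monthly_cost(24, DS3_RATE)  # left running 24/7
scheduled = monthly_cost(9, DS3_RATE)   # ~9 working hours/day with auto-shutdown
saving = 1 - scheduled / always_on

print(f"always on: ${always_on:.2f}/month")
print(f"scheduled: ${scheduled:.2f}/month")
print(f"saving:    {saving:.0%}")
```

Idle shutdown adds further savings on top of this, since instances stop even during working hours when nobody is using them.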
Compute clusters: scaling for training
A compute cluster scales from 0 to N nodes based on demand. When no jobs are running, it scales to zero nodes — you pay nothing.
from azure.ai.ml.entities import AmlCompute

cluster = AmlCompute(
    name="gpu-training-cluster",
    type="amlcompute",
    size="Standard_NC24ads_A100_v4",
    min_instances=0,                  # Scale to zero when idle
    max_instances=4,                  # Up to 4 GPU nodes
    idle_time_before_scale_down=300,  # 5 minutes idle before removing a node
    tier="Dedicated",                 # Dedicated (vs low-priority/spot)
)
ml_client.compute.begin_create_or_update(cluster).result()
What’s happening:
- size="Standard_NC24ads_A100_v4" — A100 GPUs, serious training hardware
- min_instances=0 is the key cost control — no idle charges
- max_instances=4 — up to 4 nodes for distributed training or parallel sweeps
- idle_time_before_scale_down=300 — nodes stay alive 5 minutes after a job ends (avoids cold-start for the next job)
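The scaling behaviour can be sketched as a toy model — this illustrates what the parameters mean, not the real Azure ML scheduler:

```python
# Toy model of cluster autoscaling (an illustration, not Azure's scheduler):
# nodes grow with the job queue up to max_instances, and an idle node is
# released idle_time_before_scale_down seconds after its last job ends.

def nodes_needed(pending_jobs: int, max_instances: int) -> int:
    """One node per queued job, capped at max_instances; 0 when idle."""
    return min(pending_jobs, max_instances)

def scale_down_time(job_end: float, idle_time_before_scale_down: float) -> float:
    """Moment an idle node is removed after its last job finishes."""
    return job_end + idle_time_before_scale_down

print(nodes_needed(0, 4))            # empty queue: scale to zero, no charges
print(nodes_needed(10, 4))           # 10 queued trials: capped at max_instances
print(scale_down_time(1000.0, 300))  # node released 5 minutes after job end
```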
Dedicated vs low-priority compute
| | Dedicated | Low-Priority (Spot) |
|---|---|---|
| Cost | Full price | 60-80% discount |
| Availability | Guaranteed | Can be pre-empted anytime |
| Best for | Production training, time-sensitive jobs | Long-running experiments, hyperparameter sweeps |
| Risk | None | Job may be interrupted and needs retry logic |
Exam tip: When to use low-priority
The exam often tests cost optimisation scenarios. Key rule:
- Low-priority is safe for: hyperparameter sweeps (many short jobs, can restart), non-urgent exploration
- Low-priority is NOT safe for: production training on a deadline, single long-running job with no checkpointing
If the question mentions “budget constraints” and “hyperparameter tuning,” low-priority is usually correct.
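The retry logic that low-priority jobs need can be sketched as follows — the 5% per-step pre-emption chance, the trial structure, and the checkpointing scheme are all illustrative assumptions, not real Azure spot behaviour:

```python
import random

# Why low-priority compute needs retry logic: a trial may be pre-empted
# mid-run, so progress is checkpointed and the trial resumes where it
# left off instead of starting over.

def run_trial(checkpoint: int, steps: int, rng: random.Random):
    """Run one trial from its checkpoint; return (finished, new_checkpoint)."""
    for step in range(checkpoint, steps):
        if rng.random() < 0.05:   # simulated spot pre-emption
            return False, step    # interrupted: progress saved up to `step`
    return True, steps

def run_sweep(n_trials: int, steps: int = 20, seed: int = 0) -> int:
    """Run every trial to completion, resubmitting pre-empted ones."""
    rng = random.Random(seed)
    retries = 0
    for _ in range(n_trials):
        done, ckpt = False, 0
        while not done:
            done, ckpt = run_trial(ckpt, steps, rng)
            if not done:
                retries += 1      # resubmit from the saved checkpoint
    return retries

print(f"200 trials finished with {run_sweep(200)} pre-emption retries")
```

Without checkpointing, each pre-emption would restart a trial from scratch — which is why a single long-running job with no checkpoints is a poor fit for low-priority compute.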
Serverless compute
Serverless compute removes infrastructure management entirely. You submit a job, Azure provisions compute automatically, and charges per job.
# In a training job YAML
compute: serverless
resources:
  instance_type: Standard_NC24ads_A100_v4
  instance_count: 2
When to choose serverless over clusters:
- Burst workloads — unpredictable training demand
- No cluster management — don’t want to manage min/max nodes
- Quick experiments — no setup needed
When to choose clusters over serverless:
- Sustained workloads — cluster stays warm between jobs
- Fine-grained control — node images, identity, networking
- Cost predictability — known budget for allocated nodes
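One way to reason about this trade-off is utilisation: a cluster bills for every hour a node is allocated (busy or idle while warm), while serverless bills only for job runtime. A rough sketch, using a hypothetical node-hour rate:

```python
# Rough cluster-vs-serverless comparison. RATE is a hypothetical
# $/node-hour figure; real pricing differs by SKU and region.
RATE = 3.40  # assumed $/node-hour for a GPU SKU

def cluster_cost(allocated_node_hours: float, rate: float) -> float:
    """Pay for every hour a node is allocated, busy or idle."""
    return allocated_node_hours * rate

def serverless_cost(job_hours: float, rate: float) -> float:
    """Pay only for the hours jobs actually run."""
    return job_hours * rate

# 100 node-hours of actual training in a month:
print(serverless_cost(100, RATE))     # serverless: pay for the 100 hours
print(cluster_cost(100 / 0.8, RATE))  # busy cluster at 80% utilisation
print(cluster_cost(100 / 0.2, RATE))  # sparse cluster at 20% utilisation
```

The higher a cluster's utilisation, the closer its cost gets to serverless — which is why sustained workloads favour clusters and bursty ones favour serverless.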
Inference compute (managed endpoints)
Models in production need dedicated compute for serving predictions. Azure ML managed online endpoints handle this:
# Create a managed online endpoint
az ml online-endpoint create \
  --name churn-predictor-endpoint \
  --resource-group rg-ml-prod \
  --workspace-name neuralspark-prod
Inference compute is covered in detail in Module 12 (Deploying Models). For now, know that:
- Real-time endpoints use dedicated VMs that auto-scale based on request volume
- Batch endpoints spin up compute clusters to process large datasets
- You choose the VM size (SKU) for the endpoint deployment
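Choosing the SKU and instance count usually starts as a back-of-envelope sizing exercise. A sketch — the per-instance throughput is an assumption you would measure with a load test, not an Azure-published figure:

```python
import math

# Back-of-envelope endpoint sizing: how many instances of a SKU are
# needed for an expected request volume. rps_per_instance is a measured
# (here: assumed) throughput figure for your model on that SKU.

def instances_needed(expected_rps: float, rps_per_instance: float,
                     headroom: float = 0.7) -> int:
    """Keep each instance below `headroom` utilisation at expected load."""
    return max(1, math.ceil(expected_rps / (rps_per_instance * headroom)))

print(instances_needed(50, 25))  # 50 req/s, each instance handles ~25 req/s
print(instances_needed(5, 25))   # light traffic still needs one instance
```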
Knowledge check
NeuralSpark's cloud bill is too high. Kai discovers that data scientists are leaving compute instances running 24/7. What is the most effective fix?
Dr. Fatima is running a hyperparameter sweep with 200 trial combinations for a fraud detection model. The deadline is flexible — it just needs to finish this week. How should she configure compute to minimise cost?
Next up: Infrastructure as Code — deploying ML infrastructure with Bicep and GitHub Actions.