Configuring Compute for Performance
CPU vs memory-optimised nodes, autoscaling, auto-termination, instance pools, Photon acceleration, Databricks Runtime versions, and library management — the exam loves these details.
Configuring compute: why it matters
Choosing a compute type is like choosing a vehicle. Configuring it is like tuning the engine.
You decided on a rental car (job compute). Now you need to pick: how many seats (nodes), petrol or diesel (CPU or memory-optimised), cruise control (autoscaling), auto-park timer (auto-termination), and whether to supercharge it (Photon).
Get this wrong and you either waste money (cluster too big) or jobs fail (cluster too small). The exam tests whether you can match configuration to workload requirements.
Node types and cluster sizing
Every cluster has a driver node (coordinates the work) and worker nodes (do the actual processing).
| Decision | Options | When to Choose |
|---|---|---|
| CPU-optimised | High vCPU count, moderate RAM | ETL pipelines with lots of transformations, Spark shuffles |
| Memory-optimised | High RAM, moderate vCPU | Large joins, caching, aggregations on wide tables |
| GPU-enabled | GPU attached | Machine learning training, deep learning |
| Storage-optimised | High local SSD | Workloads that spill heavily to disk |
When Ravi processes DataPulse’s 500GB nightly ETL, he picks memory-optimised nodes because the pipeline does heavy joins across customer and transaction tables.
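As a concrete sketch, Ravi's choice might look like the cluster spec below, as sent to the Databricks Clusters API. The field names follow the public API, but the VM types and worker counts are illustrative assumptions, not a recommendation:

```python
# Sketch of a job-cluster spec (Databricks Clusters API field names).
# VM types and sizes are example assumptions for a memory-heavy ETL job.
cluster_spec = {
    "cluster_name": "nightly-etl",
    "spark_version": "15.4.x-scala2.12",       # example LTS runtime string
    "node_type_id": "Standard_E8s_v3",         # memory-optimised Azure VM (example)
    "driver_node_type_id": "Standard_E8s_v3",  # same size as workers: fine for ETL
    "autoscale": {"min_workers": 2, "max_workers": 8},
}
```

Note there is no idle timeout here: job clusters terminate on their own when the run finishes.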
Node count
- Min workers — the baseline. Set to 1+ for production workloads.
- Max workers — the ceiling for autoscaling.
- Driver node — typically same size or one tier larger than workers.
Exam tip: Driver vs. worker sizing
Common exam trap: the driver handles coordination, collects results, and runs non-distributed code. If your job uses collect() or returns large result sets to the driver, you may need a larger driver node than workers.
For most ETL workloads, driver and workers can be the same size. For ML workloads that aggregate results on the driver, size the driver up.
Autoscaling
Autoscaling adjusts the number of worker nodes based on workload demand:
- Optimised autoscaling (default for job clusters) — scales down aggressively to save cost
- Standard autoscaling (all-purpose clusters) — scales up and down based on pending tasks
| Setting | Recommendation |
|---|---|
| Min workers | Set to the baseline your workload always needs |
| Max workers | Set to handle peak load without over-provisioning |
| Scale-down time | Default is fine for most workloads |
When Mei Lin’s Freshmart data team runs ad-hoc queries during the day, autoscaling ramps up workers during peak hours and scales back to minimum at night.
Key exam fact: Autoscaling for job clusters (job compute) is optimised for batch — it scales down faster because the job has a finite end. Autoscaling for all-purpose clusters is more conservative because users are working interactively.
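The clamping behaviour between min and max workers is worth internalising. Here is a toy model of the decision (illustrative only — Databricks' real autoscaling algorithm is more sophisticated than this):

```python
def target_workers(pending_tasks, tasks_per_worker, min_workers, max_workers):
    """Toy autoscaling model: size the cluster to the pending work,
    clamped to the configured [min, max] range. Illustrative only."""
    needed = -(-pending_tasks // tasks_per_worker)  # ceiling division
    return max(min_workers, min(max_workers, needed))

# Peak hours: demand exceeds the ceiling, so we hit max workers.
peak = target_workers(pending_tasks=100, tasks_per_worker=8,
                      min_workers=2, max_workers=10)
# Overnight: no pending work, so we sit at min workers.
quiet = target_workers(pending_tasks=0, tasks_per_worker=8,
                       min_workers=2, max_workers=10)
```

This is exactly Mei Lin's pattern: workers ramp toward the max during the day and fall back to the floor at night.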
Auto-termination
Auto-termination shuts down a cluster after a period of inactivity:
- Default: 120 minutes (2 hours) for all-purpose clusters
- Job clusters: terminate immediately after the job completes (no idle timeout needed)
- Custom: set any idle timeout (10 minutes to 24 hours)
Exam scenario: “Ravi’s team forgets to stop their development cluster over the weekend, costing $800.” → The fix is enabling auto-termination with a 30-60 minute idle timeout.
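The arithmetic behind that scenario is worth being able to do quickly. A back-of-envelope sketch — all rates here are assumed example figures, not real Databricks or Azure prices:

```python
def idle_cost(idle_hours, dbus_per_hour, dbu_rate, vm_rate_per_hour):
    """Rough cost of an idle cluster: DBU charges plus cloud VM charges.
    All rates are assumed example figures, not published prices."""
    return idle_hours * (dbus_per_hour * dbu_rate + vm_rate_per_hour)

# A forgotten cluster idling over a ~64-hour weekend at assumed rates:
weekend = idle_cost(idle_hours=64, dbus_per_hour=10,
                    dbu_rate=0.5, vm_rate_per_hour=7.5)
# With a 30-minute idle timeout, the same mistake costs half an hour:
with_timeout = idle_cost(idle_hours=0.5, dbus_per_hour=10,
                         dbu_rate=0.5, vm_rate_per_hour=7.5)
```

Under these assumed rates the weekend comes to $800 and the timeout version to about $6 — which is why auto-termination is the standard fix.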
Instance pools (cluster pooling)
Instance pools are pre-allocated sets of idle VMs that clusters can draw from:
- Without pool: cluster requests VMs from Azure → 3-7 minute startup
- With pool: cluster grabs pre-warmed VMs → 30-60 second startup
| Pool Setting | What It Does |
|---|---|
| Min idle instances | VMs kept warm and ready (you pay the cloud VM cost for these, though idle pool instances don't accrue DBU charges) |
| Max capacity | Maximum VMs the pool can hold |
| Idle instance auto-termination | How long unused VMs stay in the pool |
| Instance type | Fixed VM size — all instances in a pool are the same type |
Dr. Sarah Okafor sets up a pool at Athena Group so her team’s development clusters start in under a minute instead of waiting 5+ minutes each time.
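Sarah's pool might be declared like this. The field names follow the Databricks Instance Pools API; the VM type and the numbers are example assumptions:

```python
# Sketch of an instance-pool spec (Instance Pools API field names).
# VM type and counts are illustrative, sized for a small dev team.
pool_spec = {
    "instance_pool_name": "dev-pool",
    "node_type_id": "Standard_DS3_v2",  # fixed VM size: one type per pool
    "min_idle_instances": 4,            # kept warm; cloud VM cost applies
    "max_capacity": 20,
    "idle_instance_autotermination_minutes": 60,
}
```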
Pools vs. autoscaling: how they work together
Pools and autoscaling are complementary:
- Pool = pre-allocated VMs ready to be assigned to any cluster
- Autoscaling = a specific cluster’s ability to add/remove workers
A cluster can use a pool AND autoscale. When it needs more workers, it grabs from the pool (fast). If the pool is empty, it falls back to requesting from Azure (slow).
Exam tip: Pools reduce startup latency, not compute cost. You still pay the cloud provider for idle instances sitting in the pool, although Databricks doesn't charge DBUs for pool instances while they are idle.
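The pool-first, cloud-fallback behaviour can be sketched as a toy model (illustrative only, not the real provisioning logic):

```python
def acquire_workers(requested, pool_idle):
    """Toy model of how a cluster sources workers: draw pre-warmed VMs
    from the pool first (fast), then fall back to requesting fresh VMs
    from the cloud provider (slow). Returns (from_pool, from_cloud)."""
    from_pool = min(requested, pool_idle)
    return from_pool, requested - from_pool

# Pool holds 4 idle VMs but the cluster needs 6: 2 come the slow way.
fast, slow = acquire_workers(requested=6, pool_idle=4)
```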
Photon acceleration
Photon is Databricks’ native vectorised query engine — a C++ replacement for parts of the Spark SQL engine:
- 2-8x faster for SQL and DataFrame workloads
- Especially effective for scans, joins, aggregations, and sorting
- Default on SQL warehouses and serverless compute
- Opt-in on job and all-purpose clusters (select a Photon-enabled runtime)
- Uses more DBUs (Photon DBU rate is higher than standard)
| With Photon | Without Photon |
|---|---|
| Faster queries (vectorised execution) | Standard Spark execution |
| Higher DBU rate | Standard DBU rate |
| Best for SQL-heavy workloads | Better for Python/ML workloads |
Exam pattern: If the question mentions “optimise query performance” or “SQL-heavy workload” — Photon is likely the answer.
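The cost trade-off behind "higher DBU rate but faster" is simple arithmetic: total DBU spend scales with duration times rate, so Photon lowers the bill whenever its speedup outruns the rate premium. A sketch with an assumed rate multiplier (not a published price):

```python
def photon_breaks_even(speedup, photon_rate_multiplier):
    """Photon bills DBUs at a higher rate but finishes sooner.
    Total DBU cost ~ duration x rate, so Photon is cheaper overall
    whenever speedup > rate multiplier. The multiplier is an assumed
    example figure, not a real Databricks price."""
    return speedup > photon_rate_multiplier

# A 3x speedup against an assumed 2x DBU premium: cheaper with Photon.
sql_heavy = photon_breaks_even(speedup=3.0, photon_rate_multiplier=2.0)
# A 1.5x speedup against the same premium: Photon costs more overall.
python_heavy = photon_breaks_even(speedup=1.5, photon_rate_multiplier=2.0)
```

This is also why the table above points Python/ML workloads away from Photon: they see less speedup, so the premium is harder to earn back.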
Databricks Runtime versions
The Databricks Runtime is the software image that runs on each node. Key versions:
| Runtime | Use Case |
|---|---|
| Databricks Runtime (standard) | General data engineering |
| Databricks Runtime ML | Machine learning — includes MLflow, PyTorch, TensorFlow, scikit-learn |
| Photon Runtime | SQL-heavy workloads (includes Photon engine) |
Each runtime version is tied to a Spark version (e.g., Runtime 15.x = Spark 3.5.x). You should:
- Use the latest LTS (Long Term Support) for production
- Match runtime across dev, staging, and production clusters
- Use ML Runtime only when you need ML libraries (it’s heavier)
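In cluster specs, the runtime choice surfaces as a spark_version string. The strings below are illustrative examples of the naming convention (exact values vary by release — check the workspace UI or API for what's available):

```python
# Example spark_version strings showing how runtime flavours are named.
# These are illustrative; confirm exact strings in your workspace.
STANDARD = "15.4.x-scala2.12"         # standard runtime
PHOTON = "15.4.x-photon-scala2.12"    # Photon-enabled runtime
ML_CPU = "15.4.x-cpu-ml-scala2.12"    # ML runtime (CPU variant)
```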
Installing libraries
Clusters need libraries (Python packages, JARs, etc.) for custom code:
| Library Scope | How It Works | Use Case |
|---|---|---|
| Cluster library | Installed on all nodes when cluster starts | Shared packages for the team |
| Notebook-scoped | %pip install in a notebook cell | Quick experiments, per-notebook deps |
| Workspace library | Uploaded to workspace, attached to clusters | Org-wide packages |
| Init scripts | Shell scripts that run on cluster startup | Complex setup (system packages, env vars) |
# Notebook-scoped installation (recommended for dev)
%pip install great-expectations ydata-profiling
# After install, restart the Python interpreter
dbutils.library.restartPython()
Exam tip: %pip install is notebook-scoped and doesn’t affect other users on the same cluster. For production, use cluster libraries or init scripts so dependencies are reproducible.
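For the reproducible production route, libraries can be declared on the cluster or job spec using the Libraries API format. The package pin and Maven coordinates below are illustrative assumptions:

```python
# Cluster libraries declared in a cluster/job spec (Libraries API format)
# so every run gets the same dependencies. Versions/coordinates are
# illustrative examples -- pin real versions in production.
libraries = [
    {"pypi": {"package": "great-expectations"}},          # pin a version in practice
    {"maven": {"coordinates": "com.example:connector:1.2.0"}},  # hypothetical JAR
]
```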
Knowledge check
Tomás runs a Spark Structured Streaming job at NovaPay that processes real-time transaction data. The workload is variable — quiet during nights, heavy during business hours. He wants to minimise cost without manual intervention. Which two settings should he configure?
Dr. Sarah Okafor's team at Athena Group complains that cluster startup takes 5-7 minutes each morning. She wants to reduce this to under a minute without keeping clusters running overnight. What should she configure?
Ravi wants to accelerate DataPulse's SQL-heavy ETL pipeline that performs many joins and aggregations. The pipeline currently runs on Databricks Runtime 15.4 standard. What is the MOST effective change?
Next up: Unity Catalog: The Three-Level Namespace — naming conventions, catalogs, schemas, and volumes.