Configuring Compute for Performance
CPU vs memory-optimised nodes, autoscaling, auto-termination, instance pools, Photon acceleration, Databricks Runtime versions, and library management — the exam loves these details.
Configuring compute: why it matters
Choosing a compute type is like choosing a vehicle. Configuring it is like tuning the engine.
You decided on a rental car (job compute). Now you need to pick: how many seats (nodes), petrol or diesel (CPU or memory-optimised), cruise control (autoscaling), auto-park timer (auto-termination), and whether to supercharge it (Photon).
Get this wrong and you either waste money (cluster too big) or jobs fail (cluster too small). The exam tests whether you can match configuration to workload requirements.
Node types and cluster sizing
Every cluster has a driver node (coordinates the work) and worker nodes (do the actual processing).
| Decision | Options | When to Choose |
|---|---|---|
| CPU-optimised | High vCPU count, moderate RAM | ETL pipelines with lots of transformations, Spark shuffles |
| Memory-optimised | High RAM, moderate vCPU | Large joins, caching, aggregations on wide tables |
| GPU-enabled | GPU attached | Machine learning training, deep learning |
| Storage-optimised | High local SSD | Workloads that spill heavily to disk |
When Ravi processes DataPulse’s 500GB nightly ETL, he picks memory-optimised nodes because the pipeline does heavy joins across customer and transaction tables.
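As a concrete sketch, Ravi's choice might look like the cluster spec below, as sent to the Databricks Clusters API. The field names follow the public API, but the VM types and worker counts are illustrative assumptions, not a recommendation:

```python
# Sketch of a job-cluster spec (Databricks Clusters API field names).
# VM types and sizes are example assumptions for a memory-heavy ETL job.
cluster_spec = {
    "cluster_name": "nightly-etl",
    "spark_version": "15.4.x-scala2.12",       # example LTS runtime string
    "node_type_id": "Standard_E8s_v3",         # memory-optimised Azure VM (example)
    "driver_node_type_id": "Standard_E8s_v3",  # same size as workers: fine for ETL
    "autoscale": {"min_workers": 2, "max_workers": 8},
}
```

Note there is no idle timeout here: job clusters terminate on their own when the run finishes.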
Node count
- Min workers — the baseline. Set to 1+ for production workloads.
- Max workers — the ceiling for autoscaling.
- Driver node — typically same size or one tier larger than workers.
Exam tip: Driver vs. worker sizing
Common exam trap: the driver handles coordination, collects results, and runs non-distributed code. If your job uses collect() or returns large result sets to the driver, you may need a larger driver node than workers.
For most ETL workloads, driver and workers can be the same size. For ML workloads that aggregate results on the driver, size the driver up.
Autoscaling
Autoscaling adjusts the number of worker nodes based on workload demand:
- Optimised autoscaling (default for job clusters) — scales down aggressively to save cost
- Standard autoscaling (all-purpose clusters) — scales up and down based on pending tasks
| Setting | Recommendation |
|---|---|
| Min workers | Set to the baseline your workload always needs |
| Max workers | Set to handle peak load without over-provisioning |
| Scale-down time | Default is fine for most workloads |
When Mei Lin’s Freshmart data team runs ad-hoc queries during the day, autoscaling ramps up workers during peak hours and scales back to minimum at night.
Key exam fact: Autoscaling for job clusters (job compute) is optimised for batch — it scales down faster because the job has a finite end. Autoscaling for all-purpose clusters is more conservative because users are working interactively.
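The clamping behaviour between min and max workers is worth internalising. Here is a toy model of the decision (illustrative only — Databricks' real autoscaling algorithm is more sophisticated than this):

```python
def target_workers(pending_tasks, tasks_per_worker, min_workers, max_workers):
    """Toy autoscaling model: size the cluster to the pending work,
    clamped to the configured [min, max] range. Illustrative only."""
    needed = -(-pending_tasks // tasks_per_worker)  # ceiling division
    return max(min_workers, min(max_workers, needed))

# Peak hours: demand exceeds the ceiling, so we hit max workers.
peak = target_workers(pending_tasks=100, tasks_per_worker=8,
                      min_workers=2, max_workers=10)
# Overnight: no pending work, so we sit at min workers.
quiet = target_workers(pending_tasks=0, tasks_per_worker=8,
                       min_workers=2, max_workers=10)
```

This is exactly Mei Lin's pattern: workers ramp toward the max during the day and fall back to the floor at night.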
Auto-termination
Auto-termination shuts down a cluster after a period of inactivity:
- Default: 120 minutes (2 hours) for all-purpose clusters
- Job clusters: terminate immediately after the job completes (no idle timeout needed)
- Custom: set any idle timeout (10 minutes to 24 hours)
Exam scenario: “Ravi’s team forgets to stop their development cluster over the weekend, costing $800.” → The fix is enabling auto-termination with a 30-60 minute idle timeout.
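The arithmetic behind that scenario is worth being able to do quickly. A back-of-envelope sketch — all rates here are assumed example figures, not real Databricks or Azure prices:

```python
def idle_cost(idle_hours, dbus_per_hour, dbu_rate, vm_rate_per_hour):
    """Rough cost of an idle cluster: DBU charges plus cloud VM charges.
    All rates are assumed example figures, not published prices."""
    return idle_hours * (dbus_per_hour * dbu_rate + vm_rate_per_hour)

# A forgotten cluster idling over a ~64-hour weekend at assumed rates:
weekend = idle_cost(idle_hours=64, dbus_per_hour=10,
                    dbu_rate=0.5, vm_rate_per_hour=7.5)
# With a 30-minute idle timeout, the same mistake costs half an hour:
with_timeout = idle_cost(idle_hours=0.5, dbus_per_hour=10,
                         dbu_rate=0.5, vm_rate_per_hour=7.5)
```

Under these assumed rates the weekend comes to $800 and the timeout version to about $6 — which is why auto-termination is the standard fix.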
Instance pools (cluster pooling)
Instance pools are pre-allocated sets of idle VMs that clusters can draw from:
- Without pool: cluster requests VMs from Azure → 3-7 minute startup
- With pool: cluster grabs pre-warmed VMs → 30-60 second startup
| Pool Setting | What It Does |
|---|---|
| Min idle instances | VMs kept warm and ready (you pay the cloud VM cost for these, though idle pool instances don't accrue DBU charges) |
| Max capacity | Maximum VMs the pool can hold |
| Idle instance auto-termination | How long unused VMs stay in the pool |
| Instance type | Fixed VM size — all instances in a pool are the same type |
Dr. Sarah Okafor sets up a pool at Athena Group so her team’s development clusters start in under a minute instead of waiting 5+ minutes each time.
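Sarah's pool might be declared like this. The field names follow the Databricks Instance Pools API; the VM type and the numbers are example assumptions:

```python
# Sketch of an instance-pool spec (Instance Pools API field names).
# VM type and counts are illustrative, sized for a small dev team.
pool_spec = {
    "instance_pool_name": "dev-pool",
    "node_type_id": "Standard_DS3_v2",  # fixed VM size: one type per pool
    "min_idle_instances": 4,            # kept warm; cloud VM cost applies
    "max_capacity": 20,
    "idle_instance_autotermination_minutes": 60,
}
```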
Pools vs. autoscaling: how they work together
Pools and autoscaling are complementary:
- Pool = pre-allocated VMs ready to be assigned to any cluster
- Autoscaling = a specific cluster’s ability to add/remove workers
A cluster can use a pool AND autoscale. When it needs more workers, it grabs from the pool (fast). If the pool is empty, it falls back to requesting from Azure (slow).
Exam tip: Pools reduce startup latency, not compute cost. You still pay the cloud provider for idle instances sitting in the pool, although Databricks doesn't charge DBUs for pool instances while they are idle.
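The pool-first, cloud-fallback behaviour can be sketched as a toy model (illustrative only, not the real provisioning logic):

```python
def acquire_workers(requested, pool_idle):
    """Toy model of how a cluster sources workers: draw pre-warmed VMs
    from the pool first (fast), then fall back to requesting fresh VMs
    from the cloud provider (slow). Returns (from_pool, from_cloud)."""
    from_pool = min(requested, pool_idle)
    return from_pool, requested - from_pool

# Pool holds 4 idle VMs but the cluster needs 6: 2 come the slow way.
fast, slow = acquire_workers(requested=6, pool_idle=4)
```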
Photon acceleration
Photon is Databricks’ native vectorised query engine — a C++ replacement for parts of the Spark SQL engine:
- 2-8x faster for SQL and DataFrame workloads
- Especially effective for scans, joins, aggregations, and sorting
- Default on SQL warehouses and serverless compute
- Opt-in on job and all-purpose clusters (select a Photon-enabled runtime)
- Uses more DBUs (Photon DBU rate is higher than standard)
| With Photon | Without Photon |
|---|---|
| Faster queries (vectorised execution) | Standard Spark execution |
| Higher DBU rate | Standard DBU rate |
| Best for SQL-heavy workloads | Better for Python/ML workloads |
Exam pattern: If the question mentions “optimise query performance” or “SQL-heavy workload” — Photon is likely the answer.
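The cost trade-off behind "higher DBU rate but faster" is simple arithmetic: total DBU spend scales with duration times rate, so Photon lowers the bill whenever its speedup outruns the rate premium. A sketch with an assumed rate multiplier (not a published price):

```python
def photon_breaks_even(speedup, photon_rate_multiplier):
    """Photon bills DBUs at a higher rate but finishes sooner.
    Total DBU cost ~ duration x rate, so Photon is cheaper overall
    whenever speedup > rate multiplier. The multiplier is an assumed
    example figure, not a real Databricks price."""
    return speedup > photon_rate_multiplier

# A 3x speedup against an assumed 2x DBU premium: cheaper with Photon.
sql_heavy = photon_breaks_even(speedup=3.0, photon_rate_multiplier=2.0)
# A 1.5x speedup against the same premium: Photon costs more overall.
python_heavy = photon_breaks_even(speedup=1.5, photon_rate_multiplier=2.0)
```

This is also why the table above points Python/ML workloads away from Photon: they see less speedup, so the premium is harder to earn back.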
Databricks Runtime versions
The Databricks Runtime is the software image that runs on each node. Key versions:
| Runtime | Use Case |
|---|---|
| Databricks Runtime (standard) | General data engineering |
| Databricks Runtime ML | Machine learning — includes MLflow, PyTorch, TensorFlow, scikit-learn |
| Photon Runtime | SQL-heavy workloads (includes Photon engine) |
Each runtime version is tied to a Spark version (e.g., Runtime 15.x = Spark 3.5.x). You should:
- Use the latest LTS (Long Term Support) for production
- Match runtime across dev, staging, and production clusters
- Use ML Runtime only when you need ML libraries (it’s heavier)
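In cluster specs, the runtime choice surfaces as a spark_version string. The strings below are illustrative examples of the naming convention (exact values vary by release — check the workspace UI or API for what's available):

```python
# Example spark_version strings showing how runtime flavours are named.
# These are illustrative; confirm exact strings in your workspace.
STANDARD = "15.4.x-scala2.12"         # standard runtime
PHOTON = "15.4.x-photon-scala2.12"    # Photon-enabled runtime
ML_CPU = "15.4.x-cpu-ml-scala2.12"    # ML runtime (CPU variant)
```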
Installing libraries
Clusters need libraries (Python packages, JARs, etc.) for custom code:
| Library Scope | How It Works | Use Case |
|---|---|---|
| Cluster library | Installed on all nodes when cluster starts | Shared packages for the team |
| Notebook-scoped | %pip install in a notebook cell | Quick experiments, per-notebook deps |
| Workspace library | Uploaded to workspace, attached to clusters | Org-wide packages |
| Init scripts | Shell scripts that run on cluster startup | Complex setup (system packages, env vars) |
# Notebook-scoped installation (recommended for dev)
%pip install great-expectations ydata-profiling
# After install, restart the Python interpreter
dbutils.library.restartPython()
Exam tip: %pip install is notebook-scoped and doesn’t affect other users on the same cluster. For production, use cluster libraries or init scripts so dependencies are reproducible.
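For the reproducible production route, libraries can be declared on the cluster or job spec using the Libraries API format. The package pin and Maven coordinates below are illustrative assumptions:

```python
# Cluster libraries declared in a cluster/job spec (Libraries API format)
# so every run gets the same dependencies. Versions/coordinates are
# illustrative examples -- pin real versions in production.
libraries = [
    {"pypi": {"package": "great-expectations"}},          # pin a version in practice
    {"maven": {"coordinates": "com.example:connector:1.2.0"}},  # hypothetical JAR
]
```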
Knowledge check
Tomás runs a Spark Structured Streaming job at NovaPay that processes real-time transaction data. The workload is variable — quiet during nights, heavy during business hours. He wants to minimise cost without manual intervention. Which two settings should he configure?
Dr. Sarah Okafor's team at Athena Group complains that cluster startup takes 5-7 minutes each morning. She wants to reduce this to under a minute without keeping clusters running overnight. What should she configure?
Ravi wants to accelerate DataPulse's SQL-heavy ETL pipeline that performs many joins and aggregations. The pipeline currently runs on Databricks Runtime 15.4 standard. What is the MOST effective change?
Next up: Unity Catalog: The Three-Level Namespace — naming conventions, catalogs, schemas, and volumes.