
DP-750 Study Guide

Domain 1: Set Up and Configure an Azure Databricks Environment

  • Azure Databricks: Your Lakehouse Platform Free
  • Choosing the Right Compute Free
  • Configuring Compute for Performance Free
  • Unity Catalog: The Three-Level Namespace Free
  • Tables, Views & External Catalogs Free

Domain 2: Secure and Govern Unity Catalog Objects

  • Securing Unity Catalog: Who Gets What
  • Secrets & Authentication
  • Data Discovery & Attribute-Based Access
  • Row Filters, Column Masks & Retention
  • Lineage, Audit Logs & Delta Sharing

Domain 3: Prepare and Process Data

  • Data Modeling: Ingestion Design Free
  • SCD, Granularity & Temporal Tables
  • Partitioning, Clustering & Table Optimization
  • Ingesting Data: Lakeflow Connect & Notebooks
  • Ingesting Data: SQL Methods & CDC
  • Streaming Ingestion: Structured Streaming & Event Hubs
  • Auto Loader & Declarative Pipelines
  • Cleansing & Profiling Data Free
  • Transforming & Loading Data
  • Data Quality & Schema Enforcement

Domain 4: Deploy and Maintain Data Pipelines and Workloads

  • Building Data Pipelines Free
  • Lakeflow Jobs: Create & Configure
  • Lakeflow Jobs: Schedule, Alerts & Recovery
  • Git & Version Control
  • Testing & Databricks Asset Bundles
  • Monitoring Clusters & Troubleshooting
  • Spark Performance: DAG & Query Profile
  • Optimizing Delta Tables & Azure Monitor

Domain 1: Set Up and Configure an Azure Databricks Environment
Free · ⏱ ~15 min read

Configuring Compute for Performance

CPU vs memory-optimised nodes, autoscaling, auto-termination, instance pools, Photon acceleration, Databricks Runtime versions, and library management — the exam loves these details.

Configuring compute: why it matters

☕ Simple explanation

Choosing a compute type is like choosing a vehicle. Configuring it is like tuning the engine.

You decided on a rental car (job compute). Now you need to pick: how many seats (nodes), petrol or diesel (CPU or memory-optimised), cruise control (autoscaling), auto-park timer (auto-termination), and whether to supercharge it (Photon).

Get this wrong and you either waste money (cluster too big) or jobs fail (cluster too small). The exam tests whether you can match configuration to workload requirements.

Compute configuration in Azure Databricks involves selecting the node type (VM size), cluster topology (driver + workers), autoscaling policy, termination settings, Databricks Runtime version, and optional accelerators like Photon. Each setting directly impacts cost, performance, and reliability.

You also manage libraries (Python packages, JARs) and can use instance pools to pre-allocate VMs for faster cluster startup. The exam expects you to recommend configurations for specific workload patterns.

Node types and cluster sizing

Every cluster has a driver node (coordinates the work) and worker nodes (do the actual processing).

| Decision | Options | When to Choose |
| --- | --- | --- |
| CPU-optimised | High vCPU count, moderate RAM | ETL pipelines with lots of transformations, Spark shuffles |
| Memory-optimised | High RAM, moderate vCPU | Large joins, caching, aggregations on wide tables |
| GPU-enabled | GPU attached | Machine learning training, deep learning |
| Storage-optimised | High local SSD | Workloads that spill heavily to disk |

When Ravi processes DataPulse’s 500GB nightly ETL, he picks memory-optimised nodes because the pipeline does heavy joins across customer and transaction tables.

Node count

  • Min workers — the baseline. Set to 1+ for production workloads.
  • Max workers — the ceiling for autoscaling.
  • Driver node — typically same size or one tier larger than workers.
💡 Exam tip: Driver vs. worker sizing

Common exam trap: the driver handles coordination, collects results, and runs non-distributed code. If your job uses collect() or returns large result sets to the driver, you may need a larger driver node than workers.

For most ETL workloads, driver and workers can be the same size. For ML workloads that aggregate results on the driver, size the driver up.
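
If you create clusters through automation, these sizing decisions surface as fields in the cluster definition. A minimal sketch in the shape of a Databricks Clusters API payload; the VM sizes are illustrative Azure examples, not recommendations for any particular workload:

```python
# Illustrative cluster spec in the shape accepted by the Databricks
# Clusters API. Field names follow the REST API; the node_type_id
# values are example Azure VM sizes -- check what is available in
# your region before using them.
cluster_spec = {
    "cluster_name": "nightly-etl",
    "spark_version": "15.4.x-scala2.12",         # an LTS runtime string
    "node_type_id": "Standard_E8ds_v4",          # memory-optimised workers
    "driver_node_type_id": "Standard_E16ds_v4",  # driver sized one tier up
    "num_workers": 4,
}

# Sizing the driver up (as above) only matters when results are pulled
# back to it, e.g. via collect(); for plain ETL, omitting
# driver_node_type_id gives a driver the same size as the workers.
```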

Autoscaling

Autoscaling adjusts the number of worker nodes based on workload demand:

  • Optimised autoscaling (default for job clusters) — scales down aggressively to save cost
  • Standard autoscaling (all-purpose clusters) — scales up and down based on pending tasks

| Setting | Recommendation |
| --- | --- |
| Min workers | Set to the baseline your workload always needs |
| Max workers | Set to handle peak load without over-provisioning |
| Scale-down time | Default is fine for most workloads |

When Mei Lin’s Freshmart data team runs ad-hoc queries during the day, autoscaling ramps up workers during peak hours and scales back to minimum at night.

Key exam fact: Autoscaling for job clusters (job compute) is optimised for batch — it scales down faster because the job has a finite end. Autoscaling for all-purpose clusters is more conservative because users are working interactively.

Auto-termination

Auto-termination shuts down a cluster after a period of inactivity:

  • Default: 120 minutes (2 hours) for all-purpose clusters
  • Job clusters: terminate immediately after the job completes (no idle timeout needed)
  • Custom: set any idle timeout (10 minutes to 24 hours)

Exam scenario: “Ravi’s team forgets to stop their development cluster over the weekend, costing $800.” → The fix is enabling auto-termination with a 30-60 minute idle timeout.
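
Both settings above map to two fields in the cluster definition. A sketch using Databricks Clusters API field names (values are illustrative):

```python
# Autoscaling and auto-termination as Clusters API fields.
# Field names follow the REST API; values are illustrative.
cluster_spec = {
    "cluster_name": "adhoc-analytics",
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "Standard_E8ds_v4",  # example Azure VM size
    # An "autoscale" block replaces a fixed "num_workers":
    "autoscale": {"min_workers": 2, "max_workers": 8},
    # Shut down after 45 idle minutes. Job clusters terminate as soon
    # as the job completes, so they don't need this setting.
    "autotermination_minutes": 45,
}
```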

Instance pools (cluster pooling)

Instance pools are pre-allocated sets of idle VMs that clusters can draw from:

  • Without pool: cluster requests VMs from Azure → 3-7 minute startup
  • With pool: cluster grabs pre-warmed VMs → 30-60 second startup

| Pool Setting | What It Does |
| --- | --- |
| Min idle instances | VMs kept warm and ready (you pay for these) |
| Max capacity | Maximum VMs the pool can hold |
| Idle instance auto-termination | How long unused VMs stay in the pool |
| Instance type | Fixed VM size — all instances in a pool are the same type |

Dr. Sarah Okafor sets up a pool at Athena Group so her team’s development clusters start in under a minute instead of waiting 5+ minutes each time.

ℹ️ Pools vs. autoscaling: how they work together

Pools and autoscaling are complementary:

  • Pool = pre-allocated VMs ready to be assigned to any cluster
  • Autoscaling = a specific cluster’s ability to add/remove workers

A cluster can use a pool AND autoscale. When it needs more workers, it grabs from the pool (fast). If the pool is empty, it falls back to requesting from Azure (slow).

Exam tip: Pools reduce startup latency, not compute cost. You still pay for idle instances sitting in the pool.
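
The pool settings above, plus a cluster that draws from the pool, look roughly like this. Field names follow the Databricks Instance Pools and Clusters APIs; the pool id and values are placeholders:

```python
# Sketch of an instance pool definition (Instance Pools API shape;
# values are illustrative).
pool_spec = {
    "instance_pool_name": "dev-pool",
    "node_type_id": "Standard_D8ds_v4",  # fixed: one VM size per pool
    "min_idle_instances": 2,             # kept warm -- you pay for these
    "max_capacity": 10,
    "idle_instance_autotermination_minutes": 30,
}

# A cluster references the pool by id (returned when the pool is
# created; the id here is a placeholder). Note there is no
# node_type_id on the cluster -- it inherits the pool's VM size.
cluster_spec = {
    "cluster_name": "dev-cluster",
    "spark_version": "15.4.x-scala2.12",
    "instance_pool_id": "pool-placeholder-id",
    "autoscale": {"min_workers": 1, "max_workers": 4},
}
```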

Photon acceleration

Photon is Databricks’ native vectorised query engine — a C++ replacement for parts of the Spark SQL engine:

  • 2-8x faster for SQL and DataFrame workloads
  • Especially effective for scans, joins, aggregations, and sorting
  • Default on SQL warehouses and serverless compute
  • Opt-in on job and all-purpose clusters (select a Photon-enabled runtime)
  • Uses more DBUs (Photon DBU rate is higher than standard)

| With Photon | Without Photon |
| --- | --- |
| Faster queries (vectorised execution) | Standard Spark execution |
| Higher DBU rate | Standard DBU rate |
| Best for SQL-heavy workloads | Better for Python/ML workloads |

Exam pattern: If the question mentions “optimise query performance” or “SQL-heavy workload” — Photon is likely the answer.
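
On job and all-purpose clusters, opting in to Photon is a single field in the cluster definition. A sketch using Clusters API field names (values illustrative); SQL warehouses and serverless compute have Photon on by default and need no such setting:

```python
# Opting a job/all-purpose cluster in to Photon via the Clusters API
# runtime_engine field ("PHOTON" vs the default "STANDARD").
# Other values are illustrative.
cluster_spec = {
    "cluster_name": "sql-etl",
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "Standard_E8ds_v4",
    "num_workers": 4,
    "runtime_engine": "PHOTON",  # vectorised engine; higher DBU rate
}
```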

Databricks Runtime versions

The Databricks Runtime is the software image that runs on each node. Key versions:

| Runtime | Use Case |
| --- | --- |
| Databricks Runtime (standard) | General data engineering |
| Databricks Runtime ML | Machine learning — includes MLflow, PyTorch, TensorFlow, scikit-learn |
| Photon Runtime | SQL-heavy workloads (includes Photon engine) |

Each runtime version is tied to a Spark version (e.g., Runtime 15.x = Spark 3.5.x). You should:

  • Use the latest LTS (Long Term Support) for production
  • Match runtime across dev, staging, and production clusters
  • Use ML Runtime only when you need ML libraries (it’s heavier)
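
In cluster definitions, the runtime choice is just the `spark_version` string. These illustrate the string format (verify the exact versions offered in your workspace; availability changes over time):

```python
# Runtime selection is the spark_version string in the cluster spec.
# Format is illustrative of real version strings; confirm available
# versions in your own workspace.
standard_lts = "15.4.x-scala2.12"         # standard runtime, LTS line
ml_runtime   = "15.4.x-cpu-ml-scala2.12"  # ML runtime (MLflow, PyTorch, ...)
gpu_ml       = "15.4.x-gpu-ml-scala2.12"  # ML runtime with GPU libraries
```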
Question

What is Photon in Azure Databricks?


Answer

Photon is Databricks' native vectorised query engine (written in C++). It accelerates SQL and DataFrame workloads 2-8x but uses a higher DBU rate. It's default on SQL warehouses and opt-in for job/all-purpose clusters.


Question

What is an instance pool in Databricks?


Answer

A pre-allocated set of idle VMs that clusters can draw from for faster startup (30-60 seconds vs. 3-7 minutes). You pay for idle instances in the pool. Pools reduce startup latency, not compute cost.


Question

What is the difference between autoscaling on job clusters vs. all-purpose clusters?


Answer

Job cluster autoscaling is optimised for batch — scales down aggressively because jobs have a finite end. All-purpose cluster autoscaling is more conservative because users are working interactively.


Installing libraries

Clusters need libraries (Python packages, JARs, etc.) for custom code:

| Library Scope | How It Works | Use Case |
| --- | --- | --- |
| Cluster library | Installed on all nodes when cluster starts | Shared packages for the team |
| Notebook-scoped | `%pip install` in a notebook cell | Quick experiments, per-notebook deps |
| Workspace library | Uploaded to workspace, attached to clusters | Org-wide packages |
| Init scripts | Shell scripts that run on cluster startup | Complex setup (system packages, env vars) |

```python
# Notebook-scoped installation (recommended for dev).
# ydata-profiling is the current name of the deprecated pandas-profiling.
%pip install great-expectations ydata-profiling

# After install, restart the Python interpreter
dbutils.library.restartPython()
```

Exam tip: %pip install is notebook-scoped and doesn’t affect other users on the same cluster. For production, use cluster libraries or init scripts so dependencies are reproducible.
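
For production, the cluster-library route means declaring dependencies in the cluster or job definition. A sketch in the shape of the Databricks Libraries API; the version pin, Maven coordinates, and JAR path are placeholders, not real packages:

```python
# Library list in the shape accepted by the Databricks Libraries API
# (and by job/cluster definitions). All package names, versions,
# coordinates, and paths below are illustrative placeholders.
libraries = [
    {"pypi": {"package": "great-expectations==0.18.0"}},       # pin: assumption
    {"maven": {"coordinates": "com.example:custom-lib:1.0"}},  # hypothetical JAR
    {"jar": "dbfs:/libs/internal-utils.jar"},                  # placeholder path
]
```

Pinning versions here, rather than relying on ad-hoc `%pip` installs, is what makes the environment reproducible across cluster restarts.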

Question

What are the four ways to install libraries on a Databricks cluster?

Click or press Enter to reveal answer

Answer

1) Cluster library (installed on all nodes at startup), 2) Notebook-scoped (%pip install, per-notebook), 3) Workspace library (uploaded, attached to clusters), 4) Init scripts (shell scripts on startup).



Knowledge check

1. Tomás runs a Spark Structured Streaming job at NovaPay that processes real-time transaction data. The workload is variable — quiet during nights, heavy during business hours. He wants to minimise cost without manual intervention. Which two settings should he configure?

2. Dr. Sarah Okafor's team at Athena Group complains that cluster startup takes 5-7 minutes each morning. She wants to reduce this to under a minute without keeping clusters running overnight. What should she configure?

3. Ravi wants to accelerate DataPulse's SQL-heavy ETL pipeline that performs many joins and aggregations. The pipeline currently runs on Databricks Runtime 15.4 standard. What is the MOST effective change?


Next up: Unity Catalog: The Three-Level Namespace — naming conventions, catalogs, schemas, and volumes.



© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.