
DP-750 Study Guide

Domain 1: Set Up and Configure an Azure Databricks Environment

  • Azure Databricks: Your Lakehouse Platform Free
  • Choosing the Right Compute Free
  • Configuring Compute for Performance Free
  • Unity Catalog: The Three-Level Namespace Free
  • Tables, Views & External Catalogs Free

Domain 2: Secure and Govern Unity Catalog Objects

  • Securing Unity Catalog: Who Gets What
  • Secrets & Authentication
  • Data Discovery & Attribute-Based Access
  • Row Filters, Column Masks & Retention
  • Lineage, Audit Logs & Delta Sharing

Domain 3: Prepare and Process Data

  • Data Modeling: Ingestion Design Free
  • SCD, Granularity & Temporal Tables
  • Partitioning, Clustering & Table Optimization
  • Ingesting Data: Lakeflow Connect & Notebooks
  • Ingesting Data: SQL Methods & CDC
  • Streaming Ingestion: Structured Streaming & Event Hubs
  • Auto Loader & Declarative Pipelines
  • Cleansing & Profiling Data Free
  • Transforming & Loading Data
  • Data Quality & Schema Enforcement

Domain 4: Deploy and Maintain Data Pipelines and Workloads

  • Building Data Pipelines Free
  • Lakeflow Jobs: Create & Configure
  • Lakeflow Jobs: Schedule, Alerts & Recovery
  • Git & Version Control
  • Testing & Databricks Asset Bundles
  • Monitoring Clusters & Troubleshooting
  • Spark Performance: DAG & Query Profile
  • Optimizing Delta Tables & Azure Monitor

Domain 4: Deploy and Maintain Data Pipelines and Workloads (Premium, ~14 min read)

Spark Performance: DAG & Query Profile

Investigate and resolve caching, data skew, spilling, and shuffle issues using the DAG visualisation, Spark UI, and query profile.

Understanding Spark execution

☕ Simple explanation

The Spark UI is like an X-ray for your data pipeline.

When a query is slow, you need to see inside: Where is the bottleneck? Is one machine doing all the work while others sit idle (skew)? Is data overflowing from memory to disk (spilling)? Is too much data being sent between machines (shuffle)?

Three diagnostic tools help: the DAG (visual pipeline), the Spark UI (detailed metrics), and the Query Profile (SQL-specific execution plan).

Spark executes queries as a DAG (Directed Acyclic Graph) of stages and tasks. The Spark UI provides metrics for each stage (duration, shuffle read/write, spill, skew). The Query Profile (SQL warehouses) visualises the physical execution plan with operator-level statistics. Understanding these tools is essential for diagnosing performance issues.

The four performance villains

1. Data skew

Problem: One partition has far more data than others. One task takes 10 minutes while 199 tasks finish in 10 seconds.

Diagnosis (Spark UI):

  • Stages tab → look for one task with much higher duration/data processed
  • Task metrics show uneven distribution

Fixes:

# Salting: add a random prefix to the skewed key
from pyspark.sql.functions import col, concat, lit, floor, rand

df_salted = df.withColumn("salted_key",
    concat(col("customer_id"), lit("_"), floor(rand() * 10)))

# Or use Adaptive Query Execution (AQE) — enabled by default
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
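To see why salting helps, here is a small pure-Python sketch (not Spark code; the key names and salt count are illustrative) showing how a random suffix splits one hot key into several smaller groups:

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the illustration is deterministic

# 10,000 rows, 90% of which share one hot key -- a classic skew scenario
rows = ["hot"] * 9_000 + [f"cust_{i}" for i in range(1_000)]

# Salting: append a random 0-9 suffix, turning one giant key
# into ten smaller ones that can land on different partitions
salted = [f"{key}#{random.randint(0, 9)}" for key in rows]

plain_counts = Counter(rows)
salted_counts = Counter(salted)

print(plain_counts["hot"])          # 9000 rows behind a single key
print(max(salted_counts.values()))  # the largest salted group is roughly 10x smaller

# The original key is recoverable by stripping the salt before aggregating
assert salted[0].rsplit("#", 1)[0] == "hot"
```

After salting you aggregate per salted key first, then strip the salt and aggregate again, so no single task ever has to process the whole hot key.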

2. Spilling

Problem: Data doesn’t fit in memory and overflows to disk. Disk I/O is 10-100x slower than memory.

Diagnosis (Spark UI):

  • Stages tab → “Spill (Memory)” and “Spill (Disk)” columns
  • Any non-zero spill indicates memory pressure

Fixes:

  • Increase executor memory
  • Increase the number of partitions (each task holds less data in memory)
  • Reduce data being processed (push filters earlier)
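The second fix is easiest to reason about with rough arithmetic: average partition size is total data divided by partition count, so more partitions means less data per task. A quick sketch (illustrative numbers only; real partition sizes vary with skew):

```python
def approx_partition_mb(total_gb: float, num_partitions: int) -> float:
    """Rough average partition size in MB; real sizes vary with skew."""
    return total_gb * 1024 / num_partitions

# A 100 GB shuffle over the default 200 partitions:
print(approx_partition_mb(100, 200))  # 512.0 MB per task -- likely to spill
# Quadruple the partition count and each task's working set shrinks:
print(approx_partition_mb(100, 800))  # 128.0 MB per task
```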

3. Shuffle

Problem: Data must be redistributed across executors for operations like JOIN, GROUP BY, and DISTINCT. Shuffle = network transfer + disk I/O.

Diagnosis (Spark UI):

  • Stages tab → “Shuffle Read” and “Shuffle Write” columns
  • Large shuffle = potential bottleneck

Fixes:

  • Broadcast join: for small tables, broadcast to all executors (avoids shuffle)
    from pyspark.sql.functions import broadcast
    result = large_df.join(broadcast(small_df), "key")
  • Partition pruning: filter data BEFORE the join
  • Tune spark.sql.shuffle.partitions (default 200): fewer partitions cut scheduling overhead for small data; more reduce per-task memory
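The payoff of a broadcast join is easy to quantify with back-of-the-envelope numbers (illustrative figures, not measured values):

```python
def shuffle_join_gb(large_gb: float, small_gb: float) -> float:
    # Sort-merge join: both sides are repartitioned by key,
    # so roughly all rows cross the network once
    return large_gb + small_gb

def broadcast_join_gb(small_gb: float, num_executors: int) -> float:
    # Broadcast join: only the small table moves, one full copy
    # per executor; the large table never leaves its executors
    return small_gb * num_executors

# 500 GB fact table joined to a 10 MB dimension table on 20 executors:
print(shuffle_join_gb(500, 0.01))    # ~500 GB moved over the network
print(broadcast_join_gb(0.01, 20))   # ~0.2 GB moved over the network
```

The asymmetry is why the broadcast threshold matters: once the small side fits in executor memory, the network cost collapses from "the whole dataset" to "a few copies of the small table".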

4. Caching

Problem: Recomputing the same DataFrame multiple times wastes resources.

Diagnosis: Same stages appearing repeatedly in the DAG.

When to cache:

# Cache a DataFrame that's reused multiple times
expensive_df = spark.table("big_table").filter("year = 2026").cache()

# Use it multiple times
summary = expensive_df.groupBy("region").count()
details = expensive_df.filter("amount > 1000")

# Unpersist when done
expensive_df.unpersist()

When NOT to cache: DataFrames used only once, or when memory is limited.
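The behaviour is easy to model. Below is a toy, pure-Python stand-in for a lazy DataFrame (a hypothetical class, not a Spark API) showing why an uncached plan is recomputed on every action while a cached one is materialised once:

```python
class LazyFrame:
    """Toy model of a lazy DataFrame -- illustrates cache() semantics only."""

    def __init__(self, compute):
        self._compute = compute       # the "plan": runs on each action
        self._materialised = None
        self._should_cache = False
        self.compute_count = 0        # how many times the plan has run

    def cache(self):
        # Like Spark's cache(): lazy -- nothing runs until an action
        self._should_cache = True
        return self

    def collect(self):
        # An "action": reuse the materialised result if we have one
        if self._materialised is not None:
            return self._materialised
        self.compute_count += 1
        result = self._compute()
        if self._should_cache:
            self._materialised = result
        return result

plan = lambda: [x * x for x in range(5)]

uncached = LazyFrame(plan)
uncached.collect(); uncached.collect()
print(uncached.compute_count)  # 2: every action re-runs the plan

cached = LazyFrame(plan).cache()
cached.collect(); cached.collect()
print(cached.compute_count)    # 1: materialised on the first action, reused after
```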

Diagnostic tools

| Tool | What It Shows | Best For |
| --- | --- | --- |
| DAG Visualisation | Visual flow of stages and their relationships | Understanding execution plan structure |
| Spark UI — Stages | Per-stage metrics: duration, shuffle, spill, task count | Finding slow stages and skew |
| Spark UI — SQL | Logical and physical query plans | Understanding join strategies and filter pushdown |
| Query Profile | SQL warehouse execution plan with operator stats | SQL-specific query optimization |
💡 Exam tip: AQE solves many issues automatically

Adaptive Query Execution (AQE) is enabled by default in recent Databricks runtimes. It automatically:

  • Coalesces small partitions after shuffle (reduces partition overhead)
  • Converts sort-merge joins to broadcast joins when one side is small
  • Handles skew joins by splitting skewed partitions

If the exam asks about the “easiest” or “simplest” fix for skew or shuffle issues, AQE is often the answer — it’s automatic and requires no code changes.

Question

What is data skew and how do you detect it?


Answer

Data skew = one partition has far more data than others, causing one task to take much longer. Detect in Spark UI → Stages → look for one task with disproportionately high duration or data processed.


Question

What is spilling and what causes it?


Answer

Spilling occurs when data doesn't fit in executor memory and overflows to disk. Disk I/O is 10-100x slower. Fix: increase executor memory, increase partitions, or push filters earlier to reduce data volume.


Question

When should you use a broadcast join?


Answer

When one table is small enough to fit in executor memory (typically under 10 MB). Broadcasting avoids shuffle — the small table is sent to all executors. Use broadcast() hint in PySpark.



Knowledge check

Tomás notices that a Spark join in NovaPay's pipeline takes 45 minutes. In the Spark UI, he sees one task processed 90% of the data while 199 tasks each processed tiny amounts. What is the likely issue?


Next up: Optimizing Delta Tables & Azure Monitor — OPTIMIZE, VACUUM, log streaming, and Azure Monitor integration.



© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.