Spark Performance: DAG & Query Profile
Investigate and resolve caching, data skew, spilling, and shuffle issues using the DAG visualisation, Spark UI, and query profile.
Understanding Spark execution
The Spark UI is like an X-ray for your data pipeline.
When a query is slow, you need to see inside: Where is the bottleneck? Is one machine doing all the work while others sit idle (skew)? Is data overflowing from memory to disk (spilling)? Is too much data being sent between machines (shuffle)?
Three diagnostic tools help: the DAG (visual pipeline), the Spark UI (detailed metrics), and the Query Profile (SQL-specific execution plan).
The four performance villains
1. Data skew
Problem: One partition has far more data than others. One task takes 10 minutes while 199 tasks finish in 10 seconds.
Diagnosis (Spark UI):
- Stages tab → look for one task with much higher duration/data processed
- Task metrics show uneven distribution
Fixes:
# Salting: add a random prefix to the skewed key
from pyspark.sql.functions import col, concat, lit, floor, rand
df_salted = df.withColumn("salted_key",
    concat(col("customer_id"), lit("_"), floor(rand() * 10).cast("string")))
# Or use Adaptive Query Execution (AQE) — enabled by default
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
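To see why salting works, here is a minimal plain-Python sketch (no Spark) of the distribution effect. The `customer_id` values, the hot-key row counts, and the salt range of 10 are illustrative, matching `floor(rand() * 10)` in the snippet above:

```python
import random

NUM_SALTS = 10  # matches floor(rand() * 10) in the PySpark snippet

def salted_key(customer_id: str) -> str:
    """Append a random salt so one hot key spreads over NUM_SALTS buckets."""
    return f"{customer_id}_{random.randrange(NUM_SALTS)}"

# A skewed dataset: one customer dominates with 900 of 1000 rows.
rows = ["hot_customer"] * 900 + [f"cust_{i}" for i in range(100)]

# Without salting, all 900 hot rows hash to the same partition.
# With salting, they spread across up to NUM_SALTS keys:
buckets = {}
for cid in rows:
    buckets.setdefault(salted_key(cid), []).append(cid)

hot_buckets = [k for k in buckets if k.startswith("hot_customer_")]
# hot_customer rows now land in up to NUM_SALTS separate buckets
```

Note the trade-off: the other side of a join must be duplicated once per salt value so every salted key still finds its match, which is why AQE's automatic skew handling is usually the simpler option.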
2. Spilling
Problem: Data doesn’t fit in memory and overflows to disk. Disk I/O is 10-100x slower than memory.
Diagnosis (Spark UI):
- Stages tab → “Spill (Memory)” and “Spill (Disk)” columns
- Any non-zero spill indicates memory pressure
Fixes:
- Increase executor memory
- Increase the number of partitions (each task then processes less data)
- Reduce data being processed (push filters earlier)
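The partition-count fix follows from simple arithmetic: spilling stops once each partition fits in the memory available to a task. A back-of-envelope sketch (the dataset size and per-task memory below are illustrative, not from the source):

```python
def min_partitions(dataset_bytes: int, task_memory_bytes: int) -> int:
    """Smallest partition count at which each partition fits in task memory."""
    return -(-dataset_bytes // task_memory_bytes)  # ceiling division

GB = 1024**3
MB = 1024**2

# Illustrative: a 100 GB shuffle with ~512 MB of usable memory per task
# needs at least 200 partitions to avoid spilling to disk.
needed = min_partitions(100 * GB, 512 * MB)
```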
3. Shuffle
Problem: Data must be redistributed across executors for operations like JOIN, GROUP BY, and DISTINCT. Shuffle = network transfer + disk I/O.
Diagnosis (Spark UI):
- Stages tab → “Shuffle Read” and “Shuffle Write” columns
- Large shuffle = potential bottleneck
Fixes:
- Broadcast join: for small tables, broadcast to all executors (avoids shuffle)
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_df), "key")
- Partition pruning: filter data BEFORE the join
- Reduce the shuffle partition count:
spark.conf.set("spark.sql.shuffle.partitions", "100")  # default is 200
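Conceptually, a broadcast join ships the small table to every executor, where it is probed as an in-memory hash map, so no rows of the large table move over the network. A plain-Python sketch of one executor's work (the table contents are made up for illustration):

```python
# The small table, broadcast to every executor as a local dict (the hash side).
small_table = {"US": "United States", "DE": "Germany"}

# One partition of the large table; it never leaves this executor.
large_partition = [("order1", "US"), ("order2", "DE"), ("order3", "US")]

# Each row probes the local dict: no shuffle, no network transfer.
joined = [
    (order_id, code, small_table[code])
    for order_id, code in large_partition
    if code in small_table
]
```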
4. Caching
Problem: Recomputing the same DataFrame multiple times wastes resources.
Diagnosis: Same stages appearing repeatedly in the DAG.
When to cache:
# Cache a DataFrame that's reused multiple times
expensive_df = spark.table("big_table").filter("year = 2026").cache()
# Use it multiple times
summary = expensive_df.groupBy("region").count()
details = expensive_df.filter("amount > 1000")
# Unpersist when done
expensive_df.unpersist()
When NOT to cache: DataFrames used only once, or when memory is limited.
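The effect of `.cache()` can be sketched with plain-Python memoization: the expensive computation runs once, and every later "action" reuses the stored result. The function names, the call counter, and the sample rows below are illustrative, not Spark APIs:

```python
compute_calls = 0

def expensive_compute():
    """Stands in for reading big_table and applying the filter."""
    global compute_calls
    compute_calls += 1
    return [("north", 1500), ("south", 800), ("north", 2200)]

_cache = None  # stands in for the cached, materialized DataFrame

def cached_df():
    """Like .cache(): materialize once, reuse on every later action."""
    global _cache
    if _cache is None:
        _cache = expensive_compute()
    return _cache

# Two separate "actions" on the cached result trigger one computation.
summary = len(cached_df())
details = [row for row in cached_df() if row[1] > 1000]
# compute_calls is 1: the expensive step ran once, not twice
```

The caveat mirrors Spark's: the stored result occupies memory until you release it (Spark's `unpersist()`), which is why caching a DataFrame used only once is pure cost.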
Diagnostic tools
| Tool | What It Shows | Best For |
|---|---|---|
| DAG Visualisation | Visual flow of stages and their relationships | Understanding execution plan structure |
| Spark UI — Stages | Per-stage metrics: duration, shuffle, spill, task count | Finding slow stages and skew |
| Spark UI — SQL | Logical and physical query plans | Understanding join strategies and filter pushdown |
| Query Profile | SQL warehouse execution plan with operator stats | SQL-specific query optimization |
Exam tip: AQE solves many issues automatically
Adaptive Query Execution (AQE) is enabled by default in recent Databricks runtimes. It automatically:
- Coalesces small partitions after shuffle (reduces partition overhead)
- Converts sort-merge joins to broadcast joins when one side is small
- Handles skew joins by splitting skewed partitions
If the exam asks about the “easiest” or “simplest” fix for skew or shuffle issues, AQE is often the answer — it’s automatic and requires no code changes.
Knowledge check
Tomás notices that a Spark join in NovaPay's pipeline takes 45 minutes. In the Spark UI, he sees one task processed 90% of the data while 199 tasks each processed tiny amounts. What is the likely issue?
Next up: Optimizing Delta Tables & Azure Monitor — OPTIMIZE, VACUUM, log streaming, and Azure Monitor integration.