
DP-700 Study Guide

Domain 1: Implement and Manage an Analytics Solution

  • Workspace Settings: Your Fabric Foundation
  • Version Control: Git in Fabric
  • Deployment Pipelines: Dev to Production
  • Access Controls: Who Gets In
  • Data Security: Control Who Sees What
  • Governance: Labels, Endorsement & Audit
  • Orchestration: Pick the Right Tool
  • Pipeline Patterns: Parameters & Expressions

Domain 2: Ingest and Transform Data

  • Delta Lake: The Heart of Fabric
  • Loading Patterns: Full, Incremental & Streaming
  • Dimensional Modeling: Prep for Analytics
  • Data Stores & Tools: Make the Right Choice
  • OneLake Shortcuts: Data Without Duplication
  • Mirroring: Real-Time Database Replication
  • PySpark Transformations: Code Your Pipeline
  • Transform Data with SQL & KQL
  • Eventstreams & Spark Streaming: Real-Time Ingestion
  • Real-Time Intelligence: KQL & Windowing

Domain 3: Monitor and Optimize an Analytics Solution

  • Monitoring & Alerts: Catch Problems Early
  • Troubleshoot Pipelines & Dataflows
  • Troubleshoot Notebooks & SQL
  • Troubleshoot Streaming & Shortcuts
  • Optimize Lakehouse Tables: Delta Tuning
  • Optimize Spark: Speed Up Your Code
  • Optimize Pipelines & Warehouses
  • Optimize Streaming: Real-Time Performance

Domain 3: Monitor and Optimize an Analytics Solution ⏱ ~13 min read

Optimize Spark: Speed Up Your Code

Tune Spark performance with partitioning, caching, broadcast joins, predicate pushdown, and pool configuration for faster notebook execution.

Making Spark faster

β˜• Simple explanation

Think of moving house with a team of helpers.

If one person carries all the heavy boxes while others carry nothing, the move takes forever (data skew). If helpers keep going back to the old house to look at the same boxes (no caching), they waste time. If you ship a grand piano across town when you could carry it next door (shuffle), you’re doing unnecessary work.

Spark optimization is about distributing work evenly, caching things you reuse, and minimizing data movement across the cluster.

Spark performance optimization covers four areas: partitioning (how data is distributed across executors), caching (keeping frequently accessed data in memory), join optimization (broadcast joins for small tables, sort-merge for large), and configuration (executor count, memory, shuffle settings). The Spark UI is the diagnostic tool for identifying bottlenecks.

Partitioning

Data partitioning (storage-level)

Partition your Delta tables by columns used in WHERE filters:

df.write.format("delta") \
    .partitionBy("year", "month") \
    .mode("overwrite") \
    .save("Tables/FactOrders")

Effect: Spark reads only the partitions needed. WHERE year = 2026 AND month = 4 reads one folder instead of scanning the entire table.

When to partition: Low-cardinality columns such as year/month date parts, or columns consistently present in WHERE filters. Don't over-partition: 10,000 partitions of 1 MB each are worse than 100 partitions of 100 MB each (the small-file problem).

Shuffle partitioning (runtime-level)

# Default: 200 shuffle partitions (often too many for small datasets)
spark.conf.set("spark.sql.shuffle.partitions", 50)

Rule of thumb: Set shuffle partitions to 2-3x the number of executor cores.
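The rule of thumb above can be sketched as a tiny helper (a hypothetical function for illustration, not a Spark or Fabric API):

```python
def recommended_shuffle_partitions(executors: int, cores_per_executor: int,
                                   factor: int = 2) -> int:
    """Suggest a spark.sql.shuffle.partitions value as a multiple of total cores.

    The 2-3x rule of thumb: enough partitions to keep every core busy,
    without creating thousands of tiny tasks.
    """
    total_cores = executors * cores_per_executor
    return max(1, total_cores * factor)

# Example: 4 executors x 8 cores, 2x factor -> 64 shuffle partitions
print(recommended_shuffle_partitions(4, 8))            # 64
print(recommended_shuffle_partitions(4, 8, factor=3))  # 96
```

You would then feed the result into `spark.conf.set("spark.sql.shuffle.partitions", ...)` as shown above.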

Caching

Cache DataFrames you read multiple times:

# Cache in memory
df_customers = spark.table("lakehouse.DimCustomer").cache()

# Use it multiple times
joined_orders = df_orders.join(df_customers, "customer_id")
joined_returns = df_returns.join(df_customers, "customer_id")

# Uncache when done
df_customers.unpersist()
Cache frequently reused DataFrames; unpersist when done
Storage Level | Where | When to Use
MEMORY_ONLY (.cache()) | Executor RAM | Small DataFrames read multiple times
MEMORY_AND_DISK | RAM first, spill to disk | Medium DataFrames that might not fit entirely in memory
DISK_ONLY | Executor disk | Large DataFrames read multiple times, where re-reading the source is expensive

Join optimization

Broadcast joins

When one table is small (under ~10 MB), Spark can broadcast it to every executor β€” eliminating the expensive shuffle of the large table.

from pyspark.sql.functions import broadcast

# Force broadcast of the small dimension table
df_result = df_orders.join(
    broadcast(df_products),   # Send products to every executor
    "product_id"
)

Effect: Instead of shuffling 500M order rows, Spark sends the 5,000-row product table to every executor. Each executor joins locally. Massive speed improvement.

Sort-merge joins (default for large tables)

When both tables are large, Spark sorts both by the join key, then merges. This requires a shuffle (expensive but necessary).

Optimization: Ensure both tables are partitioned or bucketed by the join key to reduce shuffle data volume.

Broadcast small tables; sort-merge for large-to-large joins
Join Type | When | Performance
Broadcast join | One table is small (<10 MB default threshold) | Fast β€” no shuffle of the large table
Sort-merge join | Both tables are large | Moderate β€” requires shuffle and sort of both tables
Shuffle hash join | One table is moderately sized | Between broadcast and sort-merge
πŸ’‘ Exam tip: Broadcast join threshold

The default broadcast threshold is 10 MB (spark.sql.autoBroadcastJoinThreshold). Tables under this size are automatically broadcast.

Exam pattern: β€œA join is slow despite one table being small.” Check if the small table is just above the threshold. Fix: increase the threshold or explicitly use broadcast().
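The threshold fix from the exam tip looks like this (the value is in bytes; 10 MB is the default):

```python
# Raise the auto-broadcast threshold to 50 MB (value is in bytes)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

# Or disable auto-broadcast entirely and rely on explicit broadcast() hints
# spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
```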

Predicate pushdown

Spark pushes WHERE filters down to the file level β€” reading only relevant data.

from pyspark.sql.functions import col

# Good: filter early β†’ Spark reads fewer files
df = spark.table("lakehouse.FactOrders") \
    .filter(col("order_date") >= "2026-04-01") \
    .filter(col("status") == "completed") \
    .join(df_customers, "customer_id")

# Bad: join first, filter later β†’ reads the entire table before filtering
df = spark.table("lakehouse.FactOrders") \
    .join(df_customers, "customer_id") \
    .filter(col("order_date") >= "2026-04-01")

Rule: Filter as early as possible in the transformation chain.

Pool configuration

Setting | Effect | Recommendation
Node count | More nodes = more parallelism | Scale up for large jobs, scale down for cost
Node size | More memory/cores per node | Increase for memory-heavy operations (large broadcasts, wide joins)
Auto-scale | Adds/removes nodes based on demand | Enable for variable workloads
Timeout | How long idle sessions stay alive | 10-30 min for interactive; shorter for automated runs
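Pool sizing is configured in the Fabric workspace UI, but auto-scale maps onto Spark's dynamic allocation properties, which look roughly like this (a sketch of session-startup properties; exact names and availability depend on the environment):

```
spark.dynamicAllocation.enabled        true
spark.dynamicAllocation.minExecutors   2
spark.dynamicAllocation.maxExecutors   10
```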

Question

What is a broadcast join and when should you use it?


Answer

A broadcast join copies a small table to every Spark executor, so the large table doesn't need to shuffle. Use when one join table is small (<10 MB). Explicitly call broadcast(df_small) or increase autoBroadcastJoinThreshold.


Question

What is predicate pushdown in Spark?


Answer

Spark pushes WHERE filters down to the file/partition level, reading only relevant data. Filter EARLY in the transformation chain (before joins) to minimize data processed.


Question

When should you cache a DataFrame?


Answer

When you read it multiple times in the same notebook (e.g., joining it with different tables). Don't cache DataFrames you only use once. Always unpersist() when done to free memory.



Knowledge Check

Carlos's notebook joins a 500M-row FactProduction table with a 3,000-row DimProduct table. The join takes 8 minutes. What optimization would have the most impact?

Knowledge Check

A Spark notebook filters FactOrders (1B rows) after joining with DimCustomer. Moving the filter BEFORE the join reduces execution time by 70%. What optimization principle is this?


Next up: Optimize Pipelines & Warehouses β€” tune pipeline performance and warehouse query execution.



© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.