
DP-750 Study Guide

Domain 1: Set Up and Configure an Azure Databricks Environment

  • Azure Databricks: Your Lakehouse Platform Free
  • Choosing the Right Compute Free
  • Configuring Compute for Performance Free
  • Unity Catalog: The Three-Level Namespace Free
  • Tables, Views & External Catalogs Free

Domain 2: Secure and Govern Unity Catalog Objects

  • Securing Unity Catalog: Who Gets What
  • Secrets & Authentication
  • Data Discovery & Attribute-Based Access
  • Row Filters, Column Masks & Retention
  • Lineage, Audit Logs & Delta Sharing

Domain 3: Prepare and Process Data

  • Data Modeling: Ingestion Design Free
  • SCD, Granularity & Temporal Tables
  • Partitioning, Clustering & Table Optimization
  • Ingesting Data: Lakeflow Connect & Notebooks
  • Ingesting Data: SQL Methods & CDC
  • Streaming Ingestion: Structured Streaming & Event Hubs
  • Auto Loader & Declarative Pipelines
  • Cleansing & Profiling Data Free
  • Transforming & Loading Data
  • Data Quality & Schema Enforcement

Domain 4: Deploy and Maintain Data Pipelines and Workloads

  • Building Data Pipelines Free
  • Lakeflow Jobs: Create & Configure
  • Lakeflow Jobs: Schedule, Alerts & Recovery
  • Git & Version Control
  • Testing & Databricks Asset Bundles
  • Monitoring Clusters & Troubleshooting
  • Spark Performance: DAG & Query Profile
  • Optimizing Delta Tables & Azure Monitor

Domain 3: Prepare and Process Data

Partitioning, Clustering & Table Optimization

Physical data layout strategies — partitioning schemes, liquid clustering, Z-ordering, and deletion vectors — that make or break query performance.

Why physical layout matters

☕ Simple explanation

Imagine a library. Partitioning is sorting books by genre onto separate shelves. Clustering is sorting within each shelf by author.

When someone asks for “Science Fiction by Asimov,” you go straight to the Sci-Fi shelf (partition pruning) and then to the A section (data skipping). Without this, you’d have to check every book in the library.

The same applies to data. A well-partitioned, well-clustered table lets Spark skip irrelevant files entirely — reading 10 files instead of 10,000.

Physical data layout determines how data files are organised on storage. Good layout enables partition pruning (skip entire directories) and data skipping (skip files based on min/max statistics). Poor layout forces full table scans.

Azure Databricks offers three layout strategies: Hive-style partitioning, Z-ordering, and the newer liquid clustering. Each has different trade-offs around write performance, query patterns, and maintenance.
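Both skip mechanisms can be sketched as a toy model (the file layout, statistics, and helper below are hypothetical and purely illustrative; real Delta tables keep these statistics in the transaction log): partition pruning drops whole directories, and data skipping drops individual files whose min/max range cannot match the filter.

```python
# Toy model of partition pruning and data skipping (illustrative only).
# Hypothetical layout: partition directory -> list of (min_amount, max_amount) per file.
table = {
    "txn_date=2026-04-01": [(10, 90), (95, 200)],
    "txn_date=2026-04-02": [(5, 50), (60, 400)],
    "txn_date=2026-04-03": [(100, 150)],
}

def files_to_read(table, txn_date, min_amount):
    """Return files that could match:
    WHERE txn_date = :txn_date AND amount >= :min_amount."""
    read = []
    for directory, files in table.items():
        # Partition pruning: skip directories for other dates entirely.
        if directory != f"txn_date={txn_date}":
            continue
        for lo, hi in files:
            # Data skipping: a file whose max is below the filter can't match.
            if hi >= min_amount:
                read.append((directory, lo, hi))
    return read

# Only one of the five files survives both checks.
print(files_to_read(table, "2026-04-02", 55))
```

With a good layout, the engine opens one file instead of five; on a real table the ratio is often thousands to one.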

Partitioning

Partitioning divides a table into physical directories based on column values:

-- Create a partitioned table
CREATE TABLE sales.bronze.transactions (
  txn_id BIGINT,
  store_id INT,
  amount DECIMAL(10,2),
  txn_date DATE
)
PARTITIONED BY (txn_date);

This creates a directory per date: /txn_date=2026-04-01/, /txn_date=2026-04-02/, etc.

When queries filter on txn_date, Spark reads ONLY the relevant directories.

Partitioning rules

| Rule | Why |
| --- | --- |
| Partition on low-cardinality columns | Too many partitions = too many small files |
| Keep each partition at 1 GB or more | Small files hurt read performance |
| Partition on columns used in WHERE filters | Pruning only works if queries filter on the partition column |
| Don't over-partition | 10,000+ partitions = metadata overhead and the small file problem |
ℹ️ The small file problem

If you partition by a high-cardinality column (e.g., customer_id with 1 million values), you get 1 million tiny files. Each file has overhead (metadata, open/close operations), and Spark spends more time managing files than reading data.

Rule of thumb: If each partition would be less than 128 MB, don’t partition on that column — use Z-ordering or liquid clustering instead.
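The rule of thumb reduces to simple arithmetic. A minimal sketch (the helper names and the example table sizes are hypothetical; the 128 MB threshold comes from the text above):

```python
# Apply the 128 MB rule of thumb from the text. Illustrative arithmetic only.

def avg_partition_mb(table_size_gb, column_cardinality):
    """Average size per partition if we partitioned on this column."""
    return table_size_gb * 1024 / column_cardinality

def should_partition(table_size_gb, column_cardinality, threshold_mb=128):
    """Partition only if each partition would clear the threshold."""
    return avg_partition_mb(table_size_gb, column_cardinality) >= threshold_mb

# 1 TB table, 365 daily partitions: ~2.8 GB each -> partitioning is fine.
print(should_partition(1024, 365))        # True
# 1 TB table, 1,000,000 customer_ids: ~1 MB each -> cluster instead.
print(should_partition(1024, 1_000_000))  # False
```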

Z-ordering

Z-ordering co-locates related data within files based on one or more columns — without physical partitioning:

-- Z-order the table by region and product_category
OPTIMIZE sales.curated.daily_summary
  ZORDER BY (region, product_category);

Z-ordering sorts data across files so that rows with similar region and product_category values are stored together. This enables data skipping — Spark reads file-level min/max statistics and skips files that can’t contain matching values.
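Under the hood, Z-ordering maps multi-column values onto a space-filling (Z-order, or Morton) curve, so rows that are close in several dimensions end up close in sort order. A minimal sketch of the bit-interleaving idea (illustrative; Databricks' actual implementation differs in detail):

```python
def z_value(x, y, bits=8):
    """Interleave the bits of x and y into a single Morton (Z-order) key.
    Sorting by this key keeps values that are close in BOTH dimensions
    near each other, which is what makes file-level min/max stats useful."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return z

# Neighbouring (x, y) points get nearby Z-values...
print(z_value(2, 2))   # 12
print(z_value(3, 3))   # 15
# ...while a point far away in y lands far away in Z order.
print(z_value(2, 200))
```

Sorting rows by this key before writing files is what keeps each file's min/max ranges narrow on both columns at once.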

Z-ordering vs partitioning:

| Aspect | Partitioning | Z-ordering |
| --- | --- | --- |
| Physical layout | Separate directories | Co-located within files |
| Best for | Low-cardinality columns (date, region) | Medium/high-cardinality columns |
| Maintenance | Automatic (write-time) | Manual (run OPTIMIZE ZORDER) |
| Combinable | Yes (partition + Z-order) | Yes (Z-order within partitions) |

Liquid clustering (the modern approach)

Liquid clustering is Databricks’ newest layout strategy — it replaces both partitioning and Z-ordering for most use cases:

-- Create a table with liquid clustering
CREATE TABLE sales.curated.transactions (
  txn_id BIGINT,
  store_id INT,
  amount DECIMAL(10,2),
  region STRING,
  txn_date DATE
)
CLUSTER BY (region, txn_date);

-- Trigger clustering (or let OPTIMIZE do it automatically)
OPTIMIZE sales.curated.transactions;

| Feature | Liquid Clustering | Traditional Partitioning | Z-Ordering |
| --- | --- | --- | --- |
| Incremental | Yes (only re-clusters new data) | N/A | No (re-processes entire table) |
| Change cluster keys | Yes (ALTER TABLE) | No (must rewrite table) | Yes |
| Write performance | Good (minimal overhead) | Good | Overhead at OPTIMIZE time |
| Flexibility | High (change keys anytime) | Low (locked at creation) | Medium |
| Recommended for | Most new tables | Legacy compatibility | Legacy tables |

Exam default: For new tables, liquid clustering is the recommended approach. It’s more flexible than partitioning and more efficient than Z-ordering.

-- Change clustering keys without rewriting the table
ALTER TABLE sales.curated.transactions CLUSTER BY (store_id, txn_date);

💡 Exam tip: When to use which
  1. New table with predictable filter patterns → Liquid clustering
  2. Legacy table already partitioned → Keep partitioning + add Z-ordering
  3. High-cardinality filter column → Z-ordering or liquid clustering (never partition)
  4. Query filters change often → Liquid clustering (can change keys with ALTER TABLE)
  5. Very large table partitioned by date → Keep date partitioning + Z-order within partitions
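The decision rules above can be condensed into a small helper (a hypothetical function encoding this section's guidance, not an official API):

```python
# Hypothetical decision helper encoding the exam-tip rules above.

def choose_layout(new_table, already_partitioned=False):
    """Map the exam-tip rules to a layout recommendation."""
    if not new_table and already_partitioned:
        # Rules 2 & 5: legacy partitioned tables keep their layout,
        # with Z-ordering added within partitions.
        return "keep partitioning + Z-order within partitions"
    # Rules 1, 3 & 4: everything else defaults to liquid clustering
    # (handles high-cardinality columns and changing filter patterns).
    return "liquid clustering"

print(choose_layout(new_table=True))
print(choose_layout(new_table=False, already_partitioned=True))
```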

Deletion vectors

Deletion vectors improve DELETE and MERGE performance by marking rows as deleted without rewriting entire files:

-- Enable deletion vectors
ALTER TABLE sales.curated.transactions
  SET TBLPROPERTIES ('delta.enableDeletionVectors' = true);

Without deletion vectors: DELETE rewrites every affected file (expensive for large files). With deletion vectors: A small metadata file marks which rows are deleted. Files are only rewritten during the next OPTIMIZE.

This is especially important for SCD Type 2 MERGE operations where you expire old rows — deletion vectors make the expire step nearly instant.
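Conceptually, a deletion vector is just a per-file bitmap consulted at read time. A simplified model (illustrative Python class, not the actual Delta format):

```python
class DeltaFileModel:
    """Toy model of one data file plus its deletion vector."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.deleted = set()  # deletion vector: row positions marked deleted

    def delete_where(self, predicate):
        # DELETE just records positions; no data file is rewritten.
        for i, row in enumerate(self.rows):
            if predicate(row):
                self.deleted.add(i)

    def read(self):
        # Readers filter out soft-deleted rows on the fly.
        return [r for i, r in enumerate(self.rows) if i not in self.deleted]

    def optimize(self):
        # OPTIMIZE physically rewrites the file and clears the vector.
        self.rows = self.read()
        self.deleted = set()

f = DeltaFileModel([{"id": 1, "current": True}, {"id": 2, "current": True}])
f.delete_where(lambda r: r["id"] == 1)  # near-instant: only marks a bit
print(f.read())                         # [{'id': 2, 'current': True}]
f.optimize()                            # file rewritten, vector cleared
print(len(f.rows))                      # 1
```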

Question

What is liquid clustering and why is it recommended over partitioning?


Answer

Liquid clustering co-locates data by specified columns, is incremental (only re-clusters new data), and allows changing cluster keys with ALTER TABLE. Unlike partitioning, it doesn't create physical directories or the small file problem.


Question

What is the key rule for choosing a partition column?


Answer

Partition on low-cardinality columns where each partition has at least 1 GB of data. If partitions would be smaller than 128 MB, use Z-ordering or liquid clustering instead.


Question

What are deletion vectors and why do they matter?


Answer

Deletion vectors mark rows as deleted in a small metadata file instead of rewriting entire data files. This makes DELETE and MERGE much faster. Files are compacted during the next OPTIMIZE run.


🎬 Video coming soon

Knowledge check

Mei Lin's transactions table at Freshmart has 500 million rows. Analysts query it by region (5 values) and product_category (200 values). She's building a new table. Which layout strategy should she use?


Next up: Ingesting Data: Lakeflow Connect & Notebooks — batch and streaming ingestion with Lakeflow Connect and notebook code.



© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.