Optimize Lakehouse Tables: Delta Tuning
Use OPTIMIZE, VACUUM, Z-ordering, and V-ordering to make your Delta Lake tables fast, compact, and storage-efficient.
Why optimize Delta tables?
Think of a filing cabinet that nobody organises.
After months of use, you have thousands of tiny folders (small files), old documents that should have been shredded (orphaned files), and no logical order to anything (unsorted data). Finding something takes forever.
Delta table optimization is like hiring a professional organiser:
- OPTIMIZE combines tiny files into bigger ones (compaction).
- VACUUM removes old files that are no longer needed.
- Z-ordering sorts data so queries on common columns skip irrelevant files.
- V-ordering is Fabric's special compression for even faster reads.
OPTIMIZE (compaction)
Combines many small files into fewer larger files. This is the single most impactful optimization.
```sql
-- Compact all files in the table
OPTIMIZE lakehouse.FactOrders;

-- Compact with Z-ordering on a specific column
OPTIMIZE lakehouse.FactOrders ZORDER BY (order_date);

-- Compact a specific partition
OPTIMIZE lakehouse.FactOrders WHERE year = 2026;
```
Before and after
| Metric | Before OPTIMIZE | After OPTIMIZE |
|---|---|---|
| Number of files | 12,847 small files | 156 optimally sized files |
| Average file size | 2 MB | 128 MB |
| Query scan time | 45 seconds | 6 seconds |
Why small files are a problem
Each Parquet file carries fixed overhead: opening the file, reading metadata, and parsing the footer. With 12,847 files, Spark spends more time on that per-file overhead than on reading the actual data.
OPTIMIZE reads all the small files and rewrites them as fewer, optimally sized files (target: ~128 MB each). The data is identical — only the physical file layout changes.
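As a rough sanity check, you can estimate the post-compaction file count from the total data size and the ~128 MB target. This is a back-of-the-envelope sketch only; actual counts differ because compaction can also change compression ratios and row-group boundaries:

```python
import math

def estimate_compacted_files(num_files: int, avg_file_mb: float, target_mb: int = 128) -> int:
    """Rough file count after OPTIMIZE rewrites data into ~target_mb files."""
    total_mb = num_files * avg_file_mb
    return max(1, math.ceil(total_mb / target_mb))

# 12,847 files averaging 2 MB is roughly 25 GB of data
print(estimate_compacted_files(12_847, 2))  # 201
```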
Best practice: Run OPTIMIZE after batch loads, especially after many small appends or streaming micro-batches.
VACUUM (cleanup)
Removes files that are no longer referenced by the transaction log. These are leftover from UPDATE, DELETE, MERGE, or OPTIMIZE operations.
```sql
-- Remove files older than 7 days (default retention)
VACUUM lakehouse.FactOrders;

-- Remove files older than 24 hours (careful — reduces time travel window).
-- Retention below the 7-day default is blocked by a safety check unless you
-- first run: SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM lakehouse.FactOrders RETAIN 24 HOURS;
```
What VACUUM does NOT delete
- Current data files (referenced by the latest transaction log)
- Files within the retention period (default: 7 days)
- The transaction log itself (_delta_log/)
Exam tip: VACUUM and time travel
Critical trade-off: VACUUM removes old file versions. If you VACUUM with a 24-hour retention, you can only time travel back 24 hours. Files needed for older versions are gone.
Exam pattern: “After running VACUUM, time travel to version 5 fails.” Why? VACUUM deleted the files that version 5 referenced.
Safe default: Keep the 7-day retention period unless storage cost is critical.
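The trade-off can be sketched as a simple cutoff computation. This is an illustrative model only (real VACUUM walks the transaction log to find unreferenced files), but it captures why a short retention breaks older time travel:

```python
from datetime import datetime, timedelta

def vacuum_cutoff(now: datetime, retain_hours: float) -> datetime:
    """Files last needed before this timestamp are eligible for deletion."""
    return now - timedelta(hours=retain_hours)

def time_travel_ok(version_ts: datetime, now: datetime, retain_hours: float) -> bool:
    """A historical version stays readable only if its files survived the cutoff."""
    return version_ts >= vacuum_cutoff(now, retain_hours)

now = datetime(2026, 4, 1, 12, 0)
version_ts = now - timedelta(hours=3)       # a version written 3 hours ago
print(time_travel_ok(version_ts, now, 2))   # False: its files were vacuumed
print(time_travel_ok(version_ts, now, 24))  # True: still within retention
```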
Z-ordering
Z-ordering physically sorts data within files by specified columns. This enables data skipping — Spark reads file metadata (min/max values per column) and skips files that can’t contain matching rows.
```sql
-- Z-order by the most commonly filtered column
OPTIMIZE lakehouse.FactOrders ZORDER BY (customer_id);

-- Z-order by multiple columns (first column gets priority)
OPTIMIZE lakehouse.FactOrders ZORDER BY (order_date, region);
```
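The data-skipping mechanism can be sketched with per-file min/max statistics. The file names and stats below are hypothetical, and the model is simplified (the real statistics live in the Delta transaction log), but the overlap test is the core idea:

```python
# Each Parquet file's min/max stats per column, as tracked for data skipping
# (hypothetical values, not the actual transaction-log format).
files = [
    {"name": "part-000", "min": "2026-01-01", "max": "2026-02-15"},
    {"name": "part-001", "min": "2026-02-16", "max": "2026-03-31"},
    {"name": "part-002", "min": "2026-04-01", "max": "2026-05-10"},
]

def files_to_scan(files, lo, hi):
    """Keep only files whose [min, max] range overlaps the query range [lo, hi]."""
    return [f["name"] for f in files if f["max"] >= lo and f["min"] <= hi]

# order_date BETWEEN '2026-04-01' AND '2026-04-30': two of three files are skipped
print(files_to_scan(files, "2026-04-01", "2026-04-30"))  # ['part-002']
```

Without Z-ordering, values for any given date are scattered across many files, so every file's min/max range overlaps the query and nothing can be skipped; sorting is what makes the ranges narrow and disjoint.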
When to Z-order
| Scenario | Z-Order Column | Why |
|---|---|---|
| Queries filter by date | order_date | Date range queries skip non-matching files |
| Queries filter by customer | customer_id | Customer lookups read fewer files |
| Queries join on product_id | product_id | Join scans only relevant files |
Scenario: Carlos Z-orders production data
Carlos’s FactProduction table has 500M rows. Most queries filter by factory_id and production_date. Without Z-ordering, every query scans all files.
After running OPTIMIZE FactProduction ZORDER BY (production_date, factory_id):
- A query for “Factory F01, April 2026” scans 12 files instead of 1,500
- Query time drops from 35 seconds to 3 seconds
V-ordering
V-ordering is Fabric’s proprietary write-time optimization that applies special sorting and compression to Parquet files. It’s enabled by default for Fabric lakehouse tables.
| Feature | Z-ordering | V-ordering |
|---|---|---|
| When applied | Manually with OPTIMIZE command | Automatically on write (default in Fabric) |
| What it does | Sorts data by specific columns for data skipping | Applies optimized compression and read-path optimization |
| Configured by | You choose the columns | Enabled by default; adjustable via table properties, session config, or the OPTIMIZE VORDER option |
| Works with | Spark engine (data skipping) | All Fabric engines (Spark, SQL endpoint, Power BI Direct Lake) |
| Relationship | Can be combined with V-ordering | Complements Z-ordering |
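The V-ordering controls mentioned above can be sketched in SQL. The property names below come from Microsoft's Fabric documentation but have changed across runtime versions, so verify them against your runtime before relying on them:

```sql
-- Enable V-Order for the current Spark session
SET spark.sql.parquet.vorder.enabled = TRUE;

-- Enable V-Order on an existing table via table properties
ALTER TABLE lakehouse.FactOrders
  SET TBLPROPERTIES ('delta.parquet.vorder.enabled' = 'true');

-- Apply V-Order while compacting
OPTIMIZE lakehouse.FactOrders VORDER;
```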
Table maintenance strategy
| Task | Frequency | Command |
|---|---|---|
| OPTIMIZE | After batch loads, daily for heavy tables | OPTIMIZE table_name |
| OPTIMIZE + Z-ORDER | Weekly or after major loads | OPTIMIZE table ZORDER BY (column) |
| VACUUM | Weekly | VACUUM table_name |
| Analyze table | After major schema or data changes | ANALYZE TABLE table_name COMPUTE STATISTICS |
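In a Fabric notebook, the routine tasks above are often scripted rather than run by hand. A minimal sketch, assuming a `run_sql` callable such as `spark.sql` (the wiring and table names are illustrative):

```python
def run_maintenance(tables, run_sql):
    """Apply routine Delta maintenance to each table.
    `run_sql` would be spark.sql in a Fabric notebook (assumed wiring)."""
    for t in tables:
        run_sql(f"OPTIMIZE {t}")  # compact small files after batch loads
        run_sql(f"VACUUM {t}")    # default 7-day retention preserves time travel

# Dry run: collect the statements instead of executing them against Spark
statements = []
run_maintenance(["lakehouse.FactOrders"], statements.append)
print(statements)  # ['OPTIMIZE lakehouse.FactOrders', 'VACUUM lakehouse.FactOrders']
```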
Check your understanding
1. Carlos runs VACUUM on a Delta table with RETAIN 2 HOURS. Later, he tries to time travel to a version from 3 hours ago. The query fails. Why?
2. A lakehouse table has 15,000 small files (average 1.5 MB each) after months of streaming micro-batches. Queries take 40+ seconds. What is the single most impactful optimization?
Next up: Optimize Spark: Speed Up Your Code — partition strategies, caching, broadcast joins, and Spark configuration tuning.