Optimize Lakehouse Tables: Delta Tuning
Use OPTIMIZE, VACUUM, Z-ordering, and V-ordering to make your Delta Lake tables fast, compact, and storage-efficient.
Why optimize Delta tables?
Think of a filing cabinet that nobody organises.
After months of use, you have thousands of tiny folders (small files), old documents that should have been shredded (orphaned files), and no logical order to anything (unsorted data). Finding something takes forever.
Delta table optimization is like hiring a professional organiser:
- OPTIMIZE combines tiny files into bigger ones (compaction).
- VACUUM removes old files that are no longer needed.
- Z-ordering sorts data so queries on common columns skip irrelevant files.
- V-ordering is Fabric's special compression for even faster reads.
OPTIMIZE (compaction)
Combines many small files into fewer larger files. This is the single most impactful optimization.
```sql
-- Compact all files in the table
OPTIMIZE lakehouse.FactOrders;

-- Compact with Z-ordering on a specific column
OPTIMIZE lakehouse.FactOrders ZORDER BY (order_date);

-- Compact a specific partition
OPTIMIZE lakehouse.FactOrders WHERE year = 2026;
```
Before and after
| Metric | Before OPTIMIZE | After OPTIMIZE |
|---|---|---|
| Number of files | 12,847 small files | 156 optimally sized files |
| Average file size | 2 MB | 128 MB |
| Query scan time | 45 seconds | 6 seconds |
Why small files are a problem
Each Parquet file carries fixed overhead: opening the file, reading metadata, and parsing the footer. With 12,847 files, Spark spends more time on that per-file overhead than on reading the actual data.
OPTIMIZE reads all the small files and rewrites them as fewer, optimally sized files (target: ~128 MB each). The data is identical — only the physical file layout changes.
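As a rough sanity check, you can estimate the post-compaction file count from the total data size and the ~128 MB target. This is a back-of-the-envelope sketch only; actual counts differ because compaction can also change compression ratios and row-group boundaries:

```python
import math

def estimate_compacted_files(num_files: int, avg_file_mb: float, target_mb: int = 128) -> int:
    """Rough file count after OPTIMIZE rewrites data into ~target_mb files."""
    total_mb = num_files * avg_file_mb
    return max(1, math.ceil(total_mb / target_mb))

# 12,847 files averaging 2 MB is roughly 25 GB of data
print(estimate_compacted_files(12_847, 2))  # 201
```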
Best practice: Run OPTIMIZE after batch loads, especially after many small appends or streaming micro-batches.
VACUUM (cleanup)
Removes files that are no longer referenced by the transaction log. These are leftover from UPDATE, DELETE, MERGE, or OPTIMIZE operations.
```sql
-- Remove files older than 7 days (default retention)
VACUUM lakehouse.FactOrders;

-- Remove files older than 24 hours (careful — reduces time travel window).
-- Retention below the 7-day default is blocked by a safety check unless you
-- first run: SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM lakehouse.FactOrders RETAIN 24 HOURS;
```
What VACUUM does NOT delete
- Current data files (referenced by the latest transaction log)
- Files within the retention period (default: 7 days)
- The transaction log itself (_delta_log/)
Exam tip: VACUUM and time travel
Critical trade-off: VACUUM removes old file versions. If you VACUUM with a 24-hour retention, you can only time travel back 24 hours. Files needed for older versions are gone.
Exam pattern: “After running VACUUM, time travel to version 5 fails.” Why? VACUUM deleted the files that version 5 referenced.
Safe default: Keep the 7-day retention period unless storage cost is critical.
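The trade-off can be sketched as a simple cutoff computation. This is an illustrative model only (real VACUUM walks the transaction log to find unreferenced files), but it captures why a short retention breaks older time travel:

```python
from datetime import datetime, timedelta

def vacuum_cutoff(now: datetime, retain_hours: float) -> datetime:
    """Files last needed before this timestamp are eligible for deletion."""
    return now - timedelta(hours=retain_hours)

def time_travel_ok(version_ts: datetime, now: datetime, retain_hours: float) -> bool:
    """A historical version stays readable only if its files survived the cutoff."""
    return version_ts >= vacuum_cutoff(now, retain_hours)

now = datetime(2026, 4, 1, 12, 0)
version_ts = now - timedelta(hours=3)       # a version written 3 hours ago
print(time_travel_ok(version_ts, now, 2))   # False: its files were vacuumed
print(time_travel_ok(version_ts, now, 24))  # True: still within retention
```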
Z-ordering
Z-ordering physically sorts data within files by specified columns. This enables data skipping — Spark reads file metadata (min/max values per column) and skips files that can’t contain matching rows.
```sql
-- Z-order by the most commonly filtered column
OPTIMIZE lakehouse.FactOrders ZORDER BY (customer_id);

-- Z-order by multiple columns (first column gets priority)
OPTIMIZE lakehouse.FactOrders ZORDER BY (order_date, region);
```
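The data-skipping mechanism can be sketched with per-file min/max statistics. The file names and stats below are hypothetical, and the model is simplified (the real statistics live in the Delta transaction log), but the overlap test is the core idea:

```python
# Each Parquet file's min/max stats per column, as tracked for data skipping
# (hypothetical values, not the actual transaction-log format).
files = [
    {"name": "part-000", "min": "2026-01-01", "max": "2026-02-15"},
    {"name": "part-001", "min": "2026-02-16", "max": "2026-03-31"},
    {"name": "part-002", "min": "2026-04-01", "max": "2026-05-10"},
]

def files_to_scan(files, lo, hi):
    """Keep only files whose [min, max] range overlaps the query range [lo, hi]."""
    return [f["name"] for f in files if f["max"] >= lo and f["min"] <= hi]

# order_date BETWEEN '2026-04-01' AND '2026-04-30': two of three files are skipped
print(files_to_scan(files, "2026-04-01", "2026-04-30"))  # ['part-002']
```

Without Z-ordering, values for any given date are scattered across many files, so every file's min/max range overlaps the query and nothing can be skipped; sorting is what makes the ranges narrow and disjoint.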
When to Z-order
| Scenario | Z-Order Column | Why |
|---|---|---|
| Queries filter by date | order_date | Date range queries skip non-matching files |
| Queries filter by customer | customer_id | Customer lookups read fewer files |
| Queries join on product_id | product_id | Join scans only relevant files |
Scenario: Carlos Z-orders production data
Carlos’s FactProduction table has 500M rows. Most queries filter by factory_id and production_date. Without Z-ordering, every query scans all files.
After running OPTIMIZE FactProduction ZORDER BY (production_date, factory_id):
- A query for “Factory F01, April 2026” scans 12 files instead of 1,500
- Query time drops from 35 seconds to 3 seconds
V-ordering
V-ordering is Fabric’s proprietary write-time optimization that applies special sorting and compression to Parquet files. It’s enabled by default for Fabric lakehouse tables.
| Feature | Z-ordering | V-ordering |
|---|---|---|
| When applied | Manually with OPTIMIZE command | Automatically on write (default in Fabric) |
| What it does | Sorts data by specific columns for data skipping | Applies optimized compression and read-path optimization |
| Configured by | You choose the columns | Enabled by default; adjustable via table properties, session config, or the OPTIMIZE VORDER option |
| Works with | Spark engine (data skipping) | All Fabric engines (Spark, SQL endpoint, Power BI Direct Lake) |
| Relationship | Can be combined with V-ordering | Complements Z-ordering |
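The V-ordering controls mentioned above can be sketched in SQL. The property names below come from Microsoft's Fabric documentation but have changed across runtime versions, so verify them against your runtime before relying on them:

```sql
-- Enable V-Order for the current Spark session
SET spark.sql.parquet.vorder.enabled = TRUE;

-- Enable V-Order on an existing table via table properties
ALTER TABLE lakehouse.FactOrders
  SET TBLPROPERTIES ('delta.parquet.vorder.enabled' = 'true');

-- Apply V-Order while compacting
OPTIMIZE lakehouse.FactOrders VORDER;
```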
Table maintenance strategy
| Task | Frequency | Command |
|---|---|---|
| OPTIMIZE | After batch loads, daily for heavy tables | OPTIMIZE table_name |
| OPTIMIZE + Z-ORDER | Weekly or after major loads | OPTIMIZE table ZORDER BY (column) |
| VACUUM | Weekly | VACUUM table_name |
| Analyze table | After major schema or data changes | ANALYZE TABLE table_name COMPUTE STATISTICS |
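In a Fabric notebook, the routine tasks above are often scripted rather than run by hand. A minimal sketch, assuming a `run_sql` callable such as `spark.sql` (the wiring and table names are illustrative):

```python
def run_maintenance(tables, run_sql):
    """Apply routine Delta maintenance to each table.
    `run_sql` would be spark.sql in a Fabric notebook (assumed wiring)."""
    for t in tables:
        run_sql(f"OPTIMIZE {t}")  # compact small files after batch loads
        run_sql(f"VACUUM {t}")    # default 7-day retention preserves time travel

# Dry run: collect the statements instead of executing them against Spark
statements = []
run_maintenance(["lakehouse.FactOrders"], statements.append)
print(statements)  # ['OPTIMIZE lakehouse.FactOrders', 'VACUUM lakehouse.FactOrders']
```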
Check your understanding
1. Carlos runs VACUUM on a Delta table with RETAIN 2 HOURS. Later, he tries to time travel to a version from 3 hours ago. The query fails. Why?
2. A lakehouse table has 15,000 small files (average 1.5 MB each) after months of streaming micro-batches. Queries take 40+ seconds. What is the single most impactful optimization?
Next up: Optimize Spark: Speed Up Your Code — partition strategies, caching, broadcast joins, and Spark configuration tuning.