Optimizing Delta Tables & Azure Monitor
OPTIMIZE for compaction, VACUUM for cleanup, log streaming to Azure Monitor, and configuring alerts: the final operational layer for production lakehouses.
Delta table maintenance
OPTIMIZE is like defragmenting your hard drive. VACUUM is like emptying the recycle bin.
Over time, many small writes create lots of tiny files (fragmentation). OPTIMIZE merges them into larger, more efficient files. VACUUM removes old files that are no longer needed (from deleted data, old versions).
Without regular maintenance, queries slow down and storage costs creep up.
OPTIMIZE
-- Compact small files into larger ones
OPTIMIZE prod_sales.curated.daily_revenue;
-- Compact with Z-ordering on specific columns
OPTIMIZE prod_sales.curated.daily_revenue
ZORDER BY (region, product_category);
What OPTIMIZE does:
- Identifies files smaller than the target size (~1 GB)
- Reads the small files
- Writes new, larger files
- Updates the Delta log to point to new files
- Old small files remain until VACUUM removes them
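You can observe the effect of the steps above directly. A sketch using the standard `DESCRIBE DETAIL` command (which reports `numFiles` and `sizeInBytes` for a Delta table); the table name is the example table from this module:

```sql
-- Inspect fragmentation before compacting: note numFiles and sizeInBytes
DESCRIBE DETAIL prod_sales.curated.daily_revenue;

-- OPTIMIZE returns metrics such as numFilesAdded and numFilesRemoved
OPTIMIZE prod_sales.curated.daily_revenue;

-- Confirm the file count dropped (total size stays roughly the same)
DESCRIBE DETAIL prod_sales.curated.daily_revenue;
```

If `numFiles` barely changes, the table was already well compacted and the scheduled OPTIMIZE can run less often.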
When to run OPTIMIZE
| Scenario | Frequency |
|---|---|
| Table with frequent small appends (streaming) | After each batch or on a schedule (hourly/daily) |
| Table with infrequent large writes | Rarely needed |
| After MERGE operations | After each MERGE (creates small files) |
Auto-optimization
Databricks can also optimize tables automatically via table properties:
-- Enable auto-optimization
ALTER TABLE prod_sales.curated.daily_revenue
SET TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true',
'delta.autoOptimize.autoCompact' = 'true'
);
- optimizeWrite: coalesces small output files during writes
- autoCompact: automatically compacts small files after writes (a lightweight compaction; it does not apply Z-ordering)
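The same behavior can be enabled per session rather than per table. A hedged sketch; the Spark configuration keys below are the commonly documented ones, but confirm them against your Databricks runtime version:

```sql
-- Session-level equivalents of the table properties above
-- (key names assumed from the Delta/Databricks docs; verify for your runtime)
SET spark.databricks.delta.optimizeWrite.enabled = true;
SET spark.databricks.delta.autoCompact.enabled = true;

-- Check which auto-optimization properties a specific table already has
SHOW TBLPROPERTIES prod_sales.curated.daily_revenue;
```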
VACUUM
-- Remove files older than 7 days (default retention)
VACUUM prod_sales.curated.daily_revenue;
-- Remove files older than 30 days
VACUUM prod_sales.curated.daily_revenue RETAIN 720 HOURS;  -- 30 days × 24
-- Dry run: see what would be deleted without deleting
VACUUM prod_sales.curated.daily_revenue DRY RUN;
Safety rules:
- Default retention: 7 days (168 hours)
- Setting retention below 7 days requires disabling the safety check
- VACUUMed files cannot be recovered: time travel to deleted versions becomes impossible
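If you genuinely need a retention window shorter than 7 days, the safety check must be disabled first. A sketch, assuming the standard Delta configuration key; treat this as a deliberate, audited action, not a default:

```sql
-- The check blocks RETAIN below 168 hours. Disable it only when you are
-- certain no reader, stream, or time-travel query needs the older versions.
SET spark.databricks.delta.retentionDurationCheck.enabled = false;

VACUUM prod_sales.curated.daily_revenue RETAIN 48 HOURS;
```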
| OPTIMIZE | VACUUM |
|---|---|
| Compacts small files into larger ones | Deletes old, unreferenced files |
| Improves READ performance | Reduces STORAGE cost |
| Creates new files (doesn't delete old) | Deletes old files |
| Safe to run anytime | Irreversible: breaks time travel for deleted versions |
| Run frequently for streaming tables | Run on a schedule (daily/weekly) |
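The two commands complement each other, so they are often scheduled together. A minimal sketch of a weekly maintenance routine, using the example table from this module:

```sql
-- Compact first: OPTIMIZE writes new large files and leaves the old
-- small files behind in storage.
OPTIMIZE prod_sales.curated.daily_revenue
ZORDER BY (region, product_category);

-- Then clean up: VACUUM deletes unreferenced files once they age past
-- the retention window (default 7 days / 168 hours).
VACUUM prod_sales.curated.daily_revenue;
```

Because VACUUM honours the retention window, running it immediately after OPTIMIZE is safe; it only removes files from compactions older than the window.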
Azure Monitor integration
Log streaming
Stream Databricks logs to Azure Log Analytics for centralized monitoring:
- Configure diagnostic settings on the Databricks workspace
- Select log categories to stream:
- Cluster events (start, stop, resize)
- Job run events (start, complete, fail)
- Notebook events (execution logs)
- Destination: Log Analytics workspace
Once configured, query Databricks logs with KQL (Kusto Query Language) in Log Analytics:
// Find failed jobs in the last 24 hours
DatabricksJobs
| where TimeGenerated > ago(24h)
| where ActionName == "runFailed"
| project TimeGenerated, JobId, RunName, ErrorMessage
| order by TimeGenerated desc
Azure Monitor alerts
Configure alerts that fire when conditions are met:
| Alert Condition | What It Monitors |
|---|---|
| Job failure count > 0 | Any pipeline failure |
| Cluster CPU > 90% for 10 minutes | Resource bottleneck |
| DBU consumption > daily budget | Cost overrun |
| No job runs in 2 hours | Missed scheduled runs |
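The "no job runs in 2 hours" condition can be expressed as a log alert query. A sketch in KQL; the `DatabricksJobs` table matches the earlier query, but the exact `ActionName` value for run starts varies by diagnostic log schema, so confirm it against your workspace's logs:

```kusto
// Alert query: count job-run start events in the last 2 hours
DatabricksJobs
| where TimeGenerated > ago(2h)
| where ActionName == "runStart"   // assumed action name; verify in your logs
| summarize RunCount = count()
```

In the alert rule, set the condition to fire when `RunCount` equals 0, with a 2-hour evaluation window.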
Alert flow:
Databricks logs → Log Analytics → Alert Rule → Action Group → Notification
(email, Teams, PagerDuty)
Ravi sets up Azure Monitor alerts at DataPulse: if any nightly ETL job fails or if daily DBU consumption exceeds the budget, the team gets an immediate Teams notification.
🎬 Video coming soon
Knowledge check
Mei Lin's Freshmart streaming pipeline appends thousands of small files per hour to a Delta table. Query performance has degraded significantly. What should she run?
Dr. Sarah Okafor wants to be alerted within 5 minutes whenever any production Databricks job fails at Athena Group. The alert should go to the on-call team's Microsoft Teams channel. What should she configure?
🎉 Congratulations! You've completed all 28 modules of the DP-750 study guide. Ready to test your knowledge? Head to the Practice Questions to prepare for exam day.