
DP-750 Study Guide

Domain 1: Set Up and Configure an Azure Databricks Environment

  • Azure Databricks: Your Lakehouse Platform Free
  • Choosing the Right Compute Free
  • Configuring Compute for Performance Free
  • Unity Catalog: The Three-Level Namespace Free
  • Tables, Views & External Catalogs Free

Domain 2: Secure and Govern Unity Catalog Objects

  • Securing Unity Catalog: Who Gets What
  • Secrets & Authentication
  • Data Discovery & Attribute-Based Access
  • Row Filters, Column Masks & Retention
  • Lineage, Audit Logs & Delta Sharing

Domain 3: Prepare and Process Data

  • Data Modeling: Ingestion Design Free
  • SCD, Granularity & Temporal Tables
  • Partitioning, Clustering & Table Optimization
  • Ingesting Data: Lakeflow Connect & Notebooks
  • Ingesting Data: SQL Methods & CDC
  • Streaming Ingestion: Structured Streaming & Event Hubs
  • Auto Loader & Declarative Pipelines
  • Cleansing & Profiling Data Free
  • Transforming & Loading Data
  • Data Quality & Schema Enforcement

Domain 4: Deploy and Maintain Data Pipelines and Workloads

  • Building Data Pipelines Free
  • Lakeflow Jobs: Create & Configure
  • Lakeflow Jobs: Schedule, Alerts & Recovery
  • Git & Version Control
  • Testing & Databricks Asset Bundles
  • Monitoring Clusters & Troubleshooting
  • Spark Performance: DAG & Query Profile
  • Optimizing Delta Tables & Azure Monitor

Domain 4: Deploy and Maintain Data Pipelines and Workloads

Optimizing Delta Tables & Azure Monitor

OPTIMIZE for compaction, VACUUM for cleanup, log streaming to Azure Monitor, and configuring alerts β€” the final operational layer for production lakehouses.

Delta table maintenance

β˜• Simple explanation

OPTIMIZE is defragmenting your hard drive. VACUUM is emptying the recycle bin.

Over time, many small writes create lots of tiny files (fragmentation). OPTIMIZE merges them into larger, more efficient files. VACUUM removes old files that are no longer needed (from deleted data, old versions).

Without regular maintenance, queries slow down and storage costs creep up.

OPTIMIZE compacts small files into larger ones (target: ~1 GB per file) and optionally applies Z-ordering. VACUUM removes files no longer referenced by the current table version, reclaiming storage. Both are essential for production Delta table performance and cost management.

OPTIMIZE

-- Compact small files into larger ones
OPTIMIZE prod_sales.curated.daily_revenue;

-- Compact with Z-ordering on specific columns
OPTIMIZE prod_sales.curated.daily_revenue
  ZORDER BY (region, product_category);
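The intuition behind Z-ordering can be sketched in plain Python: interleaving the bits of two column values produces a single sort key that keeps rows close in either column physically close on disk. This is an illustrative sketch only, not Databricks' actual implementation (which computes range-based Z-values over arbitrary column types):

```python
def interleave_bits(x: int, y: int, bits: int = 8) -> int:
    """Interleave the bits of x and y into one Z-order value.

    Rows whose (x, y) pairs are close in both dimensions end up
    with nearby Z-values, so sorting by Z-value co-locates them.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bits of x at even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # bits of y at odd positions
    return z

# Sorting rows by their Z-value groups neighbours in BOTH columns together,
# which is what lets data skipping prune files on either column.
rows = [(0, 0), (0, 1), (1, 0), (1, 1), (7, 7)]
rows.sort(key=lambda r: interleave_bits(*r))
```

A plain `ORDER BY region, product_category` would only cluster well on the leading column; the interleaved key is why ZORDER helps queries that filter on either column.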

What OPTIMIZE does:

  1. Identifies files smaller than the target size (~1 GB)
  2. Reads the small files
  3. Writes new, larger files
  4. Updates the Delta log to point to new files
  5. Old small files remain until VACUUM removes them
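The file-selection and bin-packing idea behind steps 1-3 can be sketched in Python. This is a simplified model under assumed names; the real OPTIMIZE command also rewrites the data files and commits a new Delta log entry:

```python
TARGET_MB = 1024  # target output file size (~1 GB)

def plan_compaction(file_sizes_mb: list, target: int = TARGET_MB) -> list:
    """Group files smaller than `target` into bins of roughly `target` MB.

    Mirrors OPTIMIZE steps 1-3: pick the small files, then pack them
    into fewer, larger output files. Illustrative sketch only.
    """
    small = [s for s in file_sizes_mb if s < target]   # step 1: find small files
    bins, current, current_size = [], [], 0
    for size in sorted(small, reverse=True):           # steps 2-3: pack into bins
        if current and current_size + size > target:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

# Ten 128 MB files pack into one full ~1 GB output plus one remainder file;
# a file already at or above the target is left alone.
plan = plan_compaction([128] * 10)
```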

When to run OPTIMIZE

Scenario | Frequency
Table with frequent small appends (streaming) | After each batch or on a schedule (hourly/daily)
Table with infrequent large writes | Rarely needed
After MERGE operations | After each MERGE (creates small files)

Predictive optimization

Databricks can also maintain tables automatically. On Unity Catalog managed tables, predictive optimization decides when to run OPTIMIZE and VACUUM for you. On any Delta table, you can enable write-time optimization with table properties:

-- Enable auto-optimization
ALTER TABLE prod_sales.curated.daily_revenue
  SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact' = 'true'
  );

  • optimizeWrite: coalesces small output files during writes
  • autoCompact: runs a lightweight compaction automatically after writes
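The auto-compaction trigger can be modelled as a simple threshold check after each write. The constants below are illustrative stand-ins, not guaranteed defaults (Databricks exposes the real knob as the table property delta.autoCompact.minNumFiles):

```python
MIN_NUM_FILES = 50   # assumed trigger threshold (delta.autoCompact.minNumFiles)
SMALL_MB = 128       # assumed "small file" cutoff for this sketch

def should_auto_compact(file_sizes_mb, min_num_files=MIN_NUM_FILES,
                        small_mb=SMALL_MB):
    """Return True when a write has left enough small files behind
    that an automatic compaction should run. Illustrative model only."""
    small = [s for s in file_sizes_mb if s < small_mb]
    return len(small) >= min_num_files
```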

VACUUM

-- Remove files older than 7 days (default retention)
VACUUM prod_sales.curated.daily_revenue;

-- Remove files older than 30 days
VACUUM prod_sales.curated.daily_revenue RETAIN 720 HOURS;

-- Dry run: see what would be deleted without deleting
VACUUM prod_sales.curated.daily_revenue DRY RUN;

Safety rules:

  • Default retention: 7 days (168 hours)
  • Setting retention below 7 days requires disabling the safety check (spark.databricks.delta.retentionDurationCheck.enabled = false)
  • VACUUMed files cannot be recovered β€” time travel to deleted versions becomes impossible
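The retention rule reduces to a timestamp cutoff: a file is deletable only if it is unreferenced by the current version and older than the retention window. A minimal sketch of that check (a model of the behaviour, not the actual VACUUM implementation):

```python
from datetime import datetime, timedelta, timezone

DEFAULT_RETENTION_HOURS = 168  # 7 days

def vacuum_candidates(unreferenced_files, retain_hours=DEFAULT_RETENTION_HOURS,
                      now=None):
    """Return the unreferenced files a VACUUM would delete.

    `unreferenced_files` maps file path -> last-modified datetime.
    Files younger than the cutoff survive, which is what protects
    time travel and in-flight readers.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=retain_hours)
    return [path for path, modified in unreferenced_files.items()
            if modified < cutoff]
```

Shortening `retain_hours` widens the deletable set, which is exactly why a low RETAIN value can break time travel for older versions.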
OPTIMIZE | VACUUM
Compacts small files into larger ones | Deletes old, unreferenced files
Improves READ performance | Reduces STORAGE cost
Creates new files (doesn't delete old) | Deletes old files
Safe to run anytime | Irreversible β€” breaks time travel for deleted versions
Run frequently for streaming tables | Run on a schedule (daily/weekly)

Azure Monitor integration

Log streaming

Stream Databricks logs to Azure Log Analytics for centralised monitoring:

  1. Configure diagnostic settings on the Databricks workspace
  2. Select log categories to stream:
    • Cluster events (start, stop, resize)
    • Job run events (start, complete, fail)
    • Notebook events (execution logs)
  3. Destination: Log Analytics workspace

Once configured, query Databricks logs with KQL (Kusto Query Language) in Log Analytics:

// Find failed jobs in the last 24 hours
DatabricksJobs
| where TimeGenerated > ago(24h)
| where ActionName == "runFailed"
| project TimeGenerated, JobId, RunName, ErrorMessage
| order by TimeGenerated desc
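The same filter can be expressed in plain Python over a list of log records, which makes the KQL semantics concrete. The field names mirror the query above and are assumptions about the Log Analytics table schema:

```python
from datetime import datetime, timedelta, timezone

def failed_runs_last_24h(logs, now=None):
    """Python equivalent of the KQL query: keep runFailed events from
    the last 24 hours, newest first. `logs` is a list of dicts with
    (assumed) keys TimeGenerated and ActionName."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=24)                      # where TimeGenerated > ago(24h)
    rows = [r for r in logs
            if r["TimeGenerated"] > cutoff
            and r["ActionName"] == "runFailed"]             # where ActionName == "runFailed"
    return sorted(rows, key=lambda r: r["TimeGenerated"],
                  reverse=True)                             # order by TimeGenerated desc
```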

Azure Monitor alerts

Configure alerts that fire when conditions are met:

Alert Condition | What It Monitors
Job failure count > 0 | Any pipeline failure
Cluster CPU > 90% for 10 minutes | Resource bottleneck
DBU consumption > daily budget | Cost overrun
No job runs in 2 hours | Missed scheduled runs

Alert flow:
Databricks logs β†’ Log Analytics β†’ Alert Rule β†’ Action Group β†’ Notification
                                                (email, Teams, PagerDuty)
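The alert conditions in the table reduce to simple threshold checks over a metrics snapshot. A hedged sketch of that evaluation logic (the metric names are illustrative, not an Azure Monitor API; in practice each check is an alert rule wired to an action group):

```python
def evaluate_alerts(metrics, daily_dbu_budget):
    """Evaluate the four alert conditions against a metrics snapshot.

    `metrics` is a dict of illustrative metric names -> values;
    returns the list of alerts that would fire.
    """
    fired = []
    if metrics.get("failed_job_count", 0) > 0:          # Job failure count > 0
        fired.append("Job failure")
    if metrics.get("cluster_cpu_pct_10m", 0) > 90:      # CPU > 90% for 10 min
        fired.append("Cluster CPU bottleneck")
    if metrics.get("dbu_today", 0) > daily_dbu_budget:  # DBU > daily budget
        fired.append("DBU budget exceeded")
    if metrics.get("hours_since_last_run", 0) > 2:      # no runs in 2 hours
        fired.append("Missed scheduled runs")
    return fired
```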

Ravi sets up Azure Monitor alerts at DataPulse: if any nightly ETL job fails or if daily DBU consumption exceeds the budget, the team gets an immediate Teams notification.

Question

What is the difference between OPTIMIZE and VACUUM?


Answer

OPTIMIZE compacts small files into larger ones (improves read performance). VACUUM deletes old unreferenced files (reduces storage cost). OPTIMIZE creates new files; VACUUM removes old ones.


Question

Why is VACUUM's default retention 7 days?


Answer

The 7-day retention protects time travel and concurrent readers. Files younger than 7 days might still be referenced by active queries or time travel requests. Reducing below 7 days risks breaking these operations.


Question

How do you stream Databricks logs to Azure Monitor?


Answer

Configure diagnostic settings on the Databricks workspace, select log categories (jobs, clusters, notebooks), and set the destination to a Log Analytics workspace. Query with KQL in Log Analytics.


Knowledge check

Mei Lin's Freshmart streaming pipeline appends thousands of small files per hour to a Delta table. Query performance has degraded significantly. What should she run?

Knowledge Check

Dr. Sarah Okafor wants to be alerted within 5 minutes whenever any production Databricks job fails at Athena Group. The alert should go to the on-call team's Microsoft Teams channel. What should she configure?


πŸŽ‰ Congratulations! You’ve completed all 28 modules of the DP-750 study guide. Ready to test your knowledge? Head to the Practice Questions to prepare for exam day.


Guided

I learn, I simplify, I share.

A Guide to Cloud YouTube Feedback

© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.