
DP-750 Study Guide

Domain 1: Set Up and Configure an Azure Databricks Environment

  • Azure Databricks: Your Lakehouse Platform Free
  • Choosing the Right Compute Free
  • Configuring Compute for Performance Free
  • Unity Catalog: The Three-Level Namespace Free
  • Tables, Views & External Catalogs Free

Domain 2: Secure and Govern Unity Catalog Objects

  • Securing Unity Catalog: Who Gets What
  • Secrets & Authentication
  • Data Discovery & Attribute-Based Access
  • Row Filters, Column Masks & Retention
  • Lineage, Audit Logs & Delta Sharing

Domain 3: Prepare and Process Data

  • Data Modeling: Ingestion Design Free
  • SCD, Granularity & Temporal Tables
  • Partitioning, Clustering & Table Optimization
  • Ingesting Data: Lakeflow Connect & Notebooks
  • Ingesting Data: SQL Methods & CDC
  • Streaming Ingestion: Structured Streaming & Event Hubs
  • Auto Loader & Declarative Pipelines
  • Cleansing & Profiling Data Free
  • Transforming & Loading Data
  • Data Quality & Schema Enforcement

Domain 4: Deploy and Maintain Data Pipelines and Workloads

  • Building Data Pipelines Free
  • Lakeflow Jobs: Create & Configure
  • Lakeflow Jobs: Schedule, Alerts & Recovery
  • Git & Version Control
  • Testing & Databricks Asset Bundles
  • Monitoring Clusters & Troubleshooting
  • Spark Performance: DAG & Query Profile
  • Optimizing Delta Tables & Azure Monitor

Domain 3: Prepare and Process Data (Premium, ~14 min read)

Auto Loader & Declarative Pipelines

Scalable file ingestion with Auto Loader and production-grade data flows with Lakeflow Spark Declarative Pipelines — the two tools that simplify lakehouse ETL.

Auto Loader

☕ Simple explanation

Auto Loader is a smart mailroom for your data files.

New files land in a storage folder. Auto Loader detects them, processes them, and loads them into a Delta table — automatically, without duplicates, even if millions of files arrive. It uses cloud notifications (Azure Event Grid) to know when new files appear, rather than scanning the entire folder every time.

Auto Loader (cloudFiles) incrementally ingests new data files as they arrive in cloud storage. It uses two file discovery modes: directory listing (scans for new files) and file notification (Azure Event Grid triggers on file creation). It provides schema inference, schema evolution, and exactly-once guarantees via checkpointing.

Auto Loader code pattern

# Auto Loader ingestion
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/schemas/raw_sales")
    .option("header", "true")
    .load("abfss://landing@storage.dfs.core.windows.net/sales/"))

# Write to Delta
(df.writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/raw_sales")
    .trigger(availableNow=True)
    .toTable("bronze.raw_sales"))

File discovery modes

| Mode | How It Works | Best For |
| --- | --- | --- |
| Directory listing (default) | Scans the directory for new files | Small to medium file volumes |
| File notification | Azure Event Grid notifies on file creation | Millions of files, low latency |
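Switching from directory listing to file notification mode is an options change on the same stream. A minimal sketch of the extra options, reusing the paths from the pattern above (option names follow the cloudFiles option family; verify values against the current Auto Loader documentation, since Event Grid also requires permissions on the storage account):

```python
# Auto Loader options that switch file discovery to notification mode.
# Paths are the same placeholders as in the pattern above.
notification_options = {
    "cloudFiles.format": "csv",
    "cloudFiles.schemaLocation": "/schemas/raw_sales",
    "cloudFiles.useNotifications": "true",  # Event Grid instead of directory listing
}

# Applied to the earlier stream:
# df = (spark.readStream.format("cloudFiles")
#       .options(**notification_options)
#       .load("abfss://landing@storage.dfs.core.windows.net/sales/"))
```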

Auto Loader vs COPY INTO

| Feature | Auto Loader | COPY INTO |
| --- | --- | --- |
| File tracking | Checkpoint-based | Internal file tracking |
| Streaming | Yes (continuous or triggered) | No (batch only) |
| File discovery | Directory listing + Event Grid | Directory listing only |
| Scale | Millions of files | Thousands of files |
| Schema inference | Automatic with evolution | Manual or inferSchema |

Exam default: Auto Loader is preferred over COPY INTO for most file ingestion scenarios due to better scalability and streaming support.
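For comparison, the same landing folder ingested with COPY INTO might look like this (a sketch reusing the table and path from the Auto Loader pattern above):

```sql
COPY INTO bronze.raw_sales
FROM 'abfss://landing@storage.dfs.core.windows.net/sales/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
```

COPY INTO is idempotent (already-loaded files are skipped on re-runs), which is why it remains a reasonable choice for occasional batch loads of a few thousand files.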

Lakeflow Spark Declarative Pipelines

☕ Simple explanation

Declarative Pipelines are like a recipe card system for your kitchen.

Instead of writing step-by-step cooking instructions (imperative code), you describe what each dish should look like (declarative). The system figures out the cooking order, handles retries, and validates quality automatically.

You define your bronze, silver, and gold tables as declarations. The pipeline engine handles execution order, dependencies, and data quality checks.

Lakeflow Spark Declarative Pipelines (formerly Delta Live Tables / DLT) is a declarative ETL framework. You define tables and views as SQL or Python declarations, and the engine handles dependency resolution, execution order, error recovery, and data quality enforcement. Pipelines support both batch and streaming, and integrate Auto Loader for file ingestion.

SQL Declarative Pipeline

-- Bronze: ingest raw files with Auto Loader
CREATE OR REFRESH STREAMING TABLE bronze_sales
AS SELECT * FROM cloud_files(
  'abfss://landing@storage.dfs.core.windows.net/sales/',
  'csv',
  map('header', 'true')
);

-- Silver: clean and validate
CREATE OR REFRESH STREAMING TABLE silver_sales (
  CONSTRAINT valid_amount EXPECT (amount > 0) ON VIOLATION DROP ROW,
  CONSTRAINT valid_date EXPECT (sale_date IS NOT NULL) ON VIOLATION FAIL UPDATE
)
AS SELECT
  sale_id, customer_id, CAST(amount AS DECIMAL(10,2)) AS amount, sale_date
FROM STREAM(LIVE.bronze_sales);

-- Gold: aggregate for reporting
CREATE OR REFRESH MATERIALIZED VIEW gold_daily_revenue
AS SELECT
  sale_date, SUM(amount) AS total_revenue, COUNT(*) AS txn_count
FROM LIVE.silver_sales
GROUP BY sale_date;
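The same silver table can also be declared in Python with the dlt module (a sketch; this code runs only inside a Declarative Pipeline, not as a standalone script):

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(name="silver_sales", comment="Cleaned and validated sales records")
@dlt.expect_or_drop("valid_amount", "amount > 0")           # DROP ROW equivalent
@dlt.expect_or_fail("valid_date", "sale_date IS NOT NULL")  # FAIL UPDATE equivalent
def silver_sales():
    # Read streaming changes from the upstream bronze table
    return (
        dlt.read_stream("bronze_sales")
        .select(
            "sale_id",
            "customer_id",
            col("amount").cast("decimal(10,2)").alias("amount"),
            "sale_date",
        )
    )
```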

Key concepts

| Concept | What It Does |
| --- | --- |
| STREAMING TABLE | Append-only table that processes incrementally |
| MATERIALIZED VIEW | Precomputed query result, refreshed automatically |
| LIVE.table_name | Reference to another table in the same pipeline |
| STREAM(LIVE.table) | Read streaming changes from an upstream table |
| Expectations | Data quality constraints (EXPECT, ON VIOLATION) |

Pipeline expectations (data quality)

| Violation Action | Behaviour |
| --- | --- |
| DROP ROW | Bad rows are dropped; drop counts appear in pipeline metrics |
| FAIL UPDATE | The pipeline update fails if any row violates the constraint |
| (no action) | Bad rows are kept; the violation is logged in metrics |
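The three violation behaviours can be sketched in plain Python (a toy model for intuition only, not how the pipeline engine is implemented):

```python
# Toy model of EXPECT ... ON VIOLATION semantics.
def apply_expectation(rows, predicate, action):
    """Return (kept_rows, metrics) after applying an EXPECT-style constraint."""
    violations = [r for r in rows if not predicate(r)]
    if action == "FAIL UPDATE" and violations:
        raise ValueError(f"{len(violations)} row(s) violated the constraint")
    if action == "DROP ROW":
        kept = [r for r in rows if predicate(r)]
    else:  # no action: keep every row, only record the violation count
        kept = list(rows)
    return kept, {"failed_records": len(violations)}

rows = [{"amount": 10}, {"amount": -5}, {"amount": 3}]
kept, metrics = apply_expectation(rows, lambda r: r["amount"] > 0, "DROP ROW")
# kept holds the two valid rows; metrics records one failed record
```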
Declarative Pipelines vs notebook pipelines

| Feature | Declarative Pipeline | Notebook Pipeline |
| --- | --- | --- |
| Approach | Declare WHAT tables look like | Write HOW to build tables |
| Dependency management | Automatic | Manual (task order) |
| Data quality | Built-in expectations | Custom validation code |
| Error recovery | Automatic retry | Manual retry logic |
| Monitoring | Pipeline event log + metrics | Spark UI + custom logging |
| Best for | Standard medallion ETL | Complex custom logic |
Question

What is Auto Loader and when should you use it over COPY INTO?


Answer

Auto Loader (cloudFiles) incrementally ingests new files using checkpoints and, optionally, Event Grid notifications. Prefer it over COPY INTO when you need to handle millions of files, streaming support, or automatic schema evolution.


Question

What are the three violation actions for pipeline expectations?


Answer

DROP ROW (remove bad rows silently), FAIL UPDATE (pipeline fails on any violation), no action (keep bad rows, log violation in metrics).


Question

What is the difference between a STREAMING TABLE and a MATERIALIZED VIEW in Declarative Pipelines?


Answer

STREAMING TABLE: append-only, processes data incrementally. MATERIALIZED VIEW: precomputed query result, fully recomputed or incrementally refreshed. Use streaming tables for raw/cleaned data, materialized views for aggregates.


🎬 Video coming soon

Knowledge Check

Mei Lin receives 50,000 new CSV files daily from Freshmart's 5,000 stores. She needs to ingest them incrementally with zero duplicates and automatic schema evolution. Which tool is best?


Next up: Cleansing & Profiling Data — data profiling, choosing column types, and handling duplicates and nulls.



© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.