Auto Loader & Declarative Pipelines
Scalable file ingestion with Auto Loader and production-grade data flows with Lakeflow Spark Declarative Pipelines — the two tools that simplify lakehouse ETL.
Auto Loader
Auto Loader is a smart mailroom for your data files.
New files land in a storage folder. Auto Loader detects them, processes them, and loads them into a Delta table automatically, without duplicates, even if millions of files arrive. In its default directory-listing mode it scans the input path for files it has not yet seen; in file notification mode it subscribes to cloud events (Azure Event Grid on Azure) so it learns about new files without listing the folder at all.
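The "no duplicates" guarantee comes from checkpointing: Auto Loader records every file it has already ingested, so a restart or rerun never processes the same file twice. As a toy illustration of that idea in plain Python (not Auto Loader's actual implementation, which persists this state in a RocksDB store inside the checkpoint):

```python
# Toy model of checkpoint-based file discovery (illustration only).
def discover_new_files(all_files, checkpoint):
    """Return only files not seen before, and mark them as processed."""
    new_files = [f for f in sorted(all_files) if f not in checkpoint]
    checkpoint.update(new_files)
    return new_files

checkpoint = set()
print(discover_new_files(["a.csv", "b.csv"], checkpoint))           # first run: both files
print(discover_new_files(["a.csv", "b.csv", "c.csv"], checkpoint))  # rerun: only c.csv
```

Because the checkpoint survives restarts, re-listing the same files is harmless: only genuinely new files flow downstream.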
Auto Loader code pattern
```python
# Auto Loader ingestion: incrementally read new CSV files as a stream
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/schemas/raw_sales")
    .option("header", "true")
    .load("abfss://landing@storage.dfs.core.windows.net/sales/"))

# Write to Delta; the checkpoint tracks which files have been processed
(df.writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/raw_sales")
    .trigger(availableNow=True)
    .toTable("bronze.raw_sales"))
```
File discovery modes
| Mode | How It Works | Best For |
|---|---|---|
| Directory listing (default) | Scans directory for new files | Small to medium file volumes |
| File notification | Azure Event Grid notifies on file creation | Millions of files, low latency |
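Switching modes is a single reader option. A sketch (runs only on Databricks, and file notification mode needs cloud permissions to create the Event Grid subscription and queue; the paths reuse the example above):

```python
# File notification mode: Auto Loader receives Event Grid notifications
# instead of listing the directory on every micro-batch.
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.useNotifications", "true")  # default "false" = directory listing
    .option("cloudFiles.schemaLocation", "/schemas/raw_sales")
    .load("abfss://landing@storage.dfs.core.windows.net/sales/"))
```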
Auto Loader vs COPY INTO
| Feature | Auto Loader | COPY INTO |
|---|---|---|
| File tracking | Checkpoint-based | Internal file tracking |
| Streaming | Yes (continuous or triggered) | No (batch only) |
| File discovery | Directory listing + Event Grid | Directory listing only |
| Scale | Millions of files | Thousands of files |
| Schema inference | Automatic with evolution | Manual or inferSchema |
Exam default: Auto Loader is preferred over COPY INTO for most file ingestion scenarios due to better scalability and streaming support.
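For contrast, the equivalent batch ingestion with COPY INTO looks like this (a sketch; the table and path names reuse the Auto Loader example). COPY INTO is also idempotent, because files already loaded are skipped on re-run, but it executes only as a batch command:

```sql
-- Batch ingestion with COPY INTO; re-running skips already-loaded files
COPY INTO bronze.raw_sales
FROM 'abfss://landing@storage.dfs.core.windows.net/sales/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
```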
Lakeflow Spark Declarative Pipelines
Declarative Pipelines are like a recipe card system for your kitchen.
Instead of writing step-by-step cooking instructions (imperative code), you describe what each dish should look like (declarative). The system figures out the cooking order, handles retries, and validates quality automatically.
You define your bronze, silver, and gold tables as declarations. The pipeline engine handles execution order, dependencies, and data quality checks.
SQL Declarative Pipeline
```sql
-- Bronze: ingest raw files with Auto Loader
CREATE OR REFRESH STREAMING TABLE bronze_sales
AS SELECT * FROM cloud_files(
  'abfss://landing@storage.dfs.core.windows.net/sales/',
  'csv',
  map('header', 'true')
);

-- Silver: clean and validate
CREATE OR REFRESH STREAMING TABLE silver_sales (
  CONSTRAINT valid_amount EXPECT (amount > 0) ON VIOLATION DROP ROW,
  CONSTRAINT valid_date EXPECT (sale_date IS NOT NULL) ON VIOLATION FAIL UPDATE
)
AS SELECT
  sale_id, customer_id, CAST(amount AS DECIMAL(10,2)) AS amount, sale_date
FROM STREAM(LIVE.bronze_sales);

-- Gold: aggregate for reporting
CREATE OR REFRESH MATERIALIZED VIEW gold_daily_revenue
AS SELECT
  sale_date, SUM(amount) AS total_revenue, COUNT(*) AS txn_count
FROM LIVE.silver_sales
GROUP BY sale_date;
```
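The same bronze and silver tables can be declared in Python with the `dlt` module instead of SQL. A sketch (this runs only inside a Databricks pipeline; the names mirror the SQL example):

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(name="bronze_sales", comment="Raw sales files ingested with Auto Loader")
def bronze_sales():
    return (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .load("abfss://landing@storage.dfs.core.windows.net/sales/"))

@dlt.table(name="silver_sales", comment="Cleaned and validated sales")
@dlt.expect_or_drop("valid_amount", "amount > 0")       # ON VIOLATION DROP ROW
@dlt.expect_or_fail("valid_date", "sale_date IS NOT NULL")  # ON VIOLATION FAIL UPDATE
def silver_sales():
    return (dlt.read_stream("bronze_sales")
        .select("sale_id", "customer_id",
                col("amount").cast("decimal(10,2)").alias("amount"),
                "sale_date"))
```

Each decorated function declares one table; the pipeline engine infers the dependency graph from the `dlt.read_stream` references, just as it does from `STREAM(LIVE.…)` in SQL.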
Key concepts
| Concept | What It Does |
|---|---|
| STREAMING TABLE | Append-only table that processes incrementally |
| MATERIALIZED VIEW | Precomputed query result, refreshed automatically |
| LIVE.table_name | Reference to another table in the same pipeline |
| STREAM(LIVE.table) | Read streaming changes from an upstream table |
| Expectations | Data quality constraints (EXPECT, ON VIOLATION) |
Pipeline expectations (data quality)
| Violation Action | Behaviour |
|---|---|
| DROP ROW | Bad rows are removed from the target; drop counts are recorded in pipeline metrics |
| FAIL UPDATE | Pipeline fails if any row violates the constraint |
| (no action) | Bad rows are kept, violation is logged in metrics |
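The three violation actions can be modelled in a few lines of plain Python (an illustration of the semantics only, not pipeline code):

```python
# Toy model of expectation violation actions (illustration only).
def apply_expectation(rows, predicate, action):
    """Return (kept_rows, violation_count) according to the violation action."""
    bad = [r for r in rows if not predicate(r)]
    if action == "FAIL UPDATE" and bad:
        raise ValueError(f"{len(bad)} row(s) violated the expectation")
    kept = [r for r in rows if predicate(r)] if action == "DROP ROW" else rows
    return kept, len(bad)

rows = [{"amount": 10}, {"amount": -5}, {"amount": 3}]
valid = lambda r: r["amount"] > 0

print(apply_expectation(rows, valid, "DROP ROW"))  # bad row removed, count = 1
print(apply_expectation(rows, valid, None))        # all rows kept, count = 1
```

Note that even with no action, the violation count is still tracked, which is how the pipeline event log surfaces quality metrics without altering the data.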
Declarative Pipelines vs notebook pipelines
| Feature | Declarative Pipeline | Notebook Pipeline |
|---|---|---|
| Approach | Declare WHAT tables look like | Write HOW to build tables |
| Dependency management | Automatic | Manual (task order) |
| Data quality | Built-in expectations | Custom validation code |
| Error recovery | Automatic retry | Manual retry logic |
| Monitoring | Pipeline event log + metrics | Spark UI + custom logging |
| Best for | Standard medallion ETL | Complex custom logic |
🎬 Video coming soon
Knowledge check
Mei Lin receives 50,000 new CSV files daily from Freshmart's 5,000 stores. She needs to ingest them incrementally with zero duplicates and automatic schema evolution. Which tool is best?
Next up: Cleansing & Profiling Data — data profiling, choosing column types, and handling duplicates and nulls.