Building Data Pipelines
Design the order of operations, choose between notebook and Declarative Pipelines, implement task logic and error handling, and build production pipelines.
Designing pipelines
A data pipeline is an assembly line for data.
Raw materials arrive (ingestion), get cleaned (bronze → silver), assembled into products (silver → gold), and shipped (to dashboards). You design the order of stations, decide which machines to use (notebooks or Declarative Pipelines), and plan what happens when something breaks (error handling).
Notebook vs Declarative Pipelines
| Aspect | Notebook Pipeline | Declarative Pipeline |
|---|---|---|
| Approach | Imperative — you code each step | Declarative — you define desired state |
| Dependency management | Manual (Lakeflow Jobs task graph) | Automatic (inferred from LIVE references) |
| Error handling | try/except in code | Built-in retry + expectations |
| Data quality | Custom validation code | Built-in expectations (EXPECT) |
| Compute | Any cluster type | Dedicated pipeline compute (serverless default) |
| Monitoring | Spark UI + custom logging | Pipeline event log + metrics dashboard |
| Best for | Complex logic, ML integration, custom APIs | Standard medallion ETL |
| Exam preference | When 'custom logic' is mentioned | When 'managed' or 'automated quality' is mentioned |
Notebook pipeline with Lakeflow Jobs
A notebook pipeline chains multiple notebooks as tasks in a Lakeflow Job:
```text
Task 1: ingest_raw → Task 2: clean_validate → Task 3: build_gold
  (bronze layer)        (silver layer)          (gold layer)
                                                     ↘
                                              Task 4: update_dashboard
```
Precedence constraints
Tasks can have dependencies: Task 3 only runs if Tasks 1 and 2 succeed.
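The precedence rule can be sketched in plain Python. The task names come from the diagram above; the dependency dict and the topological-sort helper are illustrative, not a Databricks API:

```python
# Minimal sketch of precedence constraints: a task becomes eligible to run
# only after all of its upstream dependencies have succeeded.
depends_on = {
    "ingest_raw": [],
    "clean_validate": ["ingest_raw"],
    "build_gold": ["ingest_raw", "clean_validate"],
    "update_dashboard": ["build_gold"],
}

def run_order(deps):
    """Return a valid execution order (simple topological sort)."""
    order, done = [], set()
    while len(done) < len(deps):
        for task, upstream in deps.items():
            if task not in done and all(u in done for u in upstream):
                order.append(task)
                done.add(task)
    return order

print(run_order(depends_on))
# ingest_raw runs first; update_dashboard runs last
```

In a real Lakeflow Job the same graph is expressed through each task's "depends on" setting, and the scheduler derives the execution order for you.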
```python
# In a notebook: signal success or failure for downstream tasks
dbutils.notebook.exit("SUCCESS")  # signals success to the job

# Or raise an error to fail the task
if error_count > threshold:
    raise Exception(f"Data quality failed: {error_count} errors exceeded threshold")
```
Error handling in notebooks
```python
try:
    # Main processing logic
    df = spark.read.table("bronze.raw_orders")
    clean_df = transform_orders(df)
    clean_df.write.mode("append").saveAsTable("silver.orders")
except Exception as e:
    # Log the error
    print(f"Pipeline failed: {e}")
    # Optionally write to an error table
    # (escape single quotes so the generated SQL stays valid)
    msg = str(e).replace("'", "''")
    spark.sql(f"INSERT INTO pipeline_errors VALUES (CURRENT_TIMESTAMP(), '{msg}')")
    # Re-raise to fail the job task
    raise
```
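Notebook pipelines also have to handle transient failures (brief storage throttling, network blips) themselves, since retries are not built in at the code level. A minimal retry-with-backoff sketch in plain Python; the `flaky_step` function and the retry counts are illustrative:

```python
import time

def with_retries(fn, max_attempts=3, base_delay=0.1):
    """Call fn(), retrying with exponential backoff on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as e:
            if attempt == max_attempts:
                raise  # out of retries: let the task fail
            print(f"Attempt {attempt} failed ({e}); retrying...")
            time.sleep(base_delay * 2 ** (attempt - 1))

# Illustrative flaky step: fails twice, then succeeds
calls = {"n": 0}
def flaky_step():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "done"

print(with_retries(flaky_step))  # succeeds on the third attempt
```

At the job level, Lakeflow Jobs can also be configured with per-task retries, which is usually simpler than hand-rolled loops for whole-task failures.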
Declarative Pipeline implementation
```sql
-- Complete medallion pipeline in Declarative Pipelines

-- Bronze: Auto Loader ingestion
CREATE OR REFRESH STREAMING TABLE bronze_orders
AS SELECT * FROM cloud_files(
  'abfss://landing@storage.dfs.core.windows.net/orders/',
  'json'
);

-- Silver: cleaned with quality expectations
CREATE OR REFRESH STREAMING TABLE silver_orders (
  CONSTRAINT valid_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT positive_amount EXPECT (amount > 0) ON VIOLATION DROP ROW
)
AS SELECT
  CAST(order_id AS BIGINT) AS order_id,
  customer_id,
  CAST(amount AS DECIMAL(10,2)) AS amount,
  TO_DATE(order_date) AS order_date
FROM STREAM(LIVE.bronze_orders);

-- Gold: materialized view for dashboards
CREATE OR REFRESH MATERIALIZED VIEW gold_daily_summary
AS SELECT
  order_date,
  COUNT(*) AS order_count,
  SUM(amount) AS total_revenue
FROM LIVE.silver_orders
GROUP BY order_date;
```
The pipeline engine automatically:
- Determines execution order from `LIVE.` references
- Handles incremental processing (streaming tables)
- Enforces expectations and logs quality metrics
- Retries on transient failures
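The `EXPECT ... ON VIOLATION DROP ROW` behavior can be sketched in plain Python: each constraint is a predicate, failing rows are dropped, and violations are counted per constraint. The constraint names and sample rows mirror the SQL above; this is an illustration of the semantics, not the engine's implementation:

```python
# Illustrative sketch of expectation semantics.
rows = [
    {"order_id": 1, "amount": 25.00},
    {"order_id": None, "amount": 10.00},  # violates valid_id
    {"order_id": 3, "amount": -5.00},     # violates positive_amount
]

expectations = {
    "valid_id": lambda r: r["order_id"] is not None,
    "positive_amount": lambda r: r["amount"] > 0,
}

kept, violations = [], {name: 0 for name in expectations}
for row in rows:
    failed = [name for name, check in expectations.items() if not check(row)]
    for name in failed:
        violations[name] += 1
    if not failed:
        kept.append(row)  # DROP ROW: only fully valid rows pass through

print(kept)        # only the first row survives
print(violations)  # {'valid_id': 1, 'positive_amount': 1}
```

In a real pipeline these per-constraint counts are what surface in the event log and quality metrics dashboard.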
Exam decision tree: notebook or declarative?
- Standard bronze → silver → gold ETL? → Declarative Pipeline
- Need custom Python ML models in the pipeline? → Notebook
- Need built-in data quality metrics? → Declarative Pipeline
- Need to call external APIs during processing? → Notebook
- Want automatic dependency management? → Declarative Pipeline
- Need complex control flow (if/else branching)? → Notebook
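For exam practice, the decision tree can be encoded as a small helper. The requirement keywords are taken from the bullets above; the function itself is illustrative:

```python
# Illustrative encoding of the decision tree: any requirement that
# demands custom logic pushes toward a notebook pipeline.
NOTEBOOK_SIGNALS = {"custom ML", "external APIs", "complex control flow"}

def choose_approach(requirements):
    """Return 'Notebook' if any requirement needs custom logic,
    otherwise 'Declarative Pipeline'."""
    if NOTEBOOK_SIGNALS & set(requirements):
        return "Notebook"
    return "Declarative Pipeline"

print(choose_approach({"standard ETL", "data quality metrics"}))
# Declarative Pipeline
print(choose_approach({"standard ETL", "external APIs"}))
# Notebook
```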
Knowledge check
Dr. Sarah Okafor needs to build a standard bronze → silver → gold ETL pipeline for Athena Group. The pipeline should have built-in data quality checks and automatic dependency management. Which approach should she use?
Next up: Lakeflow Jobs: Create & Configure — creating, configuring, and triggering Lakeflow Jobs.