Building Data Pipelines
Design the order of operations, choose between notebook and Declarative Pipelines, implement task logic and error handling, and build production pipelines.
Designing pipelines
A data pipeline is an assembly line for data.
Raw materials arrive (ingestion), get cleaned (bronze → silver), assembled into products (silver → gold), and shipped (to dashboards). You design the order of stations, decide which machines to use (notebooks or Declarative Pipelines), and plan what happens when something breaks (error handling).
Notebook vs Declarative Pipelines
| Aspect | Notebook Pipeline | Declarative Pipeline |
|---|---|---|
| Approach | Imperative — you code each step | Declarative — you define desired state |
| Dependency management | Manual (Lakeflow Jobs task graph) | Automatic (inferred from LIVE references) |
| Error handling | try/except in code | Built-in retry + expectations |
| Data quality | Custom validation code | Built-in expectations (EXPECT) |
| Compute | Any cluster type | Dedicated pipeline compute (serverless default) |
| Monitoring | Spark UI + custom logging | Pipeline event log + metrics dashboard |
| Best for | Complex logic, ML integration, custom APIs | Standard medallion ETL |
| Exam preference | When 'custom logic' is mentioned | When 'managed' or 'automated quality' is mentioned |
Notebook pipeline with Lakeflow Jobs
A notebook pipeline chains multiple notebooks as tasks in a Lakeflow Job:
```text
Task 1: ingest_raw → Task 2: clean_validate → Task 3: build_gold
  (bronze layer)        (silver layer)          (gold layer)
                                                     ↘
                                              Task 4: update_dashboard
```
Precedence constraints
Tasks can have dependencies: Task 3 only runs if Tasks 1 and 2 succeed.
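The precedence rule can be sketched in plain Python. The task names come from the diagram above; the dependency dict and the topological-sort helper are illustrative, not a Databricks API:

```python
# Minimal sketch of precedence constraints: a task becomes eligible to run
# only after all of its upstream dependencies have succeeded.
depends_on = {
    "ingest_raw": [],
    "clean_validate": ["ingest_raw"],
    "build_gold": ["ingest_raw", "clean_validate"],
    "update_dashboard": ["build_gold"],
}

def run_order(deps):
    """Return a valid execution order (simple topological sort)."""
    order, done = [], set()
    while len(done) < len(deps):
        for task, upstream in deps.items():
            if task not in done and all(u in done for u in upstream):
                order.append(task)
                done.add(task)
    return order

print(run_order(depends_on))
# ingest_raw runs first; update_dashboard runs last
```

In a real Lakeflow Job the same graph is expressed through each task's "depends on" setting, and the scheduler derives the execution order for you.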
```python
# In a notebook: signal success or failure for downstream tasks
dbutils.notebook.exit("SUCCESS")  # signals success to the job

# Or raise an error to fail the task
if error_count > threshold:
    raise Exception(f"Data quality failed: {error_count} errors exceeded threshold")
```
Error handling in notebooks
```python
try:
    # Main processing logic
    df = spark.read.table("bronze.raw_orders")
    clean_df = transform_orders(df)
    clean_df.write.mode("append").saveAsTable("silver.orders")
except Exception as e:
    # Log the error
    print(f"Pipeline failed: {e}")
    # Optionally write to an error table
    # (escape single quotes so the generated SQL stays valid)
    msg = str(e).replace("'", "''")
    spark.sql(f"INSERT INTO pipeline_errors VALUES (CURRENT_TIMESTAMP(), '{msg}')")
    # Re-raise to fail the job task
    raise
```
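Notebook pipelines also have to handle transient failures (brief storage throttling, network blips) themselves, since retries are not built in at the code level. A minimal retry-with-backoff sketch in plain Python; the `flaky_step` function and the retry counts are illustrative:

```python
import time

def with_retries(fn, max_attempts=3, base_delay=0.1):
    """Call fn(), retrying with exponential backoff on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as e:
            if attempt == max_attempts:
                raise  # out of retries: let the task fail
            print(f"Attempt {attempt} failed ({e}); retrying...")
            time.sleep(base_delay * 2 ** (attempt - 1))

# Illustrative flaky step: fails twice, then succeeds
calls = {"n": 0}
def flaky_step():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "done"

print(with_retries(flaky_step))  # succeeds on the third attempt
```

At the job level, Lakeflow Jobs can also be configured with per-task retries, which is usually simpler than hand-rolled loops for whole-task failures.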
Declarative Pipeline implementation
```sql
-- Complete medallion pipeline in Declarative Pipelines

-- Bronze: Auto Loader ingestion
CREATE OR REFRESH STREAMING TABLE bronze_orders
AS SELECT * FROM cloud_files(
  'abfss://landing@storage.dfs.core.windows.net/orders/',
  'json'
);

-- Silver: cleaned with quality expectations
CREATE OR REFRESH STREAMING TABLE silver_orders (
  CONSTRAINT valid_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT positive_amount EXPECT (amount > 0) ON VIOLATION DROP ROW
)
AS SELECT
  CAST(order_id AS BIGINT) AS order_id,
  customer_id,
  CAST(amount AS DECIMAL(10,2)) AS amount,
  TO_DATE(order_date) AS order_date
FROM STREAM(LIVE.bronze_orders);

-- Gold: materialized view for dashboards
CREATE OR REFRESH MATERIALIZED VIEW gold_daily_summary
AS SELECT
  order_date,
  COUNT(*) AS order_count,
  SUM(amount) AS total_revenue
FROM LIVE.silver_orders
GROUP BY order_date;
```
The pipeline engine automatically:
- Determines execution order from `LIVE.` references
- Handles incremental processing (streaming tables)
- Enforces expectations and logs quality metrics
- Retries on transient failures
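The `EXPECT ... ON VIOLATION DROP ROW` behavior can be sketched in plain Python: each constraint is a predicate, failing rows are dropped, and violations are counted per constraint. The constraint names and sample rows mirror the SQL above; this is an illustration of the semantics, not the engine's implementation:

```python
# Illustrative sketch of expectation semantics.
rows = [
    {"order_id": 1, "amount": 25.00},
    {"order_id": None, "amount": 10.00},  # violates valid_id
    {"order_id": 3, "amount": -5.00},     # violates positive_amount
]

expectations = {
    "valid_id": lambda r: r["order_id"] is not None,
    "positive_amount": lambda r: r["amount"] > 0,
}

kept, violations = [], {name: 0 for name in expectations}
for row in rows:
    failed = [name for name, check in expectations.items() if not check(row)]
    for name in failed:
        violations[name] += 1
    if not failed:
        kept.append(row)  # DROP ROW: only fully valid rows pass through

print(kept)        # only the first row survives
print(violations)  # {'valid_id': 1, 'positive_amount': 1}
```

In a real pipeline these per-constraint counts are what surface in the event log and quality metrics dashboard.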
Exam decision tree: notebook or declarative?
- Standard bronze → silver → gold ETL? → Declarative Pipeline
- Need custom Python ML models in the pipeline? → Notebook
- Need built-in data quality metrics? → Declarative Pipeline
- Need to call external APIs during processing? → Notebook
- Want automatic dependency management? → Declarative Pipeline
- Need complex control flow (if/else branching)? → Notebook
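For exam practice, the decision tree can be encoded as a small helper. The requirement keywords are taken from the bullets above; the function itself is illustrative:

```python
# Illustrative encoding of the decision tree: any requirement that
# demands custom logic pushes toward a notebook pipeline.
NOTEBOOK_SIGNALS = {"custom ML", "external APIs", "complex control flow"}

def choose_approach(requirements):
    """Return 'Notebook' if any requirement needs custom logic,
    otherwise 'Declarative Pipeline'."""
    if NOTEBOOK_SIGNALS & set(requirements):
        return "Notebook"
    return "Declarative Pipeline"

print(choose_approach({"standard ETL", "data quality metrics"}))
# Declarative Pipeline
print(choose_approach({"standard ETL", "external APIs"}))
# Notebook
```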
Knowledge check
Dr. Sarah Okafor needs to build a standard bronze → silver → gold ETL pipeline for Athena Group. The pipeline should have built-in data quality checks and automatic dependency management. Which approach should she use?
Next up: Lakeflow Jobs: Create & Configure — creating, configuring, and triggering Lakeflow Jobs.