Data Quality & Schema Enforcement
Implement validation checks, data type verification, schema enforcement, schema drift management, and pipeline expectations — keeping your lakehouse trustworthy.
Data quality in the lakehouse
Data quality is the food safety inspection for your data kitchen.
Before serving food (data) to customers (analysts, dashboards), you check: Are ingredients fresh (not null)? Are measurements correct (right data types)? Is the recipe followed (schema matches)? Any contamination (invalid values)?
Without quality checks, bad data silently flows to dashboards. Nobody notices until the CEO asks why revenue is negative.
Validation checks
Nullability checks
```sql
-- Check for unexpected nulls
SELECT COUNT(*) AS null_count
FROM orders
WHERE customer_id IS NULL;

-- Enforce NOT NULL with a CHECK constraint
ALTER TABLE silver.orders ADD CONSTRAINT valid_customer
  CHECK (customer_id IS NOT NULL);
```
Cardinality checks
Cardinality = the number of distinct values. A common cardinality-style check is referential integrity: verify that every foreign key actually exists in the referenced table:
```sql
-- Referential integrity: all order customer_ids exist in customers
SELECT o.customer_id
FROM orders o
LEFT ANTI JOIN customers c ON o.customer_id = c.customer_id;
-- Should return 0 rows
```
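The anti-join semantics can be modeled in plain Python with set membership — a hypothetical sketch with made-up rows, not the Spark implementation:

```python
# Model of the LEFT ANTI JOIN above: find order rows whose customer_id
# has no match in the customers table. All data here is hypothetical.
orders = [
    {"order_id": 1, "customer_id": "C1"},
    {"order_id": 2, "customer_id": "C9"},  # no matching customer
]
customers = [{"customer_id": "C1"}, {"customer_id": "C2"}]

known_ids = {c["customer_id"] for c in customers}
orphans = [o for o in orders if o["customer_id"] not in known_ids]
print(orphans)  # non-empty output means orphaned foreign keys
```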
Range checks
```sql
-- Business rule: order amounts must be positive and under $1M
ALTER TABLE silver.orders ADD CONSTRAINT valid_amount
  CHECK (amount > 0 AND amount < 1000000);

-- Date range: no future orders
ALTER TABLE silver.orders ADD CONSTRAINT valid_date
  CHECK (order_date <= CURRENT_DATE());
```
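The same two range rules can be expressed as a small pure-Python validator — an illustrative sketch (function name and sample rows are hypothetical), useful for testing rules before encoding them as constraints:

```python
from datetime import date

def range_violations(rows, today):
    """Return (order_id, reason) pairs for rows breaking the range rules."""
    bad = []
    for row in rows:
        if not (0 < row["amount"] < 1_000_000):
            bad.append((row["order_id"], "amount out of range"))
        if row["order_date"] > today:
            bad.append((row["order_id"], "future order_date"))
    return bad

rows = [
    {"order_id": 1, "amount": 50.0, "order_date": date(2024, 1, 2)},
    {"order_id": 2, "amount": -5.0, "order_date": date(2030, 1, 1)},
]
print(range_violations(rows, today=date(2024, 6, 1)))
```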
Schema enforcement and drift
Schema enforcement (default)
Delta Lake rejects writes that don’t match the table schema by default:
```python
# This FAILS if new_df has columns not in the target schema
new_df.write.format("delta").mode("append").saveAsTable("silver.orders")
# Error: "A schema mismatch detected when writing to the Delta table"
```
Schema evolution (opt-in)
When source schemas legitimately change (new columns added):
```python
# Allow schema evolution for this write
new_df.write \
    .format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .saveAsTable("silver.orders")
```
```sql
-- Or enable automatic schema merging session-wide (affects all writes
-- in the session, so use with care)
SET spark.databricks.delta.schema.autoMerge.enabled = true;
```
Managing schema drift
| Scenario | Strategy |
|---|---|
| Source adds a new column | mergeSchema = true (schema evolution) |
| Source removes a column | Log a warning, investigate — may indicate source issue |
| Source changes a column type | Fail the pipeline — type changes need manual review |
| Source renames a column | Enable column mapping mode (`delta.columnMapping.mode = 'name'`) |
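The drift strategies in the table can be sketched as a classifier over two schemas, here represented as `{column_name: type_name}` dicts — a hypothetical helper, not a Delta Lake API:

```python
def classify_drift(target, incoming):
    """Compare target vs incoming schema and suggest a strategy per column."""
    actions = []
    for col in incoming.keys() - target.keys():
        actions.append((col, "added", "evolve with mergeSchema"))
    for col in target.keys() - incoming.keys():
        actions.append((col, "removed", "warn and investigate"))
    for col in incoming.keys() & target.keys():
        if incoming[col] != target[col]:
            actions.append((col, "type_changed", "fail for manual review"))
    return sorted(actions)

target = {"order_id": "bigint", "amount": "double", "region": "string"}
incoming = {"order_id": "bigint", "amount": "string", "fraud_score": "double"}
print(classify_drift(target, incoming))
```

Running such a check before the write lets the pipeline apply the table's strategies deliberately instead of discovering drift through a failed job.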
Exam tip: enforcement vs evolution
- Schema enforcement (default) = REJECT writes that don’t match schema → protects data integrity
- Schema evolution (mergeSchema) = ACCEPT new columns → flexibility for changing sources
Exam pattern: If the question says “prevent bad data from entering” → enforcement (default). If it says “accommodate new source columns” → evolution (mergeSchema).
Pipeline expectations
In Lakeflow Spark Declarative Pipelines, expectations are declarative quality constraints:
```sql
CREATE OR REFRESH STREAMING TABLE silver_orders (
  -- Warn but keep the row (metrics only)
  CONSTRAINT valid_customer EXPECT (customer_id IS NOT NULL),
  -- Drop rows that violate
  CONSTRAINT positive_amount EXPECT (amount > 0) ON VIOLATION DROP ROW,
  -- Fail the entire pipeline update
  CONSTRAINT valid_date EXPECT (order_date <= CURRENT_DATE()) ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM STREAM(LIVE.bronze_orders);
```
| Violation Action | Behaviour | Use Case |
|---|---|---|
| (no action) | Keep row, log violation metric | Monitor quality without blocking |
| DROP ROW | Drop the bad row; the violation is still logged in metrics | Filter known bad data patterns |
| FAIL UPDATE | Stop the pipeline entirely | Critical constraints (e.g., duplicate primary keys) |
Expectations are visible in the pipeline’s event log and quality metrics dashboard — you can track the percentage of rows passing each constraint over time.
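As a mental model of the three violation actions — a toy sketch, not the pipeline runtime — each expectation is a named predicate plus an action, applied row by row:

```python
# Toy model of expectation semantics: warn-only counts the violation but
# keeps the row, DROP filters the row, FAIL aborts the whole batch.
WARN, DROP, FAIL = "warn", "drop", "fail"

def apply_expectations(rows, expectations):
    metrics = {name: 0 for name, _, _ in expectations}
    kept = []
    for row in rows:
        keep = True
        for name, predicate, action in expectations:
            if not predicate(row):
                metrics[name] += 1
                if action == FAIL:
                    raise RuntimeError(f"expectation {name} failed the update")
                if action == DROP:
                    keep = False
        if keep:
            kept.append(row)
    return kept, metrics

expectations = [
    ("valid_customer", lambda r: r["customer_id"] is not None, WARN),
    ("positive_amount", lambda r: r["amount"] > 0, DROP),
]
rows = [
    {"customer_id": "C1", "amount": 10.0},
    {"customer_id": None, "amount": 5.0},   # warned but kept
    {"customer_id": "C2", "amount": -1.0},  # dropped
]
kept, metrics = apply_expectations(rows, expectations)
print(len(kept), metrics)
```

The `metrics` dict mirrors the per-constraint counts surfaced in the event log and quality dashboard.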
Knowledge check
Tomás discovers that NovaPay's source system started sending a new 'fraud_score' column that doesn't exist in the silver.transactions table. The pipeline fails. He wants new columns to be accepted automatically. What should he enable?
Mei Lin is building a Declarative Pipeline for Freshmart. She wants to: (1) log a warning if quantity is negative (but keep the row), (2) drop rows where store_id is NULL, and (3) fail the pipeline if any order_id is duplicated. Which expectations should she configure?
Next up: Building Data Pipelines — designing and implementing data pipelines with notebooks and Lakeflow Spark Declarative Pipelines.