Data Quality & Schema Enforcement
Implement validation checks, data type verification, schema enforcement, schema drift management, and pipeline expectations — keeping your lakehouse trustworthy.
Data quality in the lakehouse
Data quality is the food safety inspection for your data kitchen.
Before serving food (data) to customers (analysts, dashboards), you check: Are ingredients fresh (not null)? Are measurements correct (right data types)? Is the recipe followed (schema matches)? Any contamination (invalid values)?
Without quality checks, bad data silently flows to dashboards. Nobody notices until the CEO asks why revenue is negative.
Validation checks
Nullability checks
```sql
-- Check for unexpected nulls
SELECT COUNT(*) AS null_count
FROM orders
WHERE customer_id IS NULL;

-- Enforce NOT NULL with a CHECK constraint
ALTER TABLE silver.orders ADD CONSTRAINT valid_customer
  CHECK (customer_id IS NOT NULL);
```
Cardinality checks
Cardinality = the number of distinct values. A common cardinality-style check is referential integrity: verify that every foreign key actually exists in the referenced table:
```sql
-- Referential integrity: all order customer_ids exist in customers
SELECT o.customer_id
FROM orders o
LEFT ANTI JOIN customers c ON o.customer_id = c.customer_id;
-- Should return 0 rows
```
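The anti-join semantics can be modeled in plain Python with set membership — a hypothetical sketch with made-up rows, not the Spark implementation:

```python
# Model of the LEFT ANTI JOIN above: find order rows whose customer_id
# has no match in the customers table. All data here is hypothetical.
orders = [
    {"order_id": 1, "customer_id": "C1"},
    {"order_id": 2, "customer_id": "C9"},  # no matching customer
]
customers = [{"customer_id": "C1"}, {"customer_id": "C2"}]

known_ids = {c["customer_id"] for c in customers}
orphans = [o for o in orders if o["customer_id"] not in known_ids]
print(orphans)  # non-empty output means orphaned foreign keys
```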
Range checks
```sql
-- Business rule: order amounts must be positive and under $1M
ALTER TABLE silver.orders ADD CONSTRAINT valid_amount
  CHECK (amount > 0 AND amount < 1000000);

-- Date range: no future orders
ALTER TABLE silver.orders ADD CONSTRAINT valid_date
  CHECK (order_date <= CURRENT_DATE());
```
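The same two range rules can be expressed as a small pure-Python validator — an illustrative sketch (function name and sample rows are hypothetical), useful for testing rules before encoding them as constraints:

```python
from datetime import date

def range_violations(rows, today):
    """Return (order_id, reason) pairs for rows breaking the range rules."""
    bad = []
    for row in rows:
        if not (0 < row["amount"] < 1_000_000):
            bad.append((row["order_id"], "amount out of range"))
        if row["order_date"] > today:
            bad.append((row["order_id"], "future order_date"))
    return bad

rows = [
    {"order_id": 1, "amount": 50.0, "order_date": date(2024, 1, 2)},
    {"order_id": 2, "amount": -5.0, "order_date": date(2030, 1, 1)},
]
print(range_violations(rows, today=date(2024, 6, 1)))
```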
Schema enforcement and drift
Schema enforcement (default)
Delta Lake rejects writes that don’t match the table schema by default:
```python
# This FAILS if new_df has columns not in the target schema
new_df.write.format("delta").mode("append").saveAsTable("silver.orders")
# Error: "A schema mismatch detected when writing to the Delta table"
```
Schema evolution (opt-in)
When source schemas legitimately change (new columns added):
```python
# Allow schema evolution for this write
new_df.write \
    .format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .saveAsTable("silver.orders")
```
```sql
-- Or enable automatic schema merging session-wide (affects all writes
-- in the session, so use with care)
SET spark.databricks.delta.schema.autoMerge.enabled = true;
```
Managing schema drift
| Scenario | Strategy |
|---|---|
| Source adds a new column | mergeSchema = true (schema evolution) |
| Source removes a column | Log a warning, investigate — may indicate source issue |
| Source changes a column type | Fail the pipeline — type changes need manual review |
| Source renames a column | Enable column mapping mode (`delta.columnMapping.mode = 'name'`) |
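The drift strategies in the table can be sketched as a classifier over two schemas, here represented as `{column_name: type_name}` dicts — a hypothetical helper, not a Delta Lake API:

```python
def classify_drift(target, incoming):
    """Compare target vs incoming schema and suggest a strategy per column."""
    actions = []
    for col in incoming.keys() - target.keys():
        actions.append((col, "added", "evolve with mergeSchema"))
    for col in target.keys() - incoming.keys():
        actions.append((col, "removed", "warn and investigate"))
    for col in incoming.keys() & target.keys():
        if incoming[col] != target[col]:
            actions.append((col, "type_changed", "fail for manual review"))
    return sorted(actions)

target = {"order_id": "bigint", "amount": "double", "region": "string"}
incoming = {"order_id": "bigint", "amount": "string", "fraud_score": "double"}
print(classify_drift(target, incoming))
```

Running such a check before the write lets the pipeline apply the table's strategies deliberately instead of discovering drift through a failed job.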
Exam tip: enforcement vs evolution
- Schema enforcement (default) = REJECT writes that don’t match schema → protects data integrity
- Schema evolution (mergeSchema) = ACCEPT new columns → flexibility for changing sources
Exam pattern: If the question says “prevent bad data from entering” → enforcement (default). If it says “accommodate new source columns” → evolution (mergeSchema).
Pipeline expectations
In Lakeflow Spark Declarative Pipelines, expectations are declarative quality constraints:
```sql
CREATE OR REFRESH STREAMING TABLE silver_orders (
  -- Warn but keep the row (metrics only)
  CONSTRAINT valid_customer EXPECT (customer_id IS NOT NULL),
  -- Drop rows that violate
  CONSTRAINT positive_amount EXPECT (amount > 0) ON VIOLATION DROP ROW,
  -- Fail the entire pipeline update
  CONSTRAINT valid_date EXPECT (order_date <= CURRENT_DATE()) ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM STREAM(LIVE.bronze_orders);
```
| Violation Action | Behaviour | Use Case |
|---|---|---|
| (no action) | Keep row, log violation metric | Monitor quality without blocking |
| DROP ROW | Drop the bad row; the violation is still logged in metrics | Filter known bad data patterns |
| FAIL UPDATE | Stop the pipeline entirely | Critical constraints (e.g., duplicate primary keys) |
Expectations are visible in the pipeline’s event log and quality metrics dashboard — you can track the percentage of rows passing each constraint over time.
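As a mental model of the three violation actions — a toy sketch, not the pipeline runtime — each expectation is a named predicate plus an action, applied row by row:

```python
# Toy model of expectation semantics: warn-only counts the violation but
# keeps the row, DROP filters the row, FAIL aborts the whole batch.
WARN, DROP, FAIL = "warn", "drop", "fail"

def apply_expectations(rows, expectations):
    metrics = {name: 0 for name, _, _ in expectations}
    kept = []
    for row in rows:
        keep = True
        for name, predicate, action in expectations:
            if not predicate(row):
                metrics[name] += 1
                if action == FAIL:
                    raise RuntimeError(f"expectation {name} failed the update")
                if action == DROP:
                    keep = False
        if keep:
            kept.append(row)
    return kept, metrics

expectations = [
    ("valid_customer", lambda r: r["customer_id"] is not None, WARN),
    ("positive_amount", lambda r: r["amount"] > 0, DROP),
]
rows = [
    {"customer_id": "C1", "amount": 10.0},
    {"customer_id": None, "amount": 5.0},   # warned but kept
    {"customer_id": "C2", "amount": -1.0},  # dropped
]
kept, metrics = apply_expectations(rows, expectations)
print(len(kept), metrics)
```

The `metrics` dict mirrors the per-constraint counts surfaced in the event log and quality dashboard.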
Knowledge check
Tomás discovers that NovaPay's source system started sending a new 'fraud_score' column that doesn't exist in the silver.transactions table. The pipeline fails. He wants new columns to be accepted automatically. What should he enable?
Mei Lin is building a Declarative Pipeline for Freshmart. She wants to: (1) log a warning if quantity is negative (but keep the row), (2) drop rows where store_id is NULL, and (3) fail the pipeline if any order_id is duplicated. Which expectations should she configure?
Next up: Building Data Pipelines — designing and implementing data pipelines with notebooks and Lakeflow Spark Declarative Pipelines.