
DP-750 Study Guide

Domain 1: Set Up and Configure an Azure Databricks Environment

  • Azure Databricks: Your Lakehouse Platform Free
  • Choosing the Right Compute Free
  • Configuring Compute for Performance Free
  • Unity Catalog: The Three-Level Namespace Free
  • Tables, Views & External Catalogs Free

Domain 2: Secure and Govern Unity Catalog Objects

  • Securing Unity Catalog: Who Gets What
  • Secrets & Authentication
  • Data Discovery & Attribute-Based Access
  • Row Filters, Column Masks & Retention
  • Lineage, Audit Logs & Delta Sharing

Domain 3: Prepare and Process Data

  • Data Modeling: Ingestion Design Free
  • SCD, Granularity & Temporal Tables
  • Partitioning, Clustering & Table Optimization
  • Ingesting Data: Lakeflow Connect & Notebooks
  • Ingesting Data: SQL Methods & CDC
  • Streaming Ingestion: Structured Streaming & Event Hubs
  • Auto Loader & Declarative Pipelines
  • Cleansing & Profiling Data Free
  • Transforming & Loading Data
  • Data Quality & Schema Enforcement

Domain 4: Deploy and Maintain Data Pipelines and Workloads

  • Building Data Pipelines Free
  • Lakeflow Jobs: Create & Configure
  • Lakeflow Jobs: Schedule, Alerts & Recovery
  • Git & Version Control
  • Testing & Databricks Asset Bundles
  • Monitoring Clusters & Troubleshooting
  • Spark Performance: DAG & Query Profile
  • Optimizing Delta Tables & Azure Monitor

Domain 3: Prepare and Process Data Premium ⏱ ~13 min read

Data Quality & Schema Enforcement

Implement validation checks, data type verification, schema enforcement, schema drift management, and pipeline expectations — keeping your lakehouse trustworthy.

Data quality in the lakehouse

☕ Simple explanation

Data quality is the food safety inspection for your data kitchen.

Before serving food (data) to customers (analysts, dashboards), you check: Are ingredients fresh (not null)? Are measurements correct (right data types)? Is the recipe followed (schema matches)? Any contamination (invalid values)?

Without quality checks, bad data silently flows to dashboards. Nobody notices until the CEO asks why revenue is negative.

Data quality in Azure Databricks involves four layers: validation checks (nullability, cardinality, range), data type enforcement, schema enforcement/evolution (Delta Lake’s built-in protection), and pipeline expectations (declarative quality constraints in Lakeflow Spark Declarative Pipelines).

Validation checks

Nullability checks

-- Check for unexpected nulls
SELECT COUNT(*) AS null_count
FROM orders WHERE customer_id IS NULL;

-- Enforce NOT NULL with CHECK constraints
ALTER TABLE silver.orders ADD CONSTRAINT valid_customer
  CHECK (customer_id IS NOT NULL);
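The same logic can be sketched outside SQL. Here is a minimal, framework-free Python sketch of a nullability check; the sample rows are invented for illustration and stand in for the orders table:

```python
# Hypothetical sample rows standing in for the orders table.
orders = [
    {"order_id": 1, "customer_id": "C1"},
    {"order_id": 2, "customer_id": None},   # explicit null
    {"order_id": 3},                        # column missing entirely
]

# Nullability check: count rows where the required key is absent or None.
null_count = sum(1 for row in orders if row.get("customer_id") is None)
print(null_count)  # 2 rows violate the NOT NULL rule
```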

Cardinality checks

Cardinality checks verify the number of distinct values and the relationships between tables. A common example is referential integrity: confirm that every foreign key actually exists in the referenced table:

-- Referential integrity: all order customer_ids exist in customers
SELECT o.customer_id
FROM orders o
LEFT ANTI JOIN customers c ON o.customer_id = c.customer_id;
-- Should return 0 rows
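The anti-join semantics are easy to illustrate in plain Python; this sketch uses invented customer and order ids, and mirrors what the SQL above does: keep left-side rows with no match on the right.

```python
# Hypothetical reference data: the set of known customer ids.
customers = {"C1", "C2", "C3"}

# Orders as (order_id, customer_id) pairs; "C9" is an invented orphan key.
orders = [("O1", "C1"), ("O2", "C9"), ("O3", "C2")]

# LEFT ANTI JOIN semantics: keep left rows with NO match on the right.
orphans = [order_id for order_id, cust_id in orders if cust_id not in customers]
print(orphans)  # a healthy table yields an empty list
```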

Range checks

-- Business rule: order amounts must be positive and under $1M
ALTER TABLE silver.orders ADD CONSTRAINT valid_amount
  CHECK (amount > 0 AND amount < 1000000);

-- Date range: no future orders
ALTER TABLE silver.orders ADD CONSTRAINT valid_date
  CHECK (order_date <= CURRENT_DATE());
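The range rules can also be expressed as plain predicates, which is handy for unit-testing quality logic before baking it into constraints. A minimal sketch; the thresholds mirror the business rules in the SQL above, and the fixed "today" keeps the example deterministic:

```python
from datetime import date

# Illustrative predicates mirroring the CHECK constraints above.
def valid_amount(amount: float) -> bool:
    # Positive and under $1M, matching the valid_amount constraint.
    return 0 < amount < 1_000_000

def valid_order_date(order_date: date, today: date) -> bool:
    # No future orders, matching the valid_date constraint.
    return order_date <= today

today = date(2026, 1, 15)  # fixed "current date" for a deterministic example
print(valid_amount(49.99))                         # True
print(valid_amount(-5))                            # False: negative amount
print(valid_order_date(date(2026, 2, 1), today))   # False: future order
```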

Schema enforcement and drift

Schema enforcement (default)

Delta Lake rejects writes that don’t match the table schema by default:

# This FAILS if new_df has columns not in the target schema
new_df.write.format("delta").mode("append").saveAsTable("silver.orders")
# Error: "A schema mismatch detected when writing to the Delta table"

Schema evolution (opt-in)

When source schemas legitimately change (new columns added):

# Allow schema evolution for this write
new_df.write \
    .option("mergeSchema", "true") \
    .format("delta").mode("append") \
    .saveAsTable("silver.orders")

# Or enable for the whole Spark session (a session conf, not a table property)
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

Managing schema drift

  • Source adds a new column → mergeSchema = true (schema evolution)
  • Source removes a column → log a warning and investigate; this may indicate a source issue
  • Source changes a column type → fail the pipeline; type changes need manual review
  • Source renames a column → column mapping mode required

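The drift-handling strategies above amount to a schema diff. A minimal sketch, assuming schemas are plain name-to-type mappings (the column names and types are invented for illustration):

```python
# Hypothetical target and incoming schemas as name -> type mappings.
target = {"order_id": "bigint", "customer_id": "string", "amount": "double"}
incoming = {"order_id": "bigint", "customer_id": "string",
            "amount": "decimal(18,2)", "fraud_score": "double"}

added = sorted(set(incoming) - set(target))      # candidate for mergeSchema
removed = sorted(set(target) - set(incoming))    # warn and investigate
retyped = sorted(c for c in target.keys() & incoming.keys()
                 if target[c] != incoming[c])    # fail: needs manual review

print(added, removed, retyped)
```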
💡 Exam tip: enforcement vs evolution
  • Schema enforcement (default) = REJECT writes that don’t match schema → protects data integrity
  • Schema evolution (mergeSchema) = ACCEPT new columns → flexibility for changing sources

Exam pattern: If the question says “prevent bad data from entering” → enforcement (default). If it says “accommodate new source columns” → evolution (mergeSchema).

Pipeline expectations

In Lakeflow Spark Declarative Pipelines, expectations are declarative quality constraints:

CREATE OR REFRESH STREAMING TABLE silver_orders (
  -- Warn but keep the row (metrics only)
  CONSTRAINT valid_customer EXPECT (customer_id IS NOT NULL),

  -- Drop rows that violate
  CONSTRAINT positive_amount EXPECT (amount > 0) ON VIOLATION DROP ROW,

  -- Fail the entire pipeline update
  CONSTRAINT valid_date EXPECT (order_date <= CURRENT_DATE()) ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM STREAM(LIVE.bronze_orders);

  • (no action) → keep the row and log a violation metric; use when monitoring quality without blocking
  • DROP ROW → remove the bad row silently; use to filter known bad-data patterns
  • FAIL UPDATE → stop the pipeline update entirely; use for critical constraints (e.g., duplicate primary keys)

Expectations are visible in the pipeline’s event log and quality metrics dashboard — you can track the percentage of rows passing each constraint over time.
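The three actions can be simulated outside a pipeline to make the behaviour concrete. This is an illustrative sketch, not a Databricks API; the rows, constraint names, and predicates are invented:

```python
# Simulate expectation handling: 'warn' = no action, 'drop' = DROP ROW,
# 'fail' = FAIL UPDATE. Returns surviving rows plus violation counts,
# roughly mirroring the pipeline's quality metrics.
def apply_expectations(rows, expectations):
    kept, violations = [], {name: 0 for name, _, _ in expectations}
    for row in rows:
        drop = False
        for name, predicate, action in expectations:
            if not predicate(row):
                violations[name] += 1
                if action == "fail":
                    raise RuntimeError(f"expectation '{name}' violated; update failed")
                if action == "drop":
                    drop = True
        if not drop:
            kept.append(row)
    return kept, violations

rows = [
    {"customer_id": "C1", "amount": 10.0},
    {"customer_id": None, "amount": 5.0},    # warned, but kept
    {"customer_id": "C2", "amount": -3.0},   # dropped
]
expectations = [
    ("valid_customer", lambda r: r["customer_id"] is not None, "warn"),
    ("positive_amount", lambda r: r["amount"] > 0, "drop"),
]
kept, violations = apply_expectations(rows, expectations)
print(len(kept), violations)  # 2 {'valid_customer': 1, 'positive_amount': 1}
```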

Question

What is the difference between schema enforcement and schema evolution in Delta Lake?


Answer

Schema enforcement (default): rejects writes with mismatched schemas, protecting integrity. Schema evolution (mergeSchema=true): accepts new columns, allowing flexibility. Enforcement protects; evolution adapts.


Question

What are the three violation actions for pipeline expectations?


Answer

No action (keep row, log metric — monitoring), DROP ROW (remove bad rows silently — filtering), FAIL UPDATE (stop pipeline — critical constraints). Choose based on how critical the quality rule is.


Question

What types of validation checks should you implement?


Answer

Nullability (NOT NULL on required columns), cardinality (foreign key existence), range (amount > 0, date not future), data type (correct CAST/conversion), uniqueness (no duplicate primary keys).



Knowledge check

Tomás discovers that NovaPay's source system started sending a new 'fraud_score' column that doesn't exist in the silver.transactions table. The pipeline fails. He wants new columns to be accepted automatically. What should he enable?

Knowledge Check

Mei Lin is building a Declarative Pipeline for Freshmart. She wants to: (1) log a warning if quantity is negative (but keep the row), (2) drop rows where store_id is NULL, and (3) fail the pipeline if any order_id is duplicated. Which expectations should she configure?


Next up: Building Data Pipelines — designing and implementing data pipelines with notebooks and Lakeflow Spark Declarative Pipelines.



© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.