
DP-600 Study Guide

Domain 1: Maintain a Data Analytics Solution

  • Workspace Access Controls
  • Row-Level & Object-Level Security
  • Sensitivity Labels & Endorsement
  • Git Version Control in Fabric
  • Deployment Pipelines: Dev β†’ Test β†’ Prod
  • Impact Analysis & Dependencies
  • XMLA Endpoint & Reusable Assets

Domain 2: Prepare Data

  • Microsoft Fabric: The Big Picture Free
  • Lakehouses: Your Data Foundation Free
  • Warehouses in Fabric Free
  • Choosing the Right Data Store Free
  • Data Connections & OneLake Catalog
  • Shortcuts & OneLake Integration
  • Ingesting Data: Dataflows Gen2 & Pipelines
  • Star Schema Design Free
  • SQL Objects: Views, Functions & Stored Procedures
  • Transforming Data: Reshape & Enrich
  • Data Quality & Cleansing
  • Querying with SQL
  • Querying with KQL
  • Querying with DAX

Domain 3: Implement and Manage Semantic Models

  • Semantic Models: Storage Modes
  • Relationships & Advanced Modeling
  • DAX Essentials: Variables & Functions
  • Calculation Groups & Field Parameters
  • Large Models & Composite Models
  • Direct Lake Mode
  • DAX Performance Optimization
  • Incremental Refresh

Domain 2: Prepare Data · Premium · ⏱ ~13 min read

Data Quality & Cleansing

Dirty data ruins reports. Learn how to find and fix duplicates, missing values, type mismatches, and inconsistencies β€” with SQL and PySpark examples.

Why does data quality matter?

β˜• Simple explanation

Think of data quality like hygiene in a restaurant kitchen.

If your ingredients are contaminated (duplicate records), mislabelled (wrong data types), or missing key components (null values), the meal you serve (your report) will make people sick (wrong decisions). No amount of fancy cooking (DAX measures) can fix bad ingredients.

Data cleansing happens in the Silver layer of the medallion architecture β€” between raw ingestion (Bronze) and business-ready analytics (Gold). Get this right, and everything downstream works. Get it wrong, and nothing is trustworthy.

Data quality issues β€” duplicates, nulls, type mismatches, invalid values, and inconsistent formats β€” are the most common source of incorrect analytics. The DP-600 exam tests your ability to identify and resolve these issues using SQL, PySpark, and Dataflows Gen2.

In Fabric, data quality is typically enforced at the Silver layer: raw data lands in Bronze tables as-is, Silver tables apply validation rules, and Gold tables contain only clean, trusted data. Delta Lake’s schema enforcement provides a first line of defence, but it does not catch semantic issues (e.g., a valid integer that represents an impossible age).
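Because schema enforcement only checks types and columns, semantic rules have to be coded explicitly at the Silver layer. A minimal plain-Python sketch of the idea (the column names, the 0–120 age range, and the sample rows are illustrative, not from any Fabric API):

```python
# Hypothetical Silver-layer validation: a value can pass schema enforcement
# (it is a valid integer) yet still be semantically impossible (age = 212).
def validate_patient_row(row: dict) -> list:
    """Return a list of rule violations; an empty list means the row is clean."""
    errors = []
    if row.get("patient_id") is None:
        errors.append("patient_id is required")
    age = row.get("age")
    if age is not None and not (0 <= age <= 120):
        errors.append(f"age {age} outside plausible range 0-120")
    return errors

rows = [
    {"patient_id": "P001", "age": 42},
    {"patient_id": "P002", "age": 212},  # valid int, impossible age
]
clean = [r for r in rows if not validate_patient_row(r)]
rejected = [r for r in rows if validate_patient_row(r)]
```

In a real pipeline the same rule would typically run as a PySpark filter or a Dataflows Gen2 step, with rejected rows routed to a quarantine table rather than silently dropped.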

Finding and removing duplicates

Detect duplicates (SQL)

-- Find duplicate transactions (same store, product, date, amount)
SELECT store_key, product_key, date_key, total_amount, COUNT(*) AS dup_count
FROM dbo.fact_sales
GROUP BY store_key, product_key, date_key, total_amount
HAVING COUNT(*) > 1;
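The same GROUP BY / HAVING logic can be sketched in plain Python with `collections.Counter` (the sample rows are made up for illustration):

```python
from collections import Counter

# Count occurrences of each (store, product, date, amount) combination;
# any key with count > 1 is a duplicate group -- same idea as HAVING COUNT(*) > 1.
rows = [
    ("S1", "P1", "2026-01-05", 19.99),
    ("S1", "P1", "2026-01-05", 19.99),  # duplicate
    ("S2", "P3", "2026-01-06", 5.00),
]
counts = Counter(rows)
duplicates = {key: n for key, n in counts.items() if n > 1}
```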

Remove duplicates (SQL β€” keep latest)

WITH ranked AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY store_key, product_key, date_key, total_amount
            ORDER BY ingestion_timestamp DESC
        ) AS rn
    FROM dbo.fact_sales
)
DELETE FROM ranked WHERE rn > 1;

Remove duplicates (PySpark)

# Drop exact duplicates
df_deduped = df_raw.dropDuplicates()

# Drop duplicates based on specific columns (an arbitrary row survives per
# group -- row order is not guaranteed in a distributed job)
df_deduped = df_raw.dropDuplicates(["store_id", "product_id", "transaction_date", "amount"])
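The ROW_NUMBER() keep-latest pattern boils down to: group by the key columns, keep the row with the newest timestamp. A plain-Python sketch of that logic (sample data and field names are illustrative):

```python
# Keep the latest row per business key, mirroring
# ROW_NUMBER() OVER (PARTITION BY key ORDER BY ingestion_timestamp DESC) = 1.
rows = [
    {"store_id": "S1", "product_id": "P1", "amount": 10.0, "ingested_at": "2026-01-01T08:00"},
    {"store_id": "S1", "product_id": "P1", "amount": 10.0, "ingested_at": "2026-01-02T09:00"},
    {"store_id": "S2", "product_id": "P2", "amount": 7.5,  "ingested_at": "2026-01-01T08:00"},
]

latest = {}
for row in rows:
    key = (row["store_id"], row["product_id"], row["amount"])
    # ISO-8601 timestamps compare correctly as strings
    if key not in latest or row["ingested_at"] > latest[key]["ingested_at"]:
        latest[key] = row

deduped = list(latest.values())
```

To keep the earliest row instead (as in the scenario below with original lab submissions), flip the comparison to `<`.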
πŸ’‘ Scenario: Dr. Sarah's duplicate patient records

Dr. Sarah at Pacific Health discovers that patient lab results are sometimes submitted twice β€” once by the lab system and once by a manual upload. The duplicate records have the same patient ID, test date, and test type, but different submission timestamps.

She uses ROW_NUMBER() to rank duplicates by submission timestamp and keeps only the earliest record (the original lab submission). This removes 3.2% of records β€” a significant error rate that was skewing average test results.

Handling null and missing values

Null values need a strategy β€” ignore, replace, or remove.

Common null strategies

Strategy             | When to use                                         | SQL example
---------------------|-----------------------------------------------------|------------------------------------
Replace with default | The column must have a value for calculations       | ISNULL(discount, 0)
Replace with average | Statistical analysis where nulls would skew results | COALESCE(price, AVG(price) OVER ())
Remove the row       | The null makes the entire record unusable           | WHERE amount IS NOT NULL
Flag as unknown      | Preserve the row but indicate missing data          | ISNULL(region, 'Unknown')

PySpark null handling

from pyspark.sql.functions import col, when, lit

# Replace nulls with a default
df_clean = df.fillna({"discount": 0, "region": "Unknown"})

# Drop rows where critical columns are null
df_clean = df.dropna(subset=["patient_id", "test_date"])

# Replace nulls conditionally
df_clean = df.withColumn(
    "unit_price",
    when(col("unit_price").isNull(), lit(0)).otherwise(col("unit_price"))
)

Converting data types

Data arriving from CSV files or APIs often has wrong types β€” dates stored as strings, numbers stored as text.

SQL type conversion

-- Convert string dates to proper DATE type
SELECT
    CAST(date_string AS DATE) AS transaction_date,
    CAST(amount_string AS DECIMAL(18,2)) AS amount,
    CAST(quantity_string AS INT) AS quantity
FROM staging.raw_transactions;

PySpark type conversion

from pyspark.sql.functions import to_date, col

df_typed = df_raw \
    .withColumn("transaction_date", to_date(col("date_string"), "yyyy-MM-dd")) \
    .withColumn("amount", col("amount_string").cast("decimal(18,2)")) \
    .withColumn("quantity", col("quantity_string").cast("int"))
πŸ’‘ Exam tip: Type conversion gotchas

Watch for these traps:

  • String to date: Different source systems use different formats (MM/dd/yyyy vs dd/MM/yyyy vs yyyy-MM-dd). Always specify the format explicitly.
  • String to number: Commas as thousand separators (1,000.50) or periods as decimal separators (1.000,50 in European formats) cause conversion failures.
  • Implicit conversion risks: Comparing a string column to a number without explicit CAST can give unexpected results or NULL.
  • Delta schema enforcement: mergeSchema allows adding new columns during writes. Changing a column’s data type typically requires an explicit cast-and-rewrite operation β€” mergeSchema alone does not handle type changes.
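The first two gotchas are easy to demonstrate in plain Python (the dates and amounts below are made up for illustration):

```python
from datetime import datetime

# The same string means different dates depending on the declared format:
# "03/04/2026" is 3 April in dd/MM/yyyy but 4 March in MM/dd/yyyy.
# Always state the format explicitly instead of relying on a default parse.
dd_mm = datetime.strptime("03/04/2026", "%d/%m/%Y")
mm_dd = datetime.strptime("03/04/2026", "%m/%d/%Y")

# Thousand separators break naive numeric casts; strip them before converting.
raw = "1,000.50"
amount = float(raw.replace(",", ""))
```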

Filtering data

Filtering removes rows that should not be in your analytical dataset.

Common filter patterns

-- Remove test/dummy records
DELETE FROM dbo.staging_sales WHERE store_key = -1 OR customer_key = 0;

-- Keep only the last 3 years of data
DELETE FROM dbo.fact_sales WHERE date_key < 20230101;

-- Remove outliers (amounts more than 5 standard deviations from mean)
WITH stats AS (
    SELECT AVG(total_amount) AS avg_amt, STDEV(total_amount) AS std_amt
    FROM dbo.fact_sales
)
DELETE f FROM dbo.fact_sales f
CROSS JOIN stats s
WHERE f.total_amount > s.avg_amt + (5 * s.std_amt);

PySpark filtering

# Filter out test records and future dates
from pyspark.sql.functions import current_date

df_filtered = df_raw \
    .filter(col("store_id") != "TEST") \
    .filter(col("transaction_date") <= current_date()) \
    .filter(col("amount") > 0)
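The 5-standard-deviation outlier cut from the SQL example above can be sketched in plain Python with the `statistics` module (the sample amounts are made up):

```python
import statistics

# Mirror the SQL pattern: drop values more than 5 standard deviations
# above the mean. Note that the outlier itself inflates the mean and
# standard deviation, so on very small samples it can survive the cut;
# here 30 normal rows give the statistics enough weight to catch it.
amounts = [10.0] * 30 + [500.0]  # 500.0 is the outlier
mean = statistics.mean(amounts)
std = statistics.pstdev(amounts)
kept = [a for a in amounts if a <= mean + 5 * std]
```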
Question

What is the best SQL technique for removing duplicate rows while keeping the latest record?


Answer

Use ROW_NUMBER() with PARTITION BY (the columns that define a duplicate) and ORDER BY (the column that determines which to keep β€” e.g., timestamp DESC). Delete rows where ROW_NUMBER > 1.


Question

Name three strategies for handling null values.


Answer

1. Replace with a default (ISNULL, COALESCE). 2. Remove the row (WHERE column IS NOT NULL, dropna). 3. Flag as unknown (replace NULL with 'Unknown' or a sentinel value). Choose based on whether the null makes the row unusable or just incomplete.


Question

What does schema enforcement do in Delta Lake?


Answer

Schema enforcement (also called schema validation) rejects writes that do not match the table's existing schema β€” wrong data types, missing required columns, or extra columns not in the schema. It prevents bad data from entering the table.


Knowledge Check

Dr. Sarah at Pacific Health finds that 8% of patient lab results have NULL values in the test_result column. She needs to calculate average test results by department. What should she do?

Knowledge Check

Anita at FreshCart's Bronze table has a date column stored as strings in format 'dd/MM/yyyy'. She needs to convert them to proper DATE type in the Silver table. Some rows have invalid dates ('31/02/2026'). What approach handles this safely?



Next up: Querying with SQL β€” SELECT, filter, and aggregate data using the Visual Query Editor and T-SQL.



© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.