Data Quality & Cleansing
Dirty data ruins reports. Learn how to find and fix duplicates, missing values, type mismatches, and inconsistencies, with SQL and PySpark examples.
Why does data quality matter?
Think of data quality like hygiene in a restaurant kitchen.
If your ingredients are contaminated (duplicate records), mislabelled (wrong data types), or missing key components (null values), the meal you serve (your report) will make people sick (wrong decisions). No amount of fancy cooking (DAX measures) can fix bad ingredients.
Data cleansing happens in the Silver layer of the medallion architecture, between raw ingestion (Bronze) and business-ready analytics (Gold). Get this right, and everything downstream works. Get it wrong, and nothing is trustworthy.
Finding and removing duplicates
Detect duplicates (SQL)
-- Find duplicate transactions (same store, product, date, amount)
SELECT store_key, product_key, date_key, total_amount, COUNT(*) AS dup_count
FROM dbo.fact_sales
GROUP BY store_key, product_key, date_key, total_amount
HAVING COUNT(*) > 1;
Remove duplicates (SQL, keep latest)
WITH ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY store_key, product_key, date_key, total_amount
ORDER BY ingestion_timestamp DESC
) AS rn
FROM dbo.fact_sales
)
DELETE FROM ranked WHERE rn > 1; -- deleting via the CTE removes rows from the underlying table
Remove duplicates (PySpark)
# Drop exact duplicates
df_deduped = df_raw.dropDuplicates()
# Drop duplicates based on specific columns
# (which duplicate survives is arbitrary unless you sort first)
df_deduped = df_raw.dropDuplicates(["store_id", "product_id", "transaction_date", "amount"])
Scenario: Dr. Sarah's duplicate patient records
Dr. Sarah at Pacific Health discovers that patient lab results are sometimes submitted twice: once by the lab system and once by a manual upload. The duplicate records have the same patient ID, test date, and test type, but different submission timestamps.
She uses ROW_NUMBER() to rank duplicates by submission timestamp and keeps only the earliest record (the original lab submission). This removes 3.2% of records, a significant error rate that was skewing average test results.
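Outside Spark, the same keep-earliest logic can be sketched in plain Python. The record fields and the keep_earliest helper below are illustrative, and ISO-8601 timestamp strings are assumed so that string comparison matches chronological order:

```python
# Keep the earliest record per (patient_id, test_date, test_type) key,
# mirroring ROW_NUMBER() ... ORDER BY submitted ASC, rn = 1.
def keep_earliest(records):
    earliest = {}
    for rec in records:
        key = (rec["patient_id"], rec["test_date"], rec["test_type"])
        if key not in earliest or rec["submitted"] < earliest[key]["submitted"]:
            earliest[key] = rec
    return list(earliest.values())

labs = [
    {"patient_id": 1, "test_date": "2026-01-05", "test_type": "A1C",
     "submitted": "2026-01-05T09:00:00", "source": "lab"},
    {"patient_id": 1, "test_date": "2026-01-05", "test_type": "A1C",
     "submitted": "2026-01-05T14:30:00", "source": "manual"},
]
deduped = keep_earliest(labs)
print(len(deduped), deduped[0]["source"])  # 1 lab
```

The manual upload is dropped and the original lab submission survives, matching the scenario above.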
Handling null and missing values
Null values need a strategy: ignore, replace, or remove.
Common null strategies
| Strategy | When to Use | SQL Example |
|---|---|---|
| Replace with default | The column must have a value for calculations | ISNULL(discount, 0) |
| Replace with average | Statistical analysis where nulls would skew results | COALESCE(price, AVG(price) OVER ()) |
| Remove the row | The null makes the entire record unusable | WHERE amount IS NOT NULL |
| Flag as unknown | Preserve the row but indicate missing data | ISNULL(region, 'Unknown') |
PySpark null handling
from pyspark.sql.functions import col, when, lit
# Replace nulls with a default
df_clean = df.fillna({"discount": 0, "region": "Unknown"})
# Drop rows where critical columns are null
df_clean = df.dropna(subset=["patient_id", "test_date"])
# Replace nulls conditionally
df_clean = df.withColumn(
"unit_price",
when(col("unit_price").isNull(), lit(0)).otherwise(col("unit_price"))
)
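The "replace with average" strategy from the table above can also be sketched in plain Python; the impute_mean helper and the sample values are illustrative:

```python
# Replace missing values with the mean of the observed (non-null) values,
# the same idea as COALESCE(price, AVG(price) OVER ()).
def impute_mean(values):
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)  # assumes at least one observed value
    return [mean if v is None else v for v in values]

prices = [10.0, None, 20.0, None]
print(impute_mean(prices))  # [10.0, 15.0, 20.0, 15.0]
```

Note that mean imputation preserves the column average but shrinks its variance, which is why the table reserves it for cases where dropping rows would skew results more.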
Converting data types
Data arriving from CSV files or APIs often has the wrong types: dates stored as strings, numbers stored as text.
SQL type conversion
-- Convert string dates to proper DATE type
SELECT
CAST(date_string AS DATE) AS transaction_date,
CAST(amount_string AS DECIMAL(18,2)) AS amount,
CAST(quantity_string AS INT) AS quantity
FROM staging.raw_transactions;
PySpark type conversion
from pyspark.sql.functions import to_date, col
df_typed = df_raw \
.withColumn("transaction_date", to_date(col("date_string"), "yyyy-MM-dd")) \
.withColumn("amount", col("amount_string").cast("decimal(18,2)")) \
.withColumn("quantity", col("quantity_string").cast("int"))
Exam tip: Type conversion gotchas
Watch for these traps:
- String to date: Different source systems use different formats (MM/dd/yyyy vs dd/MM/yyyy vs yyyy-MM-dd). Always specify the format explicitly.
- String to number: Commas as thousand separators (1,000.50) or periods as decimal separators (1.000,50 in European formats) cause conversion failures.
- Implicit conversion risks: Comparing a string column to a number without explicit CAST can give unexpected results or NULL.
- Delta schema enforcement: mergeSchema allows adding new columns during writes. Changing a column's data type typically requires an explicit cast-and-rewrite operation; mergeSchema alone does not handle type changes.
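The string-to-date and thousand-separator traps can be demonstrated in plain Python with datetime.strptime; the parse_date helper is illustrative. An invalid calendar date raises a ValueError rather than silently converting:

```python
from datetime import datetime

def parse_date(s, fmt="%d/%m/%Y"):
    """Parse with an explicit format; return None for invalid dates."""
    try:
        return datetime.strptime(s, fmt).date()
    except ValueError:
        return None

print(parse_date("15/03/2026"))  # 2026-03-15
print(parse_date("31/02/2026"))  # None -- February 31st does not exist

# Strip thousand separators before numeric conversion
amount = float("1,000.50".replace(",", ""))
print(amount)  # 1000.5
```

This mirrors how Spark's to_date behaves with an explicit format string (returning NULL for non-matching values), and why specifying the format is non-negotiable when sources mix MM/dd/yyyy and dd/MM/yyyy.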
Filtering data
Filtering removes rows that should not be in your analytical dataset.
Common filter patterns
-- Remove test/dummy records
DELETE FROM dbo.staging_sales WHERE store_key = -1 OR customer_key = 0;
-- Keep only the last 3 years of data
DELETE FROM dbo.fact_sales WHERE date_key < 20230101;
-- Remove outliers (amounts more than 5 standard deviations from mean)
WITH stats AS (
SELECT AVG(total_amount) AS avg_amt, STDEV(total_amount) AS std_amt
FROM dbo.fact_sales
)
DELETE f FROM dbo.fact_sales f
CROSS JOIN stats s
WHERE f.total_amount > s.avg_amt + (5 * s.std_amt);
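The five-sigma rule from the SQL above can be sketched in plain Python; the remove_outliers helper and sample data are illustrative. statistics.stdev computes the sample standard deviation, matching T-SQL's STDEV:

```python
from statistics import mean, stdev

def remove_outliers(amounts, sigmas=5):
    """Keep values no more than `sigmas` standard deviations above the mean."""
    mu, sd = mean(amounts), stdev(amounts)
    return [a for a in amounts if a <= mu + sigmas * sd]

# With only a handful of rows a single outlier can never reach five sigma
# (the largest possible z-score in n points is roughly sqrt(n)), so a
# realistically sized sample is needed to see the rule fire.
sales = [100] * 999 + [1_000_000]
cleaned = remove_outliers(sales)
print(len(sales), "->", len(cleaned))  # 1000 -> 999
```

Like the SQL version, this only trims the high side; a symmetric rule would also drop values below the mean minus five standard deviations.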
PySpark filtering
# Filter out test records and future dates
from pyspark.sql.functions import current_date
df_filtered = df_raw \
.filter(col("store_id") != "TEST") \
.filter(col("transaction_date") <= current_date()) \
.filter(col("amount") > 0)
Check your understanding
Dr. Sarah at Pacific Health finds that 8% of patient lab results have NULL values in the test_result column. She needs to calculate average test results by department. What should she do?
Anita at FreshCart's Bronze table has a date column stored as strings in format 'dd/MM/yyyy'. She needs to convert them to proper DATE type in the Silver table. Some rows have invalid dates ('31/02/2026'). What approach handles this safely?
🎬 Video coming soon
Next up: Querying with SQL - SELECT, filter, and aggregate data using the Visual Query Editor and T-SQL.