Data Quality & Cleansing
Dirty data ruins reports. Learn how to find and fix duplicates, missing values, type mismatches, and inconsistencies, with SQL and PySpark examples.
Why does data quality matter?
Think of data quality like hygiene in a restaurant kitchen.
If your ingredients are contaminated (duplicate records), mislabelled (wrong data types), or missing key components (null values), the meal you serve (your report) will make people sick (wrong decisions). No amount of fancy cooking (DAX measures) can fix bad ingredients.
Data cleansing happens in the Silver layer of the medallion architecture, between raw ingestion (Bronze) and business-ready analytics (Gold). Get this right, and everything downstream works. Get it wrong, and nothing is trustworthy.
Finding and removing duplicates
Detect duplicates (SQL)
-- Find duplicate transactions (same store, product, date, amount)
SELECT store_key, product_key, date_key, total_amount, COUNT(*) AS dup_count
FROM dbo.fact_sales
GROUP BY store_key, product_key, date_key, total_amount
HAVING COUNT(*) > 1;
Remove duplicates (SQL, keep latest)
WITH ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY store_key, product_key, date_key, total_amount
ORDER BY ingestion_timestamp DESC
) AS rn
FROM dbo.fact_sales
)
DELETE FROM ranked WHERE rn > 1; -- deleting via the CTE removes rows from the underlying table
Remove duplicates (PySpark)
# Drop exact duplicates
df_deduped = df_raw.dropDuplicates()
# Drop duplicates based on specific columns
# (which duplicate survives is arbitrary unless you sort first)
df_deduped = df_raw.dropDuplicates(["store_id", "product_id", "transaction_date", "amount"])
Scenario: Dr. Sarah's duplicate patient records
Dr. Sarah at Pacific Health discovers that patient lab results are sometimes submitted twice: once by the lab system and once by a manual upload. The duplicate records have the same patient ID, test date, and test type, but different submission timestamps.
She uses ROW_NUMBER() to rank duplicates by submission timestamp and keeps only the earliest record (the original lab submission). This removes 3.2% of records, a significant error rate that was skewing average test results.
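Outside Spark, the same keep-earliest logic can be sketched in plain Python. The record fields and the keep_earliest helper below are illustrative, and ISO-8601 timestamp strings are assumed so that string comparison matches chronological order:

```python
# Keep the earliest record per (patient_id, test_date, test_type) key,
# mirroring ROW_NUMBER() ... ORDER BY submitted ASC, rn = 1.
def keep_earliest(records):
    earliest = {}
    for rec in records:
        key = (rec["patient_id"], rec["test_date"], rec["test_type"])
        if key not in earliest or rec["submitted"] < earliest[key]["submitted"]:
            earliest[key] = rec
    return list(earliest.values())

labs = [
    {"patient_id": 1, "test_date": "2026-01-05", "test_type": "A1C",
     "submitted": "2026-01-05T09:00:00", "source": "lab"},
    {"patient_id": 1, "test_date": "2026-01-05", "test_type": "A1C",
     "submitted": "2026-01-05T14:30:00", "source": "manual"},
]
deduped = keep_earliest(labs)
print(len(deduped), deduped[0]["source"])  # 1 lab
```

The manual upload is dropped and the original lab submission survives, matching the scenario above.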
Handling null and missing values
Null values need a strategy: ignore, replace, or remove.
Common null strategies
| Strategy | When to Use | SQL Example |
|---|---|---|
| Replace with default | The column must have a value for calculations | ISNULL(discount, 0) |
| Replace with average | Statistical analysis where nulls would skew results | COALESCE(price, AVG(price) OVER ()) |
| Remove the row | The null makes the entire record unusable | WHERE amount IS NOT NULL |
| Flag as unknown | Preserve the row but indicate missing data | ISNULL(region, 'Unknown') |
PySpark null handling
from pyspark.sql.functions import col, when, lit
# Replace nulls with a default
df_clean = df.fillna({"discount": 0, "region": "Unknown"})
# Drop rows where critical columns are null
df_clean = df.dropna(subset=["patient_id", "test_date"])
# Replace nulls conditionally
df_clean = df.withColumn(
"unit_price",
when(col("unit_price").isNull(), lit(0)).otherwise(col("unit_price"))
)
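The "replace with average" strategy from the table above can also be sketched in plain Python; the impute_mean helper and the sample values are illustrative:

```python
# Replace missing values with the mean of the observed (non-null) values,
# the same idea as COALESCE(price, AVG(price) OVER ()).
def impute_mean(values):
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)  # assumes at least one observed value
    return [mean if v is None else v for v in values]

prices = [10.0, None, 20.0, None]
print(impute_mean(prices))  # [10.0, 15.0, 20.0, 15.0]
```

Note that mean imputation preserves the column average but shrinks its variance, which is why the table reserves it for cases where dropping rows would skew results more.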
Converting data types
Data arriving from CSV files or APIs often has the wrong types: dates stored as strings, numbers stored as text.
SQL type conversion
-- Convert string dates to proper DATE type
SELECT
CAST(date_string AS DATE) AS transaction_date,
CAST(amount_string AS DECIMAL(18,2)) AS amount,
CAST(quantity_string AS INT) AS quantity
FROM staging.raw_transactions;
PySpark type conversion
from pyspark.sql.functions import to_date, col
df_typed = df_raw \
.withColumn("transaction_date", to_date(col("date_string"), "yyyy-MM-dd")) \
.withColumn("amount", col("amount_string").cast("decimal(18,2)")) \
.withColumn("quantity", col("quantity_string").cast("int"))
Exam tip: Type conversion gotchas
Watch for these traps:
- String to date: Different source systems use different formats (MM/dd/yyyy vs dd/MM/yyyy vs yyyy-MM-dd). Always specify the format explicitly.
- String to number: Commas as thousand separators (1,000.50) or periods as decimal separators (1.000,50 in European formats) cause conversion failures.
- Implicit conversion risks: Comparing a string column to a number without explicit CAST can give unexpected results or NULL.
- Delta schema enforcement: mergeSchema allows adding new columns during writes. Changing a column's data type typically requires an explicit cast-and-rewrite operation; mergeSchema alone does not handle type changes.
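The string-to-date and thousand-separator traps can be demonstrated in plain Python with datetime.strptime; the parse_date helper is illustrative. An invalid calendar date raises a ValueError rather than silently converting:

```python
from datetime import datetime

def parse_date(s, fmt="%d/%m/%Y"):
    """Parse with an explicit format; return None for invalid dates."""
    try:
        return datetime.strptime(s, fmt).date()
    except ValueError:
        return None

print(parse_date("15/03/2026"))  # 2026-03-15
print(parse_date("31/02/2026"))  # None -- February 31st does not exist

# Strip thousand separators before numeric conversion
amount = float("1,000.50".replace(",", ""))
print(amount)  # 1000.5
```

This mirrors how Spark's to_date behaves with an explicit format string (returning NULL for non-matching values), and why specifying the format is non-negotiable when sources mix MM/dd/yyyy and dd/MM/yyyy.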
Filtering data
Filtering removes rows that should not be in your analytical dataset.
Common filter patterns
-- Remove test/dummy records
DELETE FROM dbo.staging_sales WHERE store_key = -1 OR customer_key = 0;
-- Keep only the last 3 years of data
DELETE FROM dbo.fact_sales WHERE date_key < 20230101;
-- Remove outliers (amounts more than 5 standard deviations from mean)
WITH stats AS (
SELECT AVG(total_amount) AS avg_amt, STDEV(total_amount) AS std_amt
FROM dbo.fact_sales
)
DELETE f FROM dbo.fact_sales f
CROSS JOIN stats s
WHERE f.total_amount > s.avg_amt + (5 * s.std_amt);
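The five-sigma rule from the SQL above can be sketched in plain Python; the remove_outliers helper and sample data are illustrative. statistics.stdev computes the sample standard deviation, matching T-SQL's STDEV:

```python
from statistics import mean, stdev

def remove_outliers(amounts, sigmas=5):
    """Keep values no more than `sigmas` standard deviations above the mean."""
    mu, sd = mean(amounts), stdev(amounts)
    return [a for a in amounts if a <= mu + sigmas * sd]

# With only a handful of rows a single outlier can never reach five sigma
# (the largest possible z-score in n points is roughly sqrt(n)), so a
# realistically sized sample is needed to see the rule fire.
sales = [100] * 999 + [1_000_000]
cleaned = remove_outliers(sales)
print(len(sales), "->", len(cleaned))  # 1000 -> 999
```

Like the SQL version, this only trims the high side; a symmetric rule would also drop values below the mean minus five standard deviations.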
PySpark filtering
# Filter out test records and future dates
from pyspark.sql.functions import current_date
df_filtered = df_raw \
.filter(col("store_id") != "TEST") \
.filter(col("transaction_date") <= current_date()) \
.filter(col("amount") > 0)
Check your understanding
Dr. Sarah at Pacific Health finds that 8% of patient lab results have NULL values in the test_result column. She needs to calculate average test results by department. What should she do?
Anita at FreshCart's Bronze table has a date column stored as strings in format 'dd/MM/yyyy'. She needs to convert them to proper DATE type in the Silver table. Some rows have invalid dates ('31/02/2026'). What approach handles this safely?
🎬 Video coming soon
Next up: Querying with SQL - SELECT, filter, and aggregate data using the Visual Query Editor and T-SQL.