Cleansing & Profiling Data
Profile data to understand distributions, choose appropriate column types, and resolve duplicates, missing values, and nulls: the cleaning step that makes data trustworthy.
Why profiling comes first
You wouldn't cook with ingredients you haven't inspected.
Data profiling is checking your ingredients before you start cooking. How many rows? Any spoiled items (nulls)? Any duplicates (two identical tomatoes)? Are the measurements right (correct data types)?
Skipping this step leads to pipelines that silently produce wrong results: the most dangerous kind of bug.
Data profiling techniques
```python
# Quick profile with describe()
df.describe().show()

# Detailed profiling with dbutils
dbutils.data.summarize(df)  # Databricks-specific: visual summary

# Custom profiling queries
from pyspark.sql.functions import count, countDistinct, sum as _sum, col, when

profile = df.select(
    count("*").alias("total_rows"),
    countDistinct("customer_id").alias("unique_customers"),
    _sum(when(col("email").isNull(), 1).otherwise(0)).alias("null_emails"),
    _sum(when(col("amount") < 0, 1).otherwise(0)).alias("negative_amounts"),
)
profile.show()
```
```sql
-- SQL profiling
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT customer_id) AS unique_customers,
  COUNT(*) - COUNT(email) AS null_emails,
  MIN(amount) AS min_amount,
  MAX(amount) AS max_amount,
  AVG(amount) AS avg_amount
FROM bronze.raw_orders;
```
Choosing column data types
| Data | Wrong Type | Right Type | Why It Matters |
|---|---|---|---|
| Currency amounts | STRING | DECIMAL(12,2) | String arithmetic fails; precision matters for money |
| Dates | STRING ('2026-04-21') | DATE or TIMESTAMP | String dates can't be compared, aggregated, or filtered properly |
| IDs | INTEGER | BIGINT or STRING | Integer IDs overflow; some IDs have leading zeros |
| Boolean flags | STRING ('Y'/'N') | BOOLEAN | Cleaner filters, smaller storage |
| Postcodes | INTEGER | STRING | Leading zeros are lost in numeric types (e.g., '01234' becomes 1234) |
```sql
-- Cast types during the bronze → silver transition
SELECT
  CAST(order_id AS BIGINT) AS order_id,
  CAST(amount AS DECIMAL(12,2)) AS amount,
  TO_DATE(order_date, 'yyyy-MM-dd') AS order_date,
  CASE WHEN active_flag = 'Y' THEN TRUE ELSE FALSE END AS is_active
FROM bronze.raw_orders;
```
Handling duplicates
```python
# Find duplicates
dupes = df.groupBy("order_id").count().filter("count > 1")

# Remove duplicates (keeps an arbitrary row per key — row order is not guaranteed)
clean_df = df.dropDuplicates(["order_id"])

# Deduplicate with a window function (deterministic: keep the latest record)
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

window = Window.partitionBy("order_id").orderBy(col("updated_at").desc())
deduped = (df
    .withColumn("rn", row_number().over(window))
    .filter("rn = 1")
    .drop("rn"))
```
Handling nulls and missing values
```python
# Drop rows where ALL columns are null
clean = df.na.drop(how="all")

# Drop rows where specific columns are null
clean = df.na.drop(subset=["customer_id", "amount"])

# Fill nulls with defaults
clean = df.na.fill({
    "email": "unknown@example.com",
    "amount": 0.0,
    "region": "UNASSIGNED",
})

# Replace nulls with column-specific logic
from pyspark.sql.functions import coalesce, col, lit

clean = df.withColumn("region", coalesce(col("region"), lit("UNASSIGNED")))
Knowledge check
Ravi discovers that 15% of customer records in DataPulse's bronze table have NULL email addresses. These customers are valid but simply haven't provided an email. What should he do?
Next up: Transforming & Loading Data, covering filter, group, join, union, pivot, merge, and append operations.