Cleansing & Profiling Data
Profile data to understand distributions, choose appropriate column types, and resolve duplicates, missing values, and nulls: the cleaning step that makes data trustworthy.
Why profiling comes first
You wouldn't cook with ingredients you haven't inspected.
Data profiling is checking your ingredients before you start cooking. How many rows? Any spoiled items (nulls)? Any duplicates (two identical tomatoes)? Are the measurements right (correct data types)?
Skipping this step leads to pipelines that silently produce wrong results: the most dangerous kind of bug.
Data profiling techniques
```python
# Quick profile with describe()
df.describe().show()

# Detailed profiling with dbutils
dbutils.data.summarize(df)  # Databricks-specific: visual summary

# Custom profiling queries
from pyspark.sql.functions import count, countDistinct, sum as _sum, col, when

profile = df.select(
    count("*").alias("total_rows"),
    countDistinct("customer_id").alias("unique_customers"),
    _sum(when(col("email").isNull(), 1).otherwise(0)).alias("null_emails"),
    _sum(when(col("amount") < 0, 1).otherwise(0)).alias("negative_amounts"),
)
profile.show()
```
```sql
-- SQL profiling
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT customer_id) AS unique_customers,
  COUNT(*) - COUNT(email) AS null_emails,
  MIN(amount) AS min_amount,
  MAX(amount) AS max_amount,
  AVG(amount) AS avg_amount
FROM bronze.raw_orders;
```
Choosing column data types
| Data | Wrong Type | Right Type | Why It Matters |
|---|---|---|---|
| Currency amounts | STRING | DECIMAL(12,2) | String arithmetic fails; precision matters for money |
| Dates | STRING ('2026-04-21') | DATE or TIMESTAMP | String dates can't be compared, aggregated, or filtered properly |
| IDs | INTEGER | BIGINT or STRING | Integer IDs overflow; some IDs have leading zeros |
| Boolean flags | STRING ('Y'/'N') | BOOLEAN | Cleaner filters, smaller storage |
| Postcodes | INTEGER | STRING | Leading zeros are lost in numeric types (e.g., '01234' becomes 1234) |
```sql
-- Cast types during the bronze → silver transition
SELECT
  CAST(order_id AS BIGINT) AS order_id,
  CAST(amount AS DECIMAL(12,2)) AS amount,
  TO_DATE(order_date, 'yyyy-MM-dd') AS order_date,
  CASE WHEN active_flag = 'Y' THEN TRUE ELSE FALSE END AS is_active
FROM bronze.raw_orders;
```
Handling duplicates
```python
# Find duplicates
dupes = df.groupBy("order_id").count().filter("count > 1")

# Remove duplicates (keeps an arbitrary row per key — row order is not guaranteed)
clean_df = df.dropDuplicates(["order_id"])

# Deduplicate with a window function (deterministic: keep the latest record)
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

window = Window.partitionBy("order_id").orderBy(col("updated_at").desc())
deduped = (df
    .withColumn("rn", row_number().over(window))
    .filter("rn = 1")
    .drop("rn"))
```
Handling nulls and missing values
```python
# Drop rows where ALL columns are null
clean = df.na.drop(how="all")

# Drop rows where specific columns are null
clean = df.na.drop(subset=["customer_id", "amount"])

# Fill nulls with defaults
clean = df.na.fill({
    "email": "unknown@example.com",
    "amount": 0.0,
    "region": "UNASSIGNED",
})

# Replace nulls with column-specific logic
from pyspark.sql.functions import coalesce, col, lit

clean = df.withColumn("region", coalesce(col("region"), lit("UNASSIGNED")))
Knowledge check
Ravi discovers that 15% of customer records in DataPulse's bronze table have NULL email addresses. These customers are valid but simply haven't provided an email. What should he do?
Next up: Transforming & Loading Data, covering filter, group, join, union, pivot, merge, and append operations.