
DP-750 Study Guide

Domain 1: Set Up and Configure an Azure Databricks Environment

  • Azure Databricks: Your Lakehouse Platform Free
  • Choosing the Right Compute Free
  • Configuring Compute for Performance Free
  • Unity Catalog: The Three-Level Namespace Free
  • Tables, Views & External Catalogs Free

Domain 2: Secure and Govern Unity Catalog Objects

  • Securing Unity Catalog: Who Gets What
  • Secrets & Authentication
  • Data Discovery & Attribute-Based Access
  • Row Filters, Column Masks & Retention
  • Lineage, Audit Logs & Delta Sharing

Domain 3: Prepare and Process Data

  • Data Modeling: Ingestion Design Free
  • SCD, Granularity & Temporal Tables
  • Partitioning, Clustering & Table Optimization
  • Ingesting Data: Lakeflow Connect & Notebooks
  • Ingesting Data: SQL Methods & CDC
  • Streaming Ingestion: Structured Streaming & Event Hubs
  • Auto Loader & Declarative Pipelines
  • Cleansing & Profiling Data Free
  • Transforming & Loading Data
  • Data Quality & Schema Enforcement

Domain 4: Deploy and Maintain Data Pipelines and Workloads

  • Building Data Pipelines Free
  • Lakeflow Jobs: Create & Configure
  • Lakeflow Jobs: Schedule, Alerts & Recovery
  • Git & Version Control
  • Testing & Databricks Asset Bundles
  • Monitoring Clusters & Troubleshooting
  • Spark Performance: DAG & Query Profile
  • Optimizing Delta Tables & Azure Monitor

Domain 3: Prepare and Process Data Free ⏱ ~12 min read

Cleansing & Profiling Data

Profile data to understand distributions, choose appropriate column types, and resolve duplicates, missing values, and nulls β€” the cleaning step that makes data trustworthy.

Why profiling comes first

β˜• Simple explanation

You wouldn’t cook with ingredients you haven’t inspected.

Data profiling is checking your ingredients before you start cooking. How many rows? Any spoiled items (nulls)? Any duplicates (two identical tomatoes)? Are the measurements right (correct data types)?

Skipping this step leads to pipelines that silently produce wrong results β€” the most dangerous kind of bug.

Data profiling generates summary statistics (count, min, max, mean, distinct count, null percentage) and assesses value distributions. It reveals data quality issues before you build transformations β€” incorrect types, unexpected nulls, duplicates, and outliers that would corrupt downstream analysis.

Data profiling techniques

# Quick profile with describe()
df.describe().show()

# Detailed profiling with dbutils
dbutils.data.summarize(df)  # Databricks-specific: visual summary

# Custom profiling queries
from pyspark.sql.functions import count, countDistinct, sum as _sum, col, when

profile = df.select([
    count("*").alias("total_rows"),
    countDistinct("customer_id").alias("unique_customers"),
    _sum(when(col("email").isNull(), 1).otherwise(0)).alias("null_emails"),
    _sum(when(col("amount") < 0, 1).otherwise(0)).alias("negative_amounts")
])
profile.show()

-- SQL profiling
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT customer_id) AS unique_customers,
  COUNT(*) - COUNT(email) AS null_emails,
  MIN(amount) AS min_amount,
  MAX(amount) AS max_amount,
  AVG(amount) AS avg_amount
FROM bronze.raw_orders;
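For intuition, the statistics those queries compute can be sketched in plain Python over a handful of invented sample rows β€” this is just the underlying logic, not Spark:

```python
from collections import Counter

# Invented sample rows standing in for bronze.raw_orders
rows = [
    {"customer_id": 1, "email": "a@x.com", "amount": 10.0, "region": "EU"},
    {"customer_id": 1, "email": None,      "amount": -5.0, "region": "EU"},
    {"customer_id": 2, "email": None,      "amount": 20.0, "region": "US"},
]

profile = {
    "total_rows": len(rows),
    # Distinct count: how many unique customers appear
    "unique_customers": len({r["customer_id"] for r in rows}),
    # Null percentage: share of rows with no email
    "null_email_pct": round(100 * sum(r["email"] is None for r in rows) / len(rows), 1),
    # Value distribution: frequency of each region
    "region_distribution": dict(Counter(r["region"] for r in rows)),
}
```

Two of three rows have a NULL email here, so `null_email_pct` comes out at 66.7 β€” exactly the kind of number that should make you pause before building transformations on top.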

Choosing column data types

Data             | Wrong Type             | Right Type        | Why It Matters
Currency amounts | STRING                 | DECIMAL(12,2)     | String arithmetic fails; precision matters for money
Dates            | STRING ("2026-04-21")  | DATE or TIMESTAMP | String dates can't be compared, aggregated, or filtered properly
IDs              | INTEGER                | BIGINT or STRING  | Integer IDs overflow; some IDs have leading zeros
Boolean flags    | STRING ("Y"/"N")       | BOOLEAN           | Cleaner filters, smaller storage
Postcodes        | INTEGER                | STRING            | Leading zeros are lost in numeric types (e.g., "01234" β†’ 1234)

-- Cast types during the bronze β†’ silver transition
SELECT
  CAST(order_id AS BIGINT) AS order_id,
  CAST(amount AS DECIMAL(12,2)) AS amount,
  TO_DATE(order_date, 'yyyy-MM-dd') AS order_date,
  CASE WHEN active_flag = 'Y' THEN TRUE ELSE FALSE END AS is_active
FROM bronze.raw_orders;
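The pitfalls in the table above can be demonstrated with a few plain-Python checks (illustrative only β€” the same failure modes apply to Spark types):

```python
from datetime import date
from decimal import Decimal

# Strings sort lexicographically, not numerically: "9" > "10" is True
string_order_wrong = "9" > "10"
numeric_order_right = 9 < 10

# Casting a postcode to an integer destroys the leading zero for good
postcode_as_int = int("01234")          # 1234 β€” information lost

# Binary floats accumulate rounding error; fixed-point Decimal does not
float_exact = (0.1 + 0.2 == 0.3)        # False
decimal_exact = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True

# A real date type supports comparison and arithmetic, not just display
order_date = date.fromisoformat("2026-04-21")
```

This is why currency goes in DECIMAL, postcodes stay STRING, and dates get cast to DATE at the bronze β†’ silver boundary.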

Handling duplicates

# Find duplicates
dupes = df.groupBy("order_id").count().filter("count > 1")

# Remove duplicates (keep first occurrence)
clean_df = df.dropDuplicates(["order_id"])

# Deduplicate with window function (keep latest)
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

window = Window.partitionBy("order_id").orderBy(col("updated_at").desc())
deduped = (df
    .withColumn("rn", row_number().over(window))
    .filter("rn = 1")
    .drop("rn"))
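The "keep latest per key" window dedup above boils down to a single pass that keeps, for each order_id, the row with the greatest updated_at. A plain-Python sketch with invented rows (ISO date strings compare correctly as strings, which mirrors ordering by a timestamp):

```python
# Invented sample rows: order 1 appears twice, order 2 once
rows = [
    {"order_id": 1, "updated_at": "2026-01-01", "status": "pending"},
    {"order_id": 1, "updated_at": "2026-01-03", "status": "shipped"},
    {"order_id": 2, "updated_at": "2026-01-02", "status": "pending"},
]

latest = {}
for r in rows:
    prev = latest.get(r["order_id"])
    # Equivalent of orderBy(desc) + row_number() = 1: keep the newest row per key
    if prev is None or r["updated_at"] > prev["updated_at"]:
        latest[r["order_id"]] = r

deduped = sorted(latest.values(), key=lambda r: r["order_id"])
```

Spark distributes this same logic across partitions, which is why the window is partitioned by the dedup key.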

Handling nulls and missing values

# Drop rows where ALL columns are null
clean = df.na.drop(how="all")

# Drop rows where specific columns are null
clean = df.na.drop(subset=["customer_id", "amount"])

# Fill nulls with defaults
clean = df.na.fill({
    "email": "unknown@example.com",
    "amount": 0.0,
    "region": "UNASSIGNED"
})

# Replace nulls with column-specific logic
from pyspark.sql.functions import coalesce, col, lit
clean = df.withColumn("region", coalesce(col("region"), lit("UNASSIGNED")))
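What na.fill and coalesce do per row is simple to state in plain Python β€” replace a null with the column's default, leave everything else alone (sample rows and defaults are invented):

```python
# Per-column defaults, matching the na.fill example above
defaults = {"email": "unknown@example.com", "amount": 0.0, "region": "UNASSIGNED"}

rows = [
    {"email": None,      "amount": None, "region": "EU"},
    {"email": "b@x.com", "amount": 12.5, "region": None},
]

# coalesce semantics: first non-null of (value, default)
filled = [
    {k: (defaults[k] if v is None else v) for k, v in r.items()}
    for r in rows
]
```

Whether this is the right move depends on what the null means: fill when it means "not provided", but consider dropping the row when it means the record itself is invalid.
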
Question

What does data profiling tell you and why do it first?


Answer

Profiling reveals row counts, null percentages, distinct values, min/max ranges, and data type mismatches. Do it first to catch quality issues before they corrupt downstream transformations.


Question

What are the three strategies for handling null values?


Answer

1) Drop rows with nulls (na.drop). 2) Fill with defaults (na.fill). 3) Replace with logic (coalesce with fallback values). Choose based on whether nulls mean 'missing' or 'not applicable'.


Question

Why should postcodes be STRING, not INTEGER?


Answer

Numeric types drop leading zeros. Postcode '01234' becomes 1234 as an integer. STRING preserves the exact value.



Knowledge Check

Ravi discovers that 15% of customer records in DataPulse's bronze table have NULL email addresses. These customers are valid but simply haven't provided an email. What should he do?


Next up: Transforming & Loading Data β€” filter, group, join, union, pivot, merge, and append operations.


Guided

I learn, I simplify, I share.


© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.