DP-750 Study Guide

Domain 1: Set Up and Configure an Azure Databricks Environment

  • Azure Databricks: Your Lakehouse Platform Free
  • Choosing the Right Compute Free
  • Configuring Compute for Performance Free
  • Unity Catalog: The Three-Level Namespace Free
  • Tables, Views & External Catalogs Free

Domain 2: Secure and Govern Unity Catalog Objects

  • Securing Unity Catalog: Who Gets What
  • Secrets & Authentication
  • Data Discovery & Attribute-Based Access
  • Row Filters, Column Masks & Retention
  • Lineage, Audit Logs & Delta Sharing

Domain 3: Prepare and Process Data

  • Data Modeling: Ingestion Design Free
  • SCD, Granularity & Temporal Tables
  • Partitioning, Clustering & Table Optimization
  • Ingesting Data: Lakeflow Connect & Notebooks
  • Ingesting Data: SQL Methods & CDC
  • Streaming Ingestion: Structured Streaming & Event Hubs
  • Auto Loader & Declarative Pipelines
  • Cleansing & Profiling Data Free
  • Transforming & Loading Data
  • Data Quality & Schema Enforcement

Domain 4: Deploy and Maintain Data Pipelines and Workloads

  • Building Data Pipelines Free
  • Lakeflow Jobs: Create & Configure
  • Lakeflow Jobs: Schedule, Alerts & Recovery
  • Git & Version Control
  • Testing & Databricks Asset Bundles
  • Monitoring Clusters & Troubleshooting
  • Spark Performance: DAG & Query Profile
  • Optimizing Delta Tables & Azure Monitor


Domain 3: Prepare and Process Data (~14 min read)

Transforming & Loading Data

Filter, group, aggregate, join, union, pivot, unpivot, merge, insert, and append β€” the transformation operations that turn raw data into analytics-ready tables.

Transform operations

β˜• Simple explanation

Transformations are the cooking steps in your data kitchen.

Raw ingredients (bronze data) get chopped (filtered), mixed (joined), measured (aggregated), and plated (loaded). Each operation reshapes data towards its final form.

Data transformation in Databricks uses SQL or PySpark DataFrame operations. The exam tests your knowledge of filtering, grouping, aggregating, set operations (UNION, INTERSECT, EXCEPT), joins, pivot/unpivot, and loading operations (MERGE, INSERT, APPEND).

Filter, group, aggregate

-- Filter
SELECT * FROM sales WHERE amount > 100 AND region = 'APAC';

-- Group and aggregate
SELECT region, product_category,
  COUNT(*) AS order_count,
  SUM(amount) AS total_revenue,
  AVG(amount) AS avg_order_value
FROM sales
GROUP BY region, product_category
HAVING SUM(amount) > 10000;
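The grouped aggregate above is standard SQL, so its behaviour can be checked against any engine. A minimal sketch using Python's built-in sqlite3 with hypothetical sample rows (on Databricks the same statement would run via Spark SQL):

```python
import sqlite3

# Hypothetical in-memory sales table standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product_category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("APAC", "grocery", 8000.0),
        ("APAC", "grocery", 4000.0),
        ("APAC", "electronics", 500.0),
        ("EMEA", "grocery", 9000.0),
    ],
)

# Group, aggregate, then keep only groups above the HAVING threshold.
rows = conn.execute(
    """
    SELECT region, product_category,
           COUNT(*)    AS order_count,
           SUM(amount) AS total_revenue,
           AVG(amount) AS avg_order_value
    FROM sales
    GROUP BY region, product_category
    HAVING SUM(amount) > 10000
    """
).fetchall()

print(rows)  # only APAC/grocery (12000 total) passes the HAVING filter
```

Note that HAVING filters groups after aggregation, while WHERE filters rows before it.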

Joins

Join Type       | Returns                                | Use Case
INNER JOIN      | Only matching rows from both tables    | Standard lookups
LEFT JOIN       | All left rows + matching right rows    | Keep all orders even if customer is missing
RIGHT JOIN      | All right rows + matching left rows    | Rarely used (rewrite as LEFT JOIN)
FULL OUTER JOIN | All rows from both tables              | Reconciliation, finding orphans
CROSS JOIN      | Every combination (Cartesian product)  | Generating test data, date spine
ANTI JOIN       | Left rows that have NO match in right  | Finding orphans, data quality

-- Enrich orders with customer info
SELECT o.*, c.name, c.region
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id;

-- Find orders with no matching customer (anti join)
SELECT o.*
FROM orders o
LEFT ANTI JOIN customers c ON o.customer_id = c.customer_id;
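LEFT ANTI JOIN is Spark SQL syntax; in engines without it, the same "orphan" rows come from a LEFT JOIN plus an IS NULL filter on the right-side key. A sketch with sqlite3 and toy orders/customers data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10), (2, 11), (3, 99)])
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(10, "Ana"), (11, "Ben")])

# Portable anti-join: keep left rows whose join key found no match on the right.
orphans = conn.execute(
    """
    SELECT o.order_id
    FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.customer_id
    WHERE c.customer_id IS NULL
    """
).fetchall()

print(orphans)  # order 3 references customer 99, which does not exist
```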

Set operations

-- UNION ALL: combine all rows (keeps duplicates)
SELECT product_id, amount FROM online_sales
UNION ALL
SELECT product_id, amount FROM store_sales;

-- UNION: combine rows (removes duplicates)
SELECT customer_id FROM premium_customers
UNION
SELECT customer_id FROM loyalty_customers;

-- INTERSECT: rows in BOTH sets
SELECT customer_id FROM online_customers
INTERSECT
SELECT customer_id FROM store_customers;

-- EXCEPT: rows in first set but NOT in second
SELECT customer_id FROM all_customers
EXCEPT
SELECT customer_id FROM churned_customers;
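All four set operations are standard SQL, so their semantics can be demoed with sqlite3 and toy customer IDs (the duplicate 2 in online_customers makes the UNION vs UNION ALL difference visible):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE online_customers (customer_id INTEGER);
CREATE TABLE store_customers  (customer_id INTEGER);
INSERT INTO online_customers VALUES (1), (2), (2), (3);
INSERT INTO store_customers  VALUES (2), (3), (4);
""")

def q(sql):
    return [r[0] for r in conn.execute(sql).fetchall()]

# UNION ALL keeps every row (4 + 3 = 7); UNION deduplicates first.
union_all   = q("SELECT customer_id FROM online_customers UNION ALL SELECT customer_id FROM store_customers")
union       = q("SELECT customer_id FROM online_customers UNION     SELECT customer_id FROM store_customers")
both        = q("SELECT customer_id FROM online_customers INTERSECT SELECT customer_id FROM store_customers")
online_only = q("SELECT customer_id FROM online_customers EXCEPT    SELECT customer_id FROM store_customers")

print(len(union_all), sorted(union), sorted(both), online_only)
```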

Pivot and unpivot

-- PIVOT: rows to columns (monthly revenue per region)
SELECT * FROM (
  SELECT region, month, revenue FROM monthly_sales
)
PIVOT (SUM(revenue) FOR month IN ('Jan', 'Feb', 'Mar'));
-- Result: region | Jan | Feb | Mar

-- UNPIVOT: columns to rows (reverse of pivot)
SELECT * FROM pivoted_sales
UNPIVOT (revenue FOR month IN (Jan, Feb, Mar));
-- Result: region | month | revenue

Mei Lin uses PIVOT to create Freshmart’s monthly dashboard table where each month becomes a column for easy comparison.
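The mechanics behind PIVOT and UNPIVOT can be sketched in plain Python on toy (region, month, revenue) rows, the shape monthly_sales would have. This is an illustration of the reshaping, not the Databricks implementation:

```python
# Toy rows standing in for monthly_sales.
rows = [
    ("APAC", "Jan", 100), ("APAC", "Feb", 120),
    ("EMEA", "Jan", 90),  ("EMEA", "Mar", 80),
]
months = ["Jan", "Feb", "Mar"]

# PIVOT: one output row per region, one column per month (missing -> None,
# just as SQL PIVOT fills absent combinations with NULL).
pivoted = {}
for region, month, revenue in rows:
    pivoted.setdefault(region, {m: None for m in months})[month] = revenue

# UNPIVOT: back to (region, month, revenue) rows, dropping the Nones --
# which is why pivot followed by unpivot loses only the empty cells.
unpivoted = [
    (region, m, cols[m])
    for region, cols in pivoted.items()
    for m in months
    if cols[m] is not None
]

print(pivoted)
print(sorted(unpivoted) == sorted(rows))  # round-trips for non-empty cells
```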

Loading operations

INSERT INTO (append)

-- Append new rows
INSERT INTO silver.orders
SELECT * FROM staging.new_orders;

INSERT OVERWRITE (replace partition)

-- Replace a specific partition
INSERT OVERWRITE silver.orders
PARTITION (order_date = '2026-04-21')
SELECT * FROM staging.daily_orders WHERE order_date = '2026-04-21';
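The idempotency difference between the two is the exam-relevant point: rerunning an append duplicates rows, rerunning an overwrite does not. A toy simulation with a dict keyed by partition (an illustration of the semantics, not how Delta stores data):

```python
# Simulate a table partitioned by order_date as {partition: [rows]}.
table = {"2026-04-20": ["a", "b"]}
staged = ["c", "d"]

def insert_into(table, partition, rows):
    # INSERT INTO appends: running it twice duplicates the rows.
    table.setdefault(partition, []).extend(rows)

def insert_overwrite(table, partition, rows):
    # INSERT OVERWRITE replaces the partition: reruns are idempotent.
    table[partition] = list(rows)

insert_into(table, "2026-04-21", staged)
insert_into(table, "2026-04-21", staged)       # rerun -> duplicates
appended = len(table["2026-04-21"])            # 4 rows

insert_overwrite(table, "2026-04-21", staged)
insert_overwrite(table, "2026-04-21", staged)  # rerun -> same result
overwritten = len(table["2026-04-21"])         # 2 rows

print(appended, overwritten)
```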

MERGE INTO (upsert)

The most powerful loading operation β€” handles inserts, updates, and deletes in one statement:

MERGE INTO silver.customers AS target
USING staging.customer_updates AS source
ON target.customer_id = source.customer_id

-- Update existing records
WHEN MATCHED THEN UPDATE SET
  target.name = source.name,
  target.email = source.email,
  target.updated_at = CURRENT_TIMESTAMP()

-- Insert new records
WHEN NOT MATCHED THEN INSERT *

-- Delete records not in source (optional)
WHEN NOT MATCHED BY SOURCE THEN DELETE;

Operation   | INSERT INTO       | INSERT OVERWRITE        | MERGE INTO
Action      | Append rows       | Replace partition/table | Insert + update + delete
Duplicates? | Possible          | No (replaces)           | No (match on key)
Idempotent? | No                | Yes                     | Yes
Best for    | Append-only loads | Full partition refresh  | SCD, upsert, sync
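The three MERGE clauses can be sketched as an upsert on plain dicts keyed by customer_id. The data and the merge_into helper below are hypothetical; they illustrate the semantics, not the Delta Lake implementation:

```python
def merge_into(target, source):
    """Sketch of MERGE INTO semantics on dicts keyed by customer_id."""
    result = {}
    for key, updates in source.items():
        if key in target:
            # WHEN MATCHED THEN UPDATE SET ...: overlay the changed columns.
            result[key] = {**target[key], **updates}
        else:
            # WHEN NOT MATCHED THEN INSERT *
            result[key] = dict(updates)
    # Keys only in target are omitted: WHEN NOT MATCHED BY SOURCE THEN DELETE.
    return result

target = {
    1: {"name": "Ana", "email": "ana@old.example"},
    2: {"name": "Ben", "email": "ben@example.com"},  # absent from source -> deleted
}
source = {
    1: {"email": "ana@new.example"},                 # matched -> updated
    3: {"name": "Caz", "email": "caz@example.com"},  # new -> inserted
}

merged = merge_into(target, source)
print(merged)
```

One statement, one scan of the data: that single pass over insert, update, and delete cases is what makes MERGE the standard tool for CDC and SCD loads.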
Question

What is the difference between UNION and UNION ALL?

Answer

UNION removes duplicate rows (slower β€” requires comparison). UNION ALL keeps all rows including duplicates (faster β€” no comparison). Use UNION ALL unless you specifically need deduplication.

Question

What is the MERGE INTO statement used for?

Answer

MERGE (upsert) handles inserts, updates, and deletes in one statement. Match on a key column: update existing rows (WHEN MATCHED), insert new rows (WHEN NOT MATCHED), optionally delete (WHEN NOT MATCHED BY SOURCE).

Question

When should you use PIVOT vs UNPIVOT?

Answer

PIVOT: turn row values into columns (e.g., months as columns for a dashboard). UNPIVOT: turn columns back into rows (e.g., column-per-month back to month + value rows for analysis).

Knowledge Check

Ravi needs to sync DataPulse's customer table with daily updates from the CRM. Some customers are new (insert), some changed (update), and some were deleted. Which SQL operation handles all three cases?


Next up: Data Quality & Schema Enforcement β€” validation checks, schema enforcement, schema drift, and pipeline expectations.


Guided

I learn, I simplify, I share.

A Guide to Cloud YouTube Feedback

© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.