Transforming & Loading Data

Transform operations

Simple explanation

Transformations are the cooking steps in your data kitchen.

Raw ingredients (bronze data) get chopped (filtered), mixed (joined), measured (aggregated), and plated (loaded). Each operation reshapes data towards its final form.

-- Filter
SELECT * FROM sales WHERE amount > 100 AND region = 'APAC';

-- Group and aggregate
SELECT region, product_category,
  COUNT(*) AS order_count,
  SUM(amount) AS total_revenue,
  AVG(amount) AS avg_order_value
FROM sales
GROUP BY region, product_category
HAVING SUM(amount) > 10000;

Joins

Join Type	Returns	Use Case
INNER JOIN	Only matching rows from both tables	Standard lookups
LEFT JOIN	All left rows + matching right rows	Keep all orders even if customer is missing
RIGHT JOIN	All right rows + matching left rows	Rarely used (rewrite as LEFT JOIN)
FULL OUTER JOIN	All rows from both tables	Reconciliation, finding orphans
CROSS JOIN	Every combination (cartesian product)	Generating test data, date spine
ANTI JOIN	Left rows that have NO match in right	Finding orphans, data quality

-- Enrich orders with customer info
SELECT o.*, c.name, c.region
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id;

-- Find orders with no matching customer (anti join)
SELECT o.*
FROM orders o
LEFT ANTI JOIN customers c ON o.customer_id = c.customer_id;

Set operations

-- UNION ALL: combine all rows (keeps duplicates)
SELECT product_id, amount FROM online_sales
UNION ALL
SELECT product_id, amount FROM store_sales;

-- UNION: combine rows (removes duplicates)
SELECT customer_id FROM premium_customers
UNION
SELECT customer_id FROM loyalty_customers;

-- INTERSECT: rows in BOTH sets
SELECT customer_id FROM online_customers
INTERSECT
SELECT customer_id FROM store_customers;

-- EXCEPT: rows in first set but NOT in second
SELECT customer_id FROM all_customers
EXCEPT
SELECT customer_id FROM churned_customers;

Pivot and unpivot

-- PIVOT: rows to columns (monthly revenue per region)
SELECT * FROM (
  SELECT region, month, revenue FROM monthly_sales
)
PIVOT (SUM(revenue) FOR month IN ('Jan', 'Feb', 'Mar'));
-- Result: region | Jan | Feb | Mar

-- UNPIVOT: columns to rows (reverse of pivot)
SELECT * FROM pivoted_sales
UNPIVOT (revenue FOR month IN (Jan, Feb, Mar));
-- Result: region | month | revenue

Mei Lin uses PIVOT to create Freshmart’s monthly dashboard table where each month becomes a column for easy comparison.

Loading operations

INSERT INTO (append)

-- Append new rows
INSERT INTO silver.orders
SELECT * FROM staging.new_orders;

INSERT OVERWRITE (replace partition)

-- Replace a specific partition
INSERT OVERWRITE silver.orders
PARTITION (order_date = '2026-04-21')
SELECT * FROM staging.daily_orders WHERE order_date = '2026-04-21';

MERGE INTO (upsert)

The most powerful loading operation — handles inserts, updates, and deletes in one statement:

MERGE INTO silver.customers AS target
USING staging.customer_updates AS source
ON target.customer_id = source.customer_id

-- Update existing records
WHEN MATCHED THEN UPDATE SET
  target.name = source.name,
  target.email = source.email,
  target.updated_at = CURRENT_TIMESTAMP()

-- Insert new records
WHEN NOT MATCHED THEN INSERT *

-- Delete records not in source (optional)
WHEN NOT MATCHED BY SOURCE THEN DELETE;

Operation	INSERT INTO	INSERT OVERWRITE	MERGE INTO
Action	Append rows	Replace partition/table	Insert + update + delete
Duplicates?	Possible	No (replaces)	No (match on key)
Idempotent?	No	Yes	Yes
Best for	Append-only loads	Full partition refresh	SCD, upsert, sync

Question

What is the difference between UNION and UNION ALL?

Click or press Enter to reveal answer

Answer

UNION removes duplicate rows (slower — requires comparison). UNION ALL keeps all rows including duplicates (faster — no comparison). Use UNION ALL unless you specifically need deduplication.

Click to flip back

Question

What is the MERGE INTO statement used for?

Click or press Enter to reveal answer

Answer

MERGE (upsert) handles inserts, updates, and deletes in one statement. Match on a key column: update existing rows (WHEN MATCHED), insert new rows (WHEN NOT MATCHED), optionally delete (WHEN NOT MATCHED BY SOURCE).

Click to flip back

Question

When should you use PIVOT vs UNPIVOT?

Click or press Enter to reveal answer

Answer

PIVOT: turn row values into columns (e.g., months as columns for a dashboard). UNPIVOT: turn columns back into rows (e.g., column-per-month back to month + value rows for analysis).

Click to flip back

Knowledge check

Knowledge Check

Ravi needs to sync DataPulse's customer table with daily updates from the CRM. Some customers are new (insert), some changed (update), and some were deleted. Which SQL operation handles all three cases?

Next up: Data Quality & Schema Enforcement — validation checks, schema enforcement, schema drift, and pipeline expectations.

Transform operations

Filter, group, aggregate

Joins

Set operations

Pivot and unpivot

Loading operations

INSERT INTO (append)

INSERT OVERWRITE (replace partition)

MERGE INTO (upsert)

Knowledge check