Transforming & Loading Data
Filter, group, aggregate, join, union, pivot, unpivot, merge, insert, and append β the transformation operations that turn raw data into analytics-ready tables.
Transform operations
Transformations are the cooking steps in your data kitchen.
Raw ingredients (bronze data) get chopped (filtered), mixed (joined), measured (aggregated), and plated (loaded). Each operation reshapes data towards its final form.
Filter, group, aggregate
-- Filter
SELECT * FROM sales WHERE amount > 100 AND region = 'APAC';
-- Group and aggregate
SELECT region, product_category,
COUNT(*) AS order_count,
SUM(amount) AS total_revenue,
AVG(amount) AS avg_order_value
FROM sales
GROUP BY region, product_category
HAVING SUM(amount) > 10000;
Joins
| Join Type | Returns | Use Case |
|---|---|---|
| INNER JOIN | Only matching rows from both tables | Standard lookups |
| LEFT JOIN | All left rows + matching right rows | Keep all orders even if customer is missing |
| RIGHT JOIN | All right rows + matching left rows | Rarely used (rewrite as LEFT JOIN) |
| FULL OUTER JOIN | All rows from both tables | Reconciliation, finding orphans |
| CROSS JOIN | Every combination (cartesian product) | Generating test data, date spine |
| ANTI JOIN | Left rows that have NO match in right | Finding orphans, data quality |
-- Enrich orders with customer info
SELECT o.*, c.name, c.region
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id;
-- Find orders with no matching customer (anti join)
SELECT o.*
FROM orders o
LEFT ANTI JOIN customers c ON o.customer_id = c.customer_id;
Set operations
-- UNION ALL: combine all rows (keeps duplicates)
SELECT product_id, amount FROM online_sales
UNION ALL
SELECT product_id, amount FROM store_sales;
-- UNION: combine rows (removes duplicates)
SELECT customer_id FROM premium_customers
UNION
SELECT customer_id FROM loyalty_customers;
-- INTERSECT: rows in BOTH sets
SELECT customer_id FROM online_customers
INTERSECT
SELECT customer_id FROM store_customers;
-- EXCEPT: rows in first set but NOT in second
SELECT customer_id FROM all_customers
EXCEPT
SELECT customer_id FROM churned_customers;
Pivot and unpivot
-- PIVOT: rows to columns (monthly revenue per region)
SELECT * FROM (
SELECT region, month, revenue FROM monthly_sales
)
PIVOT (SUM(revenue) FOR month IN ('Jan', 'Feb', 'Mar'));
-- Result: region | Jan | Feb | Mar
-- UNPIVOT: columns to rows (reverse of pivot)
SELECT * FROM pivoted_sales
UNPIVOT (revenue FOR month IN (Jan, Feb, Mar));
-- Result: region | month | revenue
Mei Lin uses PIVOT to create Freshmartβs monthly dashboard table where each month becomes a column for easy comparison.
Loading operations
INSERT INTO (append)
-- Append new rows
INSERT INTO silver.orders
SELECT * FROM staging.new_orders;
INSERT OVERWRITE (replace partition)
-- Replace a specific partition
INSERT OVERWRITE silver.orders
PARTITION (order_date = '2026-04-21')
SELECT * FROM staging.daily_orders WHERE order_date = '2026-04-21';
MERGE INTO (upsert)
The most powerful loading operation β handles inserts, updates, and deletes in one statement:
MERGE INTO silver.customers AS target
USING staging.customer_updates AS source
ON target.customer_id = source.customer_id
-- Update existing records
WHEN MATCHED THEN UPDATE SET
target.name = source.name,
target.email = source.email,
target.updated_at = CURRENT_TIMESTAMP()
-- Insert new records
WHEN NOT MATCHED THEN INSERT *
-- Delete records not in source (optional)
WHEN NOT MATCHED BY SOURCE THEN DELETE;
| Operation | INSERT INTO | INSERT OVERWRITE | MERGE INTO |
|---|---|---|---|
| Action | Append rows | Replace partition/table | Insert + update + delete |
| Duplicates? | Possible | No (replaces) | No (match on key) |
| Idempotent? | No | Yes | Yes |
| Best for | Append-only loads | Full partition refresh | SCD, upsert, sync |
π¬ Video coming soon
Knowledge check
Ravi needs to sync DataPulse's customer table with daily updates from the CRM. Some customers are new (insert), some changed (update), and some were deleted. Which SQL operation handles all three cases?
Next up: Data Quality & Schema Enforcement β validation checks, schema enforcement, schema drift, and pipeline expectations.