Transforming Data: Reshape & Enrich
Shape raw data into analytical gold. Add columns, merge tables, denormalize hierarchies, and build aggregate layers, with SQL and Spark examples.
What does "transform" mean?
Think of transforming data like preparing ingredients for a meal.
Raw ingredients arrive from the farm: unpeeled, unsorted, some bruised. Before you can cook (analyse), you need to wash, peel, chop, and organise. That is data transformation: taking raw data and reshaping it into something ready for analysis.
Common transformations: adding a "full name" column from first + last name (enriching), combining two tables into one (merging), flattening hierarchies into a single table (denormalizing), and pre-computing totals (aggregating).
Enriching data: Adding columns and tables
Enrichment adds business context to raw data.
Adding computed columns (SQL)
-- Add a profit margin column to the product summary
ALTER TABLE dbo.product_summary
ADD margin_pct AS (revenue - cost) / NULLIF(revenue, 0);
-- Add a fiscal quarter from the date (July-start fiscal year: July = Q1)
SELECT *,
CASE
WHEN MONTH(order_date) >= 7 THEN 'FY' + CAST(YEAR(order_date)+1 AS VARCHAR) + '-Q' + CAST((MONTH(order_date)-7)/3 + 1 AS VARCHAR)
ELSE 'FY' + CAST(YEAR(order_date) AS VARCHAR) + '-Q' + CAST((MONTH(order_date)+5)/3 + 1 AS VARCHAR)
END AS fiscal_quarter
FROM dbo.fact_sales;
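The CASE expression is the fiddly part. A small Python sketch of the same July-start logic (the function name is illustrative, not from the article) makes the mapping easy to check:

```python
def fiscal_quarter(year: int, month: int) -> str:
    """July-start fiscal year: July is Q1, and the FY label is the
    calendar year in which the fiscal year ends (July 2024 -> FY2025)."""
    if month >= 7:
        return f"FY{year + 1}-Q{(month - 7) // 3 + 1}"
    return f"FY{year}-Q{(month + 5) // 3 + 1}"

print(fiscal_quarter(2024, 7))   # start of the fiscal year -> FY2025-Q1
print(fiscal_quarter(2024, 4))   # April is the last quarter -> FY2024-Q4
```

Integer division here plays the same role as the integer arithmetic in the SQL CASE branches.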
Adding computed columns (PySpark)
from pyspark.sql.functions import col, when
# Add a loyalty segment based on purchase count
df_enriched = df_customers.withColumn(
"loyalty_segment",
when(col("total_purchases") >= 100, "Platinum")
.when(col("total_purchases") >= 50, "Gold")
.when(col("total_purchases") >= 10, "Silver")
.otherwise("Bronze")
)
Adding lookup tables
Sometimes enrichment means adding an entirely new table that provides context:
-- Create a holiday calendar dimension for seasonal analysis
CREATE TABLE dbo.dim_holidays (
date_key INT,
holiday_name NVARCHAR(100),
is_public_holiday BIT
);
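To see the lookup table in action, here is a minimal sqlite3 sketch (the table shapes follow the DDL above; the sample rows and the LEFT JOIN query are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_holidays (date_key INT, holiday_name TEXT, is_public_holiday INT);
    CREATE TABLE fact_sales (date_key INT, total_amount REAL);
    INSERT INTO dim_holidays VALUES (20240101, 'New Year', 1);
    INSERT INTO fact_sales VALUES (20240101, 99.0), (20240102, 42.0);
""")
# LEFT JOIN keeps every sale; holiday_name is NULL on ordinary days
rows = conn.execute("""
    SELECT s.date_key, s.total_amount, h.holiday_name
    FROM fact_sales AS s
    LEFT JOIN dim_holidays AS h ON s.date_key = h.date_key
    ORDER BY s.date_key
""").fetchall()
print(rows)
```

The same pattern — joining the fact table to the calendar dimension on a date key — is how seasonal analysis queries consume this table.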
Denormalizing data
Denormalization flattens normalized tables into fewer, wider tables. It is the standard pattern for moving from the Silver layer to the Gold layer.
Example: Flattening a product hierarchy
Before (normalised, 3 tables):
product → subcategory → category
After (denormalised, 1 table):
CREATE TABLE dbo.dim_product_denorm AS
SELECT
p.product_id,
p.product_name,
p.unit_price,
sc.subcategory_name,
c.category_name,
c.department
FROM dbo.products p
JOIN dbo.subcategories sc ON p.subcategory_id = sc.subcategory_id
JOIN dbo.categories c ON sc.category_id = c.category_id;
Now queries need only one join instead of three.
Exam tip: Denormalization trade-offs
The exam tests whether you understand the trade-off:
- Pro: Faster reads: fewer joins, simpler queries, better Power BI performance
- Pro: Simpler semantic models: relationships are cleaner in a star schema
- Con: Data redundancy: the category name is repeated on every product row
- Con: Update complexity: if a category name changes, every product row needs updating
For analytical workloads (DP-600's focus), the pros almost always outweigh the cons. Denormalize your dimensions.
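The update-complexity con is easy to demonstrate. In this hypothetical sqlite3 sketch (toy data, made up for illustration), renaming a category changes one row in the normalised model but every matching row in the denormalised table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categories (category_id INT, category_name TEXT);
    CREATE TABLE dim_product_denorm (product_id INT, category_name TEXT);
    INSERT INTO categories VALUES (1, 'Produce');
    INSERT INTO dim_product_denorm VALUES (10, 'Produce'), (11, 'Produce'), (12, 'Produce');
""")
# Normalised: the rename touches exactly one row
n_norm = conn.execute(
    "UPDATE categories SET category_name = 'Fresh Produce' WHERE category_id = 1"
).rowcount
# Denormalised: the rename touches every product row carrying the old name
n_denorm = conn.execute(
    "UPDATE dim_product_denorm SET category_name = 'Fresh Produce'"
    " WHERE category_name = 'Produce'"
).rowcount
print(n_norm, n_denorm)
```

With millions of product rows, that second UPDATE is why denormalised tables are usually rebuilt from the Silver layer rather than patched in place.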
Aggregating data
Aggregation pre-computes summaries to speed up common queries.
SQL aggregation
-- Create a daily aggregate from transaction-level data
CREATE TABLE dbo.agg_daily_sales AS
SELECT
date_key,
store_key,
product_key,
SUM(quantity) AS total_quantity,
SUM(total_amount) AS total_revenue,
COUNT(*) AS transaction_count,
AVG(total_amount) AS avg_transaction_value
FROM dbo.fact_sales
GROUP BY date_key, store_key, product_key;
PySpark aggregation
from pyspark.sql import functions as F

# Daily aggregate by store
df_daily = df_sales.groupBy("date_key", "store_key") \
.agg(
F.sum("total_amount").alias("total_revenue"),
F.count("sale_id").alias("transaction_count"),
F.avg("total_amount").alias("avg_transaction")
)
# Write to a Gold-layer Delta table
df_daily.write.format("delta").mode("overwrite").saveAsTable("gold_daily_store_sales")
Merging and joining data
Combining data from multiple sources is one of the most common transformation tasks.
Join types
| Join | What It Returns | Use Case |
|---|---|---|
| INNER JOIN | Only rows with matches in both tables | Match sales to products |
| LEFT JOIN | All rows from left + matching from right (NULL if no match) | All customers, even those with no sales |
| FULL OUTER JOIN | All rows from both tables | Reconciliation: find unmatched records |
| CROSS JOIN | Every combination of rows | Generate a date-store matrix for gap analysis |
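The first two rows of the table can be sketched with sqlite3 (toy customer/sales data, invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INT, name TEXT);
    CREATE TABLE sales (sale_id INT, customer_id INT);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO sales VALUES (100, 1);
""")
# INNER JOIN: only customers with a matching sale survive
inner = conn.execute("""
    SELECT c.name FROM customers AS c
    JOIN sales AS s ON c.customer_id = s.customer_id
""").fetchall()
# LEFT JOIN: every customer is kept, with NULL where no sale matched
left = conn.execute("""
    SELECT c.name, s.sale_id FROM customers AS c
    LEFT JOIN sales AS s ON c.customer_id = s.customer_id
    ORDER BY c.customer_id
""").fetchall()
print(inner)  # Ana only: Ben has no sale
print(left)   # Ben kept, sale_id is NULL
```

The LEFT JOIN result is the "all customers, even those with no sales" case from the table above.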
MERGE (upsert pattern)
MERGE dbo.dim_product AS target
USING staging.new_products AS source
ON target.product_id = source.product_id
WHEN MATCHED THEN
UPDATE SET target.product_name = source.product_name,
target.unit_price = source.unit_price
WHEN NOT MATCHED THEN
INSERT (product_id, product_name, category, unit_price)
VALUES (source.product_id, source.product_name, source.category, source.unit_price);
Scenario: Anita merges supplier updates
Anita at FreshCart receives weekly product updates from suppliers: some are new products, some are price changes. She uses MERGE to handle both in one operation:
- WHEN MATCHED: Update the price and description
- WHEN NOT MATCHED: Insert the new product
- WHEN NOT MATCHED BY SOURCE: (Optional) Flag discontinued products
This single MERGE statement replaces what would otherwise be a three-step process: check if exists, update or insert, flag missing.
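A Python analogue of the three-clause upsert (dict-based, with invented names; the real operation runs in SQL as shown above) clarifies what each clause contributes:

```python
def merge_products(target: dict, source: dict, flag_missing: bool = False) -> dict:
    """Upsert source rows into target, keyed by product_id."""
    result = {}
    for pid, row in target.items():
        if pid in source:
            result[pid] = {**row, **source[pid]}      # WHEN MATCHED: update
        else:
            row = dict(row)
            if flag_missing:
                row["discontinued"] = True            # WHEN NOT MATCHED BY SOURCE: flag
            result[pid] = row
    for pid, row in source.items():
        if pid not in target:
            result[pid] = dict(row)                   # WHEN NOT MATCHED: insert
    return result

target = {1: {"name": "Apple", "price": 1.00}, 3: {"name": "Kiwi", "price": 2.00}}
source = {1: {"price": 1.20}, 2: {"name": "Pear", "price": 0.90}}
print(merge_products(target, source, flag_missing=True))
```

Product 1 gets the new price, product 2 is inserted, and product 3 (absent from the source feed) is flagged: exactly the three branches of Anita's MERGE.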
Anita at FreshCart has a normalised product hierarchy: product → subcategory → category → department. A Power BI report that shows 'revenue by department' requires three joins. What should Anita do to improve performance?
Raj at Atlas Capital receives daily position updates from a trading platform. Some records are new positions; others are updates to existing positions. He needs a single operation that inserts new records and updates existing ones. Which SQL pattern should he use?
🎬 Video coming soon
Next up: Data Quality & Cleansing, where you'll find and fix duplicates, missing values, type mismatches, and other data quality issues.