Transforming Data: Reshape & Enrich
Shape raw data into analytical gold. Add columns, merge tables, denormalize hierarchies, and build aggregate layers, with SQL and Spark examples.
What does "transform" mean?
Think of transforming data like preparing ingredients for a meal.
Raw ingredients arrive from the farm: unpeeled, unsorted, some bruised. Before you can cook (analyse), you need to wash, peel, chop, and organise. That is data transformation: taking raw data and reshaping it into something ready for analysis.
Common transformations: adding a "full name" column from first + last name (enriching), combining two tables into one (merging), flattening hierarchies into a single table (denormalizing), and pre-computing totals (aggregating).
Enriching data: Adding columns and tables
Enrichment adds business context to raw data.
Adding computed columns (SQL)
-- Add a profit margin column to the product summary
ALTER TABLE dbo.product_summary
ADD margin_pct AS (revenue - cost) / NULLIF(revenue, 0);
-- Add a fiscal quarter from the date (July-start fiscal year: July = Q1)
SELECT *,
CASE
WHEN MONTH(order_date) >= 7 THEN 'FY' + CAST(YEAR(order_date)+1 AS VARCHAR) + '-Q' + CAST((MONTH(order_date)-7)/3 + 1 AS VARCHAR)
ELSE 'FY' + CAST(YEAR(order_date) AS VARCHAR) + '-Q' + CAST((MONTH(order_date)+5)/3 + 1 AS VARCHAR)
END AS fiscal_quarter
FROM dbo.fact_sales;
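The CASE expression is the fiddly part. A small Python sketch of the same July-start logic (the function name is illustrative, not from the article) makes the mapping easy to check:

```python
def fiscal_quarter(year: int, month: int) -> str:
    """July-start fiscal year: July is Q1, and the FY label is the
    calendar year in which the fiscal year ends (July 2024 -> FY2025)."""
    if month >= 7:
        return f"FY{year + 1}-Q{(month - 7) // 3 + 1}"
    return f"FY{year}-Q{(month + 5) // 3 + 1}"

print(fiscal_quarter(2024, 7))   # start of the fiscal year -> FY2025-Q1
print(fiscal_quarter(2024, 4))   # April is the last quarter -> FY2024-Q4
```

Integer division here plays the same role as the integer arithmetic in the SQL CASE branches.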
Adding computed columns (PySpark)
from pyspark.sql.functions import col, when
# Add a loyalty segment based on purchase count
df_enriched = df_customers.withColumn(
"loyalty_segment",
when(col("total_purchases") >= 100, "Platinum")
.when(col("total_purchases") >= 50, "Gold")
.when(col("total_purchases") >= 10, "Silver")
.otherwise("Bronze")
)
Adding lookup tables
Sometimes enrichment means adding an entirely new table that provides context:
-- Create a holiday calendar dimension for seasonal analysis
CREATE TABLE dbo.dim_holidays (
date_key INT,
holiday_name NVARCHAR(100),
is_public_holiday BIT
);
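To see the lookup table in action, here is a minimal sqlite3 sketch (the table shapes follow the DDL above; the sample rows and the LEFT JOIN query are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_holidays (date_key INT, holiday_name TEXT, is_public_holiday INT);
    CREATE TABLE fact_sales (date_key INT, total_amount REAL);
    INSERT INTO dim_holidays VALUES (20240101, 'New Year', 1);
    INSERT INTO fact_sales VALUES (20240101, 99.0), (20240102, 42.0);
""")
# LEFT JOIN keeps every sale; holiday_name is NULL on ordinary days
rows = conn.execute("""
    SELECT s.date_key, s.total_amount, h.holiday_name
    FROM fact_sales AS s
    LEFT JOIN dim_holidays AS h ON s.date_key = h.date_key
    ORDER BY s.date_key
""").fetchall()
print(rows)
```

The same pattern — joining the fact table to the calendar dimension on a date key — is how seasonal analysis queries consume this table.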
Denormalizing data
Denormalization flattens normalized tables into fewer, wider tables. It is the standard pattern for moving from the Silver layer to the Gold layer.
Example: Flattening a product hierarchy
Before (normalised, 3 tables):
product → subcategory → category
After (denormalised, 1 table):
CREATE TABLE dbo.dim_product_denorm AS
SELECT
p.product_id,
p.product_name,
p.unit_price,
sc.subcategory_name,
c.category_name,
c.department
FROM dbo.products p
JOIN dbo.subcategories sc ON p.subcategory_id = sc.subcategory_id
JOIN dbo.categories c ON sc.category_id = c.category_id;
Now queries need only one join instead of three.
Exam tip: Denormalization trade-offs
The exam tests whether you understand the trade-off:
- Pro: Faster reads: fewer joins, simpler queries, better Power BI performance
- Pro: Simpler semantic models: relationships are cleaner in a star schema
- Con: Data redundancy: the category name is repeated on every product row
- Con: Update complexity: if a category name changes, every product row needs updating
For analytical workloads (DP-600's focus), the pros almost always outweigh the cons. Denormalize your dimensions.
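The update-complexity con is easy to demonstrate. In this hypothetical sqlite3 sketch (toy data, made up for illustration), renaming a category changes one row in the normalised model but every matching row in the denormalised table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categories (category_id INT, category_name TEXT);
    CREATE TABLE dim_product_denorm (product_id INT, category_name TEXT);
    INSERT INTO categories VALUES (1, 'Produce');
    INSERT INTO dim_product_denorm VALUES (10, 'Produce'), (11, 'Produce'), (12, 'Produce');
""")
# Normalised: the rename touches exactly one row
n_norm = conn.execute(
    "UPDATE categories SET category_name = 'Fresh Produce' WHERE category_id = 1"
).rowcount
# Denormalised: the rename touches every product row carrying the old name
n_denorm = conn.execute(
    "UPDATE dim_product_denorm SET category_name = 'Fresh Produce'"
    " WHERE category_name = 'Produce'"
).rowcount
print(n_norm, n_denorm)
```

With millions of product rows, that second UPDATE is why denormalised tables are usually rebuilt from the Silver layer rather than patched in place.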
Aggregating data
Aggregation pre-computes summaries to speed up common queries.
SQL aggregation
-- Create a daily aggregate from transaction-level data
CREATE TABLE dbo.agg_daily_sales AS
SELECT
date_key,
store_key,
product_key,
SUM(quantity) AS total_quantity,
SUM(total_amount) AS total_revenue,
COUNT(*) AS transaction_count,
AVG(total_amount) AS avg_transaction_value
FROM dbo.fact_sales
GROUP BY date_key, store_key, product_key;
PySpark aggregation
from pyspark.sql import functions as F

# Daily aggregate by store
df_daily = df_sales.groupBy("date_key", "store_key") \
.agg(
F.sum("total_amount").alias("total_revenue"),
F.count("sale_id").alias("transaction_count"),
F.avg("total_amount").alias("avg_transaction")
)
# Write to a Gold-layer Delta table
df_daily.write.format("delta").mode("overwrite").saveAsTable("gold_daily_store_sales")
Merging and joining data
Combining data from multiple sources is one of the most common transformation tasks.
Join types
| Join | What It Returns | Use Case |
|---|---|---|
| INNER JOIN | Only rows with matches in both tables | Match sales to products |
| LEFT JOIN | All rows from left + matching from right (NULL if no match) | All customers, even those with no sales |
| FULL OUTER JOIN | All rows from both tables | Reconciliation: find unmatched records |
| CROSS JOIN | Every combination of rows | Generate a date-store matrix for gap analysis |
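The first two rows of the table can be sketched with sqlite3 (toy customer/sales data, invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INT, name TEXT);
    CREATE TABLE sales (sale_id INT, customer_id INT);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO sales VALUES (100, 1);
""")
# INNER JOIN: only customers with a matching sale survive
inner = conn.execute("""
    SELECT c.name FROM customers AS c
    JOIN sales AS s ON c.customer_id = s.customer_id
""").fetchall()
# LEFT JOIN: every customer is kept, with NULL where no sale matched
left = conn.execute("""
    SELECT c.name, s.sale_id FROM customers AS c
    LEFT JOIN sales AS s ON c.customer_id = s.customer_id
    ORDER BY c.customer_id
""").fetchall()
print(inner)  # Ana only: Ben has no sale
print(left)   # Ben kept, sale_id is NULL
```

The LEFT JOIN result is the "all customers, even those with no sales" case from the table above.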
MERGE (upsert pattern)
MERGE dbo.dim_product AS target
USING staging.new_products AS source
ON target.product_id = source.product_id
WHEN MATCHED THEN
UPDATE SET target.product_name = source.product_name,
target.unit_price = source.unit_price
WHEN NOT MATCHED THEN
INSERT (product_id, product_name, category, unit_price)
VALUES (source.product_id, source.product_name, source.category, source.unit_price);
Scenario: Anita merges supplier updates
Anita at FreshCart receives weekly product updates from suppliers: some are new products, some are price changes. She uses MERGE to handle both in one operation:
- WHEN MATCHED: Update the price and description
- WHEN NOT MATCHED: Insert the new product
- WHEN NOT MATCHED BY SOURCE: (Optional) Flag discontinued products
This single MERGE statement replaces what would otherwise be a three-step process: check if exists, update or insert, flag missing.
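A Python analogue of the three-clause upsert (dict-based, with invented names; the real operation runs in SQL as shown above) clarifies what each clause contributes:

```python
def merge_products(target: dict, source: dict, flag_missing: bool = False) -> dict:
    """Upsert source rows into target, keyed by product_id."""
    result = {}
    for pid, row in target.items():
        if pid in source:
            result[pid] = {**row, **source[pid]}      # WHEN MATCHED: update
        else:
            row = dict(row)
            if flag_missing:
                row["discontinued"] = True            # WHEN NOT MATCHED BY SOURCE: flag
            result[pid] = row
    for pid, row in source.items():
        if pid not in target:
            result[pid] = dict(row)                   # WHEN NOT MATCHED: insert
    return result

target = {1: {"name": "Apple", "price": 1.00}, 3: {"name": "Kiwi", "price": 2.00}}
source = {1: {"price": 1.20}, 2: {"name": "Pear", "price": 0.90}}
print(merge_products(target, source, flag_missing=True))
```

Product 1 gets the new price, product 2 is inserted, and product 3 (absent from the source feed) is flagged: exactly the three branches of Anita's MERGE.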
Anita at FreshCart has a normalised product hierarchy: product → subcategory → category → department. A Power BI report that shows 'revenue by department' requires three joins. What should Anita do to improve performance?
Raj at Atlas Capital receives daily position updates from a trading platform. Some records are new positions; others are updates to existing positions. He needs a single operation that inserts new records and updates existing ones. Which SQL pattern should he use?
🎬 Video coming soon
Next up: Data Quality & Cleansing, where you'll find and fix duplicates, missing values, type mismatches, and other data quality issues.