Dimensional Modeling: Prep for Analytics
Design star schemas for Fabric lakehouses. Build fact tables, dimension tables, and handle slowly changing dimensions (SCD Type 1 and Type 2).
What is dimensional modeling?
Think of a receipt from a shop.
The receipt has facts: what you bought, how many, the price. But it also references things that don't change often: the store (its address, region, manager), the product (its category, brand, weight), and the date (which quarter, which financial year).
A dimensional model separates these two things: the fact table (the receipt, which is big, event-driven, and grows every day) and the dimension tables (the reference data for store, product, and date, which are smaller and change slowly).
This design makes analytics fast because Power BI and SQL can slice facts by dimensions: "Show me revenue by region, by product category, by quarter."
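The slicing idea can be sketched in a few lines of plain Python (a toy in-memory example with made-up sample rows, not Fabric code): join each fact row to its dimension row via the foreign key, then aggregate a measure by a dimension attribute.

```python
# Toy star schema: a fact list plus one dimension dict keyed by surrogate key.
fact_sales = [
    {"store_key": 1, "revenue": 100.0},
    {"store_key": 2, "revenue": 250.0},
    {"store_key": 1, "revenue": 50.0},
]
dim_store = {
    1: {"name": "Leeds", "region": "North"},
    2: {"name": "Bristol", "region": "South"},
}

# "Revenue by region": look up each fact's dimension row, group by an attribute.
revenue_by_region = {}
for row in fact_sales:
    region = dim_store[row["store_key"]]["region"]
    revenue_by_region[region] = revenue_by_region.get(region, 0.0) + row["revenue"]

print(revenue_by_region)  # {'North': 150.0, 'South': 250.0}
```

In a lakehouse the same lookup is a join between FactSales and DimStore; Power BI performs it automatically when the tables are related in the model.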
Star schema design
A star schema has one fact table at the centre surrounded by dimension tables.
                 ┌──────────────┐
                 │  DimProduct  │
                 └──────┬───────┘
                        │
┌────────────┐   ┌──────┴───────┐   ┌──────────────┐
│  DimStore  ├───┤  FactSales   ├───┤   DimDate    │
└────────────┘   └──────┬───────┘   └──────────────┘
                        │
                 ┌──────┴───────┐
                 │ DimCustomer  │
                 └──────────────┘
Fact tables
| Characteristic | Detail |
|---|---|
| Contains | Measures (revenue, quantity, cost) + foreign keys to dimensions |
| Grain | One row per event (one sale, one production run, one order line) |
| Size | Large: millions to billions of rows |
| Growth | Appended daily/hourly (new events) |
| Typical columns | date_key, product_key, store_key, customer_key, quantity, revenue, cost |
Dimension tables
| Characteristic | Detail |
|---|---|
| Contains | Descriptive attributes for filtering and grouping |
| Size | Small to medium: thousands to low millions |
| Growth | Slow: new products, new stores, address changes |
| Typical columns | product_key, product_name, category, brand, weight, is_active |
Scenario: Carlos designs a production star schema
Carlos models Precision Manufacturing's production data:
FactProduction (grain: one row per production batch)
- Keys: date_key, factory_key, product_key, machine_key
- Measures: units_produced, units_defective, runtime_minutes, energy_kwh
Dimensions: DimDate, DimFactory (name, region, capacity), DimProduct (name, category, weight), DimMachine (type, installation_date, maintenance_due)
Power BI can now slice defect rates by factory, by product category, and by quarter, without complex joins at query time.
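The kind of rollup this schema enables can be sketched in plain Python (hypothetical sample rows, not Carlos's real data): sum the measures per dimension attribute, then derive the defect rate.

```python
# Hypothetical FactProduction rows: foreign keys plus additive measures.
fact_production = [
    {"factory_key": 1, "units_produced": 1000, "units_defective": 20},
    {"factory_key": 1, "units_produced": 500,  "units_defective": 5},
    {"factory_key": 2, "units_produced": 800,  "units_defective": 40},
]
dim_factory = {
    1: {"name": "Plant A", "region": "North"},
    2: {"name": "Plant B", "region": "South"},
}

# Defect rate by factory: sum both measures per factory, then divide.
totals = {}
for row in fact_production:
    name = dim_factory[row["factory_key"]]["name"]
    produced, defective = totals.get(name, (0, 0))
    totals[name] = (produced + row["units_produced"],
                    defective + row["units_defective"])

defect_rate = {name: round(d / p, 4) for name, (p, d) in totals.items()}
print(defect_rate)  # {'Plant A': 0.0167, 'Plant B': 0.05}
```

Note that the rate is computed from the summed measures, not averaged per row; ratios like defect rate are not additive across batches.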
Slowly Changing Dimensions (SCD)
Dimension data changes over time. A customer moves to a new city. A product changes category. How you handle these changes is called the Slowly Changing Dimension strategy.
| Feature | SCD Type 1 | SCD Type 2 |
|---|---|---|
| What happens on change | Overwrite the old value with the new value | Keep the old row, add a new row with the new value |
| History preserved? | No: only the current value exists | Yes: both old and new values exist with date ranges |
| Extra columns needed | None | effective_date, end_date, is_current flag |
| Table size impact | No growth from changes | Grows with each change (one new row per change) |
| Use when | History doesn't matter (fix a typo, update a phone number) | History matters (customer moved; past sales must be attributed to the old region) |
| Implementation | MERGE with WHEN MATCHED THEN UPDATE | MERGE with UPDATE (expire old row) + INSERT (new row) |
SCD Type 1 with Delta MERGE
-- Overwrite: update existing rows, insert new ones
MERGE INTO DimCustomer AS target
USING staging_customers AS source
  ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET target.city = source.city,
             target.phone = source.phone
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, city, phone)
  VALUES (source.customer_id, source.name, source.city, source.phone)
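The same Type 1 semantics in a small Python sketch, with an in-memory dict standing in for the Delta table (the MERGE above is the real Fabric pattern; this just shows that the old value is lost):

```python
# DimCustomer keyed by customer_id; Type 1 simply overwrites attributes.
dim_customer = {
    "C1": {"name": "Ana", "city": "Leeds", "phone": "111"},
}
staging = [
    {"customer_id": "C1", "name": "Ana", "city": "York", "phone": "111"},  # changed city
    {"customer_id": "C2", "name": "Ben", "city": "Bath", "phone": "222"},  # new customer
]

for src in staging:
    key = src["customer_id"]
    if key in dim_customer:
        # WHEN MATCHED: overwrite in place -- the old city is gone for good
        dim_customer[key]["city"] = src["city"]
        dim_customer[key]["phone"] = src["phone"]
    else:
        # WHEN NOT MATCHED: insert the new customer
        dim_customer[key] = {"name": src["name"], "city": src["city"],
                             "phone": src["phone"]}

print(dim_customer["C1"]["city"])  # York -- "Leeds" no longer exists anywhere
```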
SCD Type 2 with Delta MERGE
-- Step 1: Expire current rows that have changes; insert brand-new customers
MERGE INTO DimCustomer AS target
USING staging_customers AS source
  ON target.customer_id = source.customer_id
 AND target.is_current = true
WHEN MATCHED AND (target.city != source.city OR target.phone != source.phone) THEN
  UPDATE SET target.is_current = false,
             target.end_date = current_date()
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, city, phone, effective_date, end_date, is_current)
  VALUES (source.customer_id, source.name, source.city, source.phone, current_date(), NULL, true)

-- Step 2: Insert new current rows for the customers expired in step 1
INSERT INTO DimCustomer (customer_id, name, city, phone, effective_date, end_date, is_current)
SELECT customer_id, name, city, phone, current_date(), NULL, true
FROM staging_customers s
WHERE EXISTS (
  SELECT 1 FROM DimCustomer d
  WHERE d.customer_id = s.customer_id
    AND d.is_current = false
    AND d.end_date = current_date()
)
AND NOT EXISTS (
  SELECT 1 FROM DimCustomer d2
  WHERE d2.customer_id = s.customer_id
    AND d2.is_current = true
)
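The expire-and-insert pattern can also be traced in a Python sketch (a list of rows with is_current flags instead of a Delta table; a fixed date keeps the example deterministic):

```python
from datetime import date

today = date(2024, 6, 1)  # stand-in for current_date()
dim_customer = [
    {"customer_id": "C1", "city": "Leeds", "effective_date": date(2020, 1, 1),
     "end_date": None, "is_current": True},
]
staging = [{"customer_id": "C1", "city": "York"}]  # the customer moved

for src in staging:
    current = next((r for r in dim_customer
                    if r["customer_id"] == src["customer_id"] and r["is_current"]),
                   None)
    if current and current["city"] != src["city"]:
        # Step 1: expire the old row -- keep it so history survives
        current["is_current"] = False
        current["end_date"] = today
        # Step 2: insert the new current row with an open-ended date range
        dim_customer.append({"customer_id": src["customer_id"], "city": src["city"],
                             "effective_date": today, "end_date": None,
                             "is_current": True})

# Two rows now exist: the expired Leeds row and the current York row,
# so past facts can still join to the city that was valid at the time.
```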
Exam tip: Choosing SCD type
The exam often describes a scenario and asks which SCD type to use:
- "A customer changes their email address; we don't need to track old emails" → Type 1 (overwrite)
- "A customer moves to a new city; past sales should be attributed to the old city" → Type 2 (preserve history)
- "Fix a misspelled product name" → Type 1 (correction, not a meaningful change)
- "A product changes category; reports should show which category it was in at the time of each sale" → Type 2
The deciding question: "Does the old value need to be associated with historical facts?" If yes → Type 2.
Denormalization
In operational databases, data is normalised (split across many tables to avoid duplication). For analytics, you denormalise: flatten joins into fewer, wider tables for faster queries.
| Normalised (source) | Denormalised (lakehouse) |
|---|---|
| Orders → OrderLines → Products → Categories | FactSales with product_name, category_name already joined |
| Employees → Departments → Regions | DimEmployee with department_name, region_name included |
Denormalization happens during transformation (PySpark joins or SQL views): you take normalised source data and produce star schema tables.
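The flattening itself is just a chain of key lookups performed once at transform time. A minimal Python sketch of the Orders → Products → Categories case (toy data, not the real source system):

```python
# Normalised source: three small tables linked by keys.
orders = [{"order_id": 1, "product_id": 10, "quantity": 2}]
products = {10: {"name": "Widget", "category_id": 100}}
categories = {100: {"name": "Hardware"}}

# Denormalise: walk the keys once, emit one wide fact row per order.
fact_sales = []
for o in orders:
    p = products[o["product_id"]]
    fact_sales.append({
        "order_id": o["order_id"],
        "quantity": o["quantity"],
        "product_name": p["name"],  # joined in now, no lookup at query time
        "category_name": categories[p["category_id"]]["name"],
    })

print(fact_sales[0]["category_name"])  # Hardware
```

In Fabric the same step would be a PySpark join or a SQL view; the point is that the joins are paid once during transformation rather than on every report query.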
A product in Precision Manufacturing changes from the 'Standard' category to 'Premium'. Historical reports must show the product in the 'Standard' category for past production data and 'Premium' for current data. Which SCD type should Carlos use?
An analyst asks Carlos: 'Why is the DimCustomer table in the lakehouse wider (more columns) than the Customer table in the source SAP system?' What is the most accurate answer?
🎬 Video coming soon
Next up: Data Stores & Tools: Make the Right Choice. Decide between lakehouses, warehouses, and KQL databases, and pick the best transformation tool.