Dimensional Modeling: Prep for Analytics
Design star schemas for Fabric lakehouses. Build fact tables, dimension tables, and handle slowly changing dimensions (SCD Type 1 and Type 2).
What is dimensional modeling?
Think of a receipt from a shop.
The receipt has facts: what you bought, how many, the price. But it also references things that don't change often: the store (its address, region, manager), the product (its category, brand, weight), and the date (which quarter, which financial year).
A dimensional model separates these two things: the fact table (the receipt, which is big, event-driven, and grows every day) and the dimension tables (the reference data for store, product, and date, which are smaller and change slowly).
This design makes analytics fast because Power BI and SQL can slice facts by dimensions: "Show me revenue by region, by product category, by quarter."
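The slicing idea can be sketched in a few lines of plain Python (a toy in-memory example with made-up sample rows, not Fabric code): join each fact row to its dimension row via the foreign key, then aggregate a measure by a dimension attribute.

```python
# Toy star schema: a fact list plus one dimension dict keyed by surrogate key.
fact_sales = [
    {"store_key": 1, "revenue": 100.0},
    {"store_key": 2, "revenue": 250.0},
    {"store_key": 1, "revenue": 50.0},
]
dim_store = {
    1: {"name": "Leeds", "region": "North"},
    2: {"name": "Bristol", "region": "South"},
}

# "Revenue by region": look up each fact's dimension row, group by an attribute.
revenue_by_region = {}
for row in fact_sales:
    region = dim_store[row["store_key"]]["region"]
    revenue_by_region[region] = revenue_by_region.get(region, 0.0) + row["revenue"]

print(revenue_by_region)  # {'North': 150.0, 'South': 250.0}
```

In a lakehouse the same lookup is a join between FactSales and DimStore; Power BI performs it automatically when the tables are related in the model.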
Star schema design
A star schema has one fact table at the centre surrounded by dimension tables.
                 ┌──────────────┐
                 │  DimProduct  │
                 └──────┬───────┘
                        │
┌────────────┐   ┌──────┴───────┐   ┌──────────────┐
│  DimStore  ├───┤  FactSales   ├───┤   DimDate    │
└────────────┘   └──────┬───────┘   └──────────────┘
                        │
                 ┌──────┴───────┐
                 │ DimCustomer  │
                 └──────────────┘
Fact tables
| Characteristic | Detail |
|---|---|
| Contains | Measures (revenue, quantity, cost) + foreign keys to dimensions |
| Grain | One row per event (one sale, one production run, one order line) |
| Size | Large: millions to billions of rows |
| Growth | Appended daily/hourly (new events) |
| Typical columns | date_key, product_key, store_key, customer_key, quantity, revenue, cost |
Dimension tables
| Characteristic | Detail |
|---|---|
| Contains | Descriptive attributes for filtering and grouping |
| Size | Small to medium: thousands to low millions |
| Growth | Slow: new products, new stores, address changes |
| Typical columns | product_key, product_name, category, brand, weight, is_active |
Scenario: Carlos designs a production star schema
Carlos models Precision Manufacturing's production data:
FactProduction (grain: one row per production batch)
- Keys: date_key, factory_key, product_key, machine_key
- Measures: units_produced, units_defective, runtime_minutes, energy_kwh
Dimensions: DimDate, DimFactory (name, region, capacity), DimProduct (name, category, weight), DimMachine (type, installation_date, maintenance_due)
Power BI can now slice defect rates by factory, by product category, and by quarter, without complex joins at query time.
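The kind of rollup this schema enables can be sketched in plain Python (hypothetical sample rows, not Carlos's real data): sum the measures per dimension attribute, then derive the defect rate.

```python
# Hypothetical FactProduction rows: foreign keys plus additive measures.
fact_production = [
    {"factory_key": 1, "units_produced": 1000, "units_defective": 20},
    {"factory_key": 1, "units_produced": 500,  "units_defective": 5},
    {"factory_key": 2, "units_produced": 800,  "units_defective": 40},
]
dim_factory = {
    1: {"name": "Plant A", "region": "North"},
    2: {"name": "Plant B", "region": "South"},
}

# Defect rate by factory: sum both measures per factory, then divide.
totals = {}
for row in fact_production:
    name = dim_factory[row["factory_key"]]["name"]
    produced, defective = totals.get(name, (0, 0))
    totals[name] = (produced + row["units_produced"],
                    defective + row["units_defective"])

defect_rate = {name: round(d / p, 4) for name, (p, d) in totals.items()}
print(defect_rate)  # {'Plant A': 0.0167, 'Plant B': 0.05}
```

Note that the rate is computed from the summed measures, not averaged per row; ratios like defect rate are not additive across batches.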
Slowly Changing Dimensions (SCD)
Dimension data changes over time. A customer moves to a new city. A product changes category. How you handle these changes is called the Slowly Changing Dimension strategy.
| Feature | SCD Type 1 | SCD Type 2 |
|---|---|---|
| What happens on change | Overwrite the old value with the new value | Keep the old row, add a new row with the new value |
| History preserved? | No: only the current value exists | Yes: both old and new values exist with date ranges |
| Extra columns needed | None | effective_date, end_date, is_current flag |
| Table size impact | No growth from changes | Grows with each change (one new row per change) |
| Use when | History doesn't matter (fix a typo, update a phone number) | History matters (customer moved; past sales must be attributed to the old region) |
| Implementation | MERGE with WHEN MATCHED THEN UPDATE | MERGE with UPDATE (expire old row) + INSERT (new row) |
SCD Type 1 with Delta MERGE
-- Overwrite: update existing rows, insert new ones
MERGE INTO DimCustomer AS target
USING staging_customers AS source
  ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET target.city = source.city,
             target.phone = source.phone
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, city, phone)
  VALUES (source.customer_id, source.name, source.city, source.phone)
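The same Type 1 semantics in a small Python sketch, with an in-memory dict standing in for the Delta table (the MERGE above is the real Fabric pattern; this just shows that the old value is lost):

```python
# DimCustomer keyed by customer_id; Type 1 simply overwrites attributes.
dim_customer = {
    "C1": {"name": "Ana", "city": "Leeds", "phone": "111"},
}
staging = [
    {"customer_id": "C1", "name": "Ana", "city": "York", "phone": "111"},  # changed city
    {"customer_id": "C2", "name": "Ben", "city": "Bath", "phone": "222"},  # new customer
]

for src in staging:
    key = src["customer_id"]
    if key in dim_customer:
        # WHEN MATCHED: overwrite in place -- the old city is gone for good
        dim_customer[key]["city"] = src["city"]
        dim_customer[key]["phone"] = src["phone"]
    else:
        # WHEN NOT MATCHED: insert the new customer
        dim_customer[key] = {"name": src["name"], "city": src["city"],
                             "phone": src["phone"]}

print(dim_customer["C1"]["city"])  # York -- "Leeds" no longer exists anywhere
```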
SCD Type 2 with Delta MERGE
-- Step 1: Expire current rows that have changes; insert brand-new customers
MERGE INTO DimCustomer AS target
USING staging_customers AS source
  ON target.customer_id = source.customer_id
 AND target.is_current = true
WHEN MATCHED AND (target.city != source.city OR target.phone != source.phone) THEN
  UPDATE SET target.is_current = false,
             target.end_date = current_date()
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, city, phone, effective_date, end_date, is_current)
  VALUES (source.customer_id, source.name, source.city, source.phone, current_date(), NULL, true)

-- Step 2: Insert new current rows for the customers expired in step 1
INSERT INTO DimCustomer (customer_id, name, city, phone, effective_date, end_date, is_current)
SELECT customer_id, name, city, phone, current_date(), NULL, true
FROM staging_customers s
WHERE EXISTS (
  SELECT 1 FROM DimCustomer d
  WHERE d.customer_id = s.customer_id
    AND d.is_current = false
    AND d.end_date = current_date()
)
AND NOT EXISTS (
  SELECT 1 FROM DimCustomer d2
  WHERE d2.customer_id = s.customer_id
    AND d2.is_current = true
)
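The expire-and-insert pattern can also be traced in a Python sketch (a list of rows with is_current flags instead of a Delta table; a fixed date keeps the example deterministic):

```python
from datetime import date

today = date(2024, 6, 1)  # stand-in for current_date()
dim_customer = [
    {"customer_id": "C1", "city": "Leeds", "effective_date": date(2020, 1, 1),
     "end_date": None, "is_current": True},
]
staging = [{"customer_id": "C1", "city": "York"}]  # the customer moved

for src in staging:
    current = next((r for r in dim_customer
                    if r["customer_id"] == src["customer_id"] and r["is_current"]),
                   None)
    if current and current["city"] != src["city"]:
        # Step 1: expire the old row -- keep it so history survives
        current["is_current"] = False
        current["end_date"] = today
        # Step 2: insert the new current row with an open-ended date range
        dim_customer.append({"customer_id": src["customer_id"], "city": src["city"],
                             "effective_date": today, "end_date": None,
                             "is_current": True})

# Two rows now exist: the expired Leeds row and the current York row,
# so past facts can still join to the city that was valid at the time.
```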
Exam tip: Choosing SCD type
The exam often describes a scenario and asks which SCD type to use:
- "A customer changes their email address; we don't need to track old emails" → Type 1 (overwrite)
- "A customer moves to a new city; past sales should be attributed to the old city" → Type 2 (preserve history)
- "Fix a misspelled product name" → Type 1 (correction, not a meaningful change)
- "A product changes category; reports should show which category it was in at the time of each sale" → Type 2
The deciding question: "Does the old value need to be associated with historical facts?" If yes → Type 2.
Denormalization
In operational databases, data is normalised (split across many tables to avoid duplication). For analytics, you denormalise: flatten joins into fewer, wider tables for faster queries.
| Normalised (source) | Denormalised (lakehouse) |
|---|---|
| Orders → OrderLines → Products → Categories | FactSales with product_name, category_name already joined |
| Employees → Departments → Regions | DimEmployee with department_name, region_name included |
Denormalization happens during transformation (PySpark joins or SQL views): you take normalised source data and produce star schema tables.
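The flattening itself is just a chain of key lookups performed once at transform time. A minimal Python sketch of the Orders → Products → Categories case (toy data, not the real source system):

```python
# Normalised source: three small tables linked by keys.
orders = [{"order_id": 1, "product_id": 10, "quantity": 2}]
products = {10: {"name": "Widget", "category_id": 100}}
categories = {100: {"name": "Hardware"}}

# Denormalise: walk the keys once, emit one wide fact row per order.
fact_sales = []
for o in orders:
    p = products[o["product_id"]]
    fact_sales.append({
        "order_id": o["order_id"],
        "quantity": o["quantity"],
        "product_name": p["name"],  # joined in now, no lookup at query time
        "category_name": categories[p["category_id"]]["name"],
    })

print(fact_sales[0]["category_name"])  # Hardware
```

In Fabric the same step would be a PySpark join or a SQL view; the point is that the joins are paid once during transformation rather than on every report query.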
A product in Precision Manufacturing changes from the 'Standard' category to 'Premium'. Historical reports must show the product in the 'Standard' category for past production data and 'Premium' for current data. Which SCD type should Carlos use?
An analyst asks Carlos: 'Why is the DimCustomer table in the lakehouse wider (more columns) than the Customer table in the source SAP system?' What is the most accurate answer?
🎬 Video coming soon
Next up: Data Stores & Tools: Make the Right Choice. Decide between lakehouses, warehouses, and KQL databases, and pick the best transformation tool.