Star Schema Design
The star schema is the most important data modelling pattern in DP-600: fact tables, dimension tables, grain, and denormalisation, with a worked retail example.
What is a star schema?
Think of a receipt and a set of reference cards.
When you buy groceries, your receipt lists every item: product code, quantity, price, store, date, time. That receipt is a fact table – it records what happened (the transaction).
But the receipt alone is not very useful for analysis. To answer "which category sold the most last month?", you need a product card (name, category, brand), a store card (location, region, size), and a date card (month, quarter, year). These are dimension tables.
A star schema puts the fact table in the centre and surrounds it with dimension tables – like a star. It is the foundation of fast analytics and Power BI performance.
Fact tables vs dimension tables
| Feature | Fact Table | Dimension Table |
|---|---|---|
| What it records | Business events – transactions, measurements, metrics | Descriptive attributes – who, what, where, when |
| Rows | Many (millions to billions) | Few (hundreds to millions) |
| Columns | Keys (foreign keys to dimensions) + measures (numbers) | Attributes (text, categories, hierarchies) |
| Changes | Append-mostly – new events added over time | Slowly changing – attributes updated occasionally |
| Examples | Sales transactions, website clicks, sensor readings | Products, customers, stores, dates, employees |
| Grain | The level of detail (one row per transaction, per day, per hour) | One row per entity (one row per product, per customer) |
A worked example: FreshCartβs star schema
Anita at FreshCart designs a star schema for sales analytics. Here is the complete model:
Fact table: fact_sales
| Column | Type | Description |
|---|---|---|
| sale_id | INT | Unique transaction line ID |
| date_key | INT | FK → dim_date |
| store_key | INT | FK → dim_store |
| product_key | INT | FK → dim_product |
| customer_key | INT | FK → dim_customer |
| quantity | INT | Number of items sold (measure) |
| unit_price | DECIMAL | Price per item (measure) |
| total_amount | DECIMAL | quantity × unit_price (measure) |
| discount_amount | DECIMAL | Discount applied (measure) |
Grain: One row per product per transaction (the most detailed level).
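The columns above can be sketched as DDL. This is an illustrative sketch only: the DECIMAL precisions are assumptions, and foreign-key syntax varies by engine (Fabric warehouses, for instance, only accept constraints declared as NOT ENFORCED, so keys are shown here as comments):

```sql
-- Illustrative DDL for fact_sales; types and precisions are assumed
CREATE TABLE fact_sales (
    sale_id         INT,            -- unique transaction line ID
    date_key        INT,            -- FK -> dim_date
    store_key       INT,            -- FK -> dim_store
    product_key     INT,            -- FK -> dim_product
    customer_key    INT,            -- FK -> dim_customer
    quantity        INT,            -- measure
    unit_price      DECIMAL(10, 2), -- measure
    total_amount    DECIMAL(12, 2), -- measure: quantity * unit_price
    discount_amount DECIMAL(10, 2)  -- measure
);
```

Note the shape: nothing but keys and measures, which is exactly what keeps the fact table narrow and fast to scan.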
Dimension tables
dim_date – when the sale happened
| Column | Example Values |
|---|---|
| date_key | 20260421 |
| full_date | 2026-04-21 |
| day_name | Tuesday |
| month_name | April |
| quarter | Q2 |
| year | 2026 |
| is_weekend | No |
| is_holiday | No |
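A date dimension is usually generated rather than loaded from a source system. A minimal T-SQL-style sketch, assuming a hypothetical staging table stg_dates(full_date) that holds one row per calendar date (the staging table name is an assumption, not part of FreshCart's model; is_holiday would need a separate holiday lookup):

```sql
-- Derive dim_date attributes from a calendar date spine (stg_dates is assumed)
SELECT
    YEAR(full_date) * 10000
      + MONTH(full_date) * 100
      + DAY(full_date)                         AS date_key,   -- e.g. 20260421
    full_date,
    DATENAME(weekday, full_date)               AS day_name,
    DATENAME(month, full_date)                 AS month_name,
    'Q' + CAST(DATEPART(quarter, full_date) AS VARCHAR(1)) AS quarter,
    YEAR(full_date)                            AS year,
    CASE WHEN DATENAME(weekday, full_date) IN ('Saturday', 'Sunday')
         THEN 'Yes' ELSE 'No' END              AS is_weekend
FROM stg_dates;
```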
dim_store – where the sale happened
| Column | Example Values |
|---|---|
| store_key | 1042 |
| store_name | FreshCart Auckland CBD |
| city | Auckland |
| region | North Island |
| store_type | Metro |
| opening_date | 2019-03-15 |
dim_product – what was sold
| Column | Example Values |
|---|---|
| product_key | 5678 |
| product_name | Organic Avocado |
| category | Fresh Produce |
| subcategory | Fruits |
| brand | FreshGreen |
| unit_of_measure | Each |
dim_customer – who bought it
| Column | Example Values |
|---|---|
| customer_key | 90123 |
| customer_name | Jane Smith |
| loyalty_tier | Gold |
| signup_date | 2022-01-10 |
| city | Wellington |
The star shape
```
dim_date ───────┐
                │
dim_store ─── fact_sales ─── dim_product
                │
dim_customer ───┘
```
Every query follows the same pattern: start from the fact table, join to dimensions for context.
```sql
-- Total sales by region and month
SELECT
    s.region,
    d.month_name,
    SUM(f.total_amount) AS total_revenue
FROM fact_sales f
JOIN dim_store s ON f.store_key = s.store_key
JOIN dim_date d ON f.date_key = d.date_key
WHERE d.year = 2026
GROUP BY s.region, d.month_name
ORDER BY total_revenue DESC
```
Why denormalise?
In transactional databases (OLTP), you normalise data to reduce redundancy – separate tables linked by foreign keys. In analytical databases (OLAP), you denormalise – merge related attributes into fewer, wider tables.
Normalisation vs denormalisation
| Approach | Normalised (OLTP) | Denormalised (OLAP / Star Schema) |
|---|---|---|
| Goal | Eliminate redundancy | Optimise query performance |
| Tables | Many narrow tables with joins | Fewer wide tables with redundant attributes |
| Joins needed | Many (complex queries) | Few (simple queries) |
| Write performance | Fast (each fact stored once) | Slower (redundant data updated in multiple places) |
| Read performance | Slower (many joins) | Faster (fewer joins, columnar storage) |
| Best for | Transaction processing | Reporting and analytics |
In the FreshCart example, dim_store already includes city and region – even though a normalised design would have separate city and region tables. This redundancy makes queries faster because you join fact_sales to dim_store once instead of joining to store, then city, then region.
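To make the contrast concrete, here is the same question written both ways. The store, city, and region tables in the second query are a hypothetical normalised design, not part of FreshCart's actual model:

```sql
-- Star schema: one join gives both city and region
SELECT s.region, SUM(f.total_amount) AS total_revenue
FROM fact_sales f
JOIN dim_store s ON f.store_key = s.store_key
GROUP BY s.region;

-- Hypothetical normalised equivalent: the hierarchy must be walked join by join
SELECT r.region_name, SUM(f.total_amount) AS total_revenue
FROM fact_sales f
JOIN store st ON f.store_key = st.store_key
JOIN city c   ON st.city_id  = c.city_id
JOIN region r ON c.region_id = r.region_id
GROUP BY r.region_name;
```

Each extra join is work the engine repeats across billions of fact rows, which is why the flattened dimension wins for analytics.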
Exam tip: When to denormalise
The exam tests your judgment on denormalisation. General rules:
- Denormalise dimension tables – flatten hierarchies into the dimension (city, region, country all in dim_store)
- Keep fact tables normalised – fact tables should contain only keys and measures
- Create aggregate tables for common high-level queries (daily totals instead of per-transaction)
- Do NOT denormalise when dimension attributes change frequently – updates ripple across all redundant copies
Aggregate tables
For very large fact tables, pre-computing common aggregations dramatically improves performance:
```sql
-- Create an aggregate table: daily sales by store
CREATE TABLE agg_daily_store_sales AS
SELECT
    date_key,
    store_key,
    COUNT(*) AS transaction_count,
    SUM(quantity) AS total_units,
    SUM(total_amount) AS total_revenue,
    SUM(discount_amount) AS total_discount
FROM fact_sales
GROUP BY date_key, store_key
```
| Query Level | Source | Rows (FreshCart) |
|---|---|---|
| Per transaction | fact_sales | 5.4 billion/year |
| Per day per store | agg_daily_store_sales | 730,000/year |
| Per month per region | Custom aggregate | 48/year |
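Once the aggregate exists, the region-by-month query from earlier can be answered by scanning roughly 730,000 rows instead of billions. A sketch reusing the tables above (the aggregate keeps date_key and store_key, so it still joins cleanly to the dimensions):

```sql
-- Same question as before, answered from the aggregate table
SELECT
    s.region,
    d.month_name,
    SUM(a.total_revenue) AS total_revenue
FROM agg_daily_store_sales a
JOIN dim_store s ON a.store_key = s.store_key
JOIN dim_date  d ON a.date_key  = d.date_key
WHERE d.year = 2026
GROUP BY s.region, d.month_name
ORDER BY total_revenue DESC;
```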
Power BI and Direct Lake semantic models can automatically use aggregate tables when available (via user-defined aggregations).
Anita at FreshCart is designing a star schema for sales analytics. She has transaction-level data (one row per item per sale) plus reference data for products, stores, customers, and dates. How should she structure the schema?
Raj at Atlas Capital notices that Power BI dashboards showing 'total portfolio value by region' take 30+ seconds to load. The fact table has 2 billion rows at transaction-level grain. The query aggregates by region and month. What is the best optimisation?
Next up: SQL Objects: Views, Functions & Stored Procedures – programmable SQL in Fabric warehouses and lakehouse SQL endpoints.