Analytical Data Stores: Data Lakes, Warehouses & Lakehouses…

Where does analytical data live?

Simple explanation

Think of three ways to organise a library.

A data lake is a giant storage room where every book, magazine, photo, and scrap of paper goes in — completely unsorted, but nothing is thrown away. A data warehouse is a curated bookshop — every book is catalogued, shelved by topic, and easy to find. A lakehouse combines both ideas — it keeps everything in the storage room, but adds a catalogue system so you can search and browse it like a bookshop.

The key choice: do you want to organise data before storing it (warehouse), keep it raw and organise later (lake), or do both at once (lakehouse)?

Data lakes

A data lake stores raw data in its original format — structured, semi-structured, and unstructured — with no upfront schema.

How it works:

Data arrives as-is (CSV, JSON, Parquet, images, logs)
Stored in a distributed file system (Azure Data Lake Storage Gen2)
Schema is applied at read time (“schema-on-read”), not at write time
Processing engines (Spark, Fabric) query the data directly

Strengths: Cheap storage, any data type, preserves raw data, flexible Weaknesses: No built-in query optimisation, can become a “data swamp” without governance

In Azure: Azure Data Lake Storage Gen2 (ADLS Gen2) — Blob Storage with hierarchical namespace enabled.

Data warehouses

A data warehouse stores clean, structured, historical data optimised for analytical queries.

How it works:

Data is transformed and validated BEFORE loading (schema-on-write)
Organised in a star schema or snowflake schema (fact tables + dimension tables)
Optimised for SQL aggregate queries (SUM, COUNT, AVG across millions of rows)
Serves as the “single source of truth” for business reporting

Strengths: Fast queries, strong data quality, familiar SQL interface Weaknesses: Rigid schema (changes require migration), only structured data, higher cost

In Azure: Fabric Data Warehouse, or dedicated SQL pools.

Lakehouses

A lakehouse combines the flexibility of a data lake with the query performance of a data warehouse.

How it works:

Raw data stored in open formats (Parquet, Delta Lake) in a data lake
A metadata/transaction layer (Delta Lake format) adds ACID transactions, versioning, and schema enforcement
SQL query engine sits on top for warehouse-style queries
Same storage serves both data engineering (raw processing) and data analytics (SQL queries)

Strengths: Best of both worlds — raw data preservation plus fast structured queries Weaknesses: Newer technology, requires understanding of both paradigms

In Azure: Microsoft Fabric Lakehouse (built on OneLake + Delta Lake format).

Data lake vs data warehouse vs lakehouse
Feature	Data Lake	Data Warehouse	Lakehouse
Data format	Any (raw files)	Structured (tables)	Both (raw + structured)
Schema	On read	On write	Flexible (supports both)
Query engine	Spark, external tools	Built-in SQL engine	SQL + Spark
Data quality	Varies (raw data)	High (pre-validated)	Managed by Delta Lake
Best for	Raw data storage, data science	Business reporting, SQL analytics	Modern analytics (all-in-one)
Azure service	ADLS Gen2	Fabric Warehouse	Fabric Lakehouse

Star schema — the warehouse pattern

Data warehouses organise data in a star schema:

Fact table (centre): Stores measurable events — sales transactions, deliveries, page views. Each row has numeric measures (revenue, quantity) and foreign keys to dimension tables.
Dimension tables (points of the star): Stores descriptive context — dates, products, stores, customers.

Priya’s star schema at FreshMart:

Fact: Sales (date_key, store_key, product_key, quantity, revenue)
Dimensions: Date, Store, Product, Customer

Why star? Because queries join the fact table (centre) to dimension tables (points), forming a star shape. It’s optimised for “total sales by store by month” queries.

Exam tip: choosing the analytical store

“Store raw data from many sources, any format” → Data lake
“Structured data for SQL reporting and dashboards” → Data warehouse
“Need both raw storage and SQL queries in one platform” → Lakehouse
“Star schema with fact and dimension tables” → Data warehouse (or lakehouse)
“Schema-on-read” → Data lake
“Schema-on-write” → Data warehouse

Flashcards

Question

What is a data lake?

Click or press Enter to reveal answer

Answer

A storage repository that holds raw data in its original format (any file type). Uses schema-on-read — structure is applied when data is queried, not when it's stored. In Azure: ADLS Gen2.

Click to flip back

Question

What is a data warehouse?

Click or press Enter to reveal answer

Answer

A structured, optimised store for historical analytical data. Uses schema-on-write (data must conform to a defined structure before loading). Organised in star/snowflake schemas for fast SQL queries.

Click to flip back

Question

What is a lakehouse?

Click or press Enter to reveal answer

Answer

A hybrid combining data lake storage (raw files, any format) with data warehouse capabilities (SQL queries, ACID transactions). Uses Delta Lake format for structure. Microsoft Fabric uses this approach.

Click to flip back

Knowledge check

Knowledge Check

FreshMart receives data from 50 store POS systems (structured), supplier JSON feeds (semi-structured), and customer feedback emails (unstructured). They want to store all of it in one place before processing. What should they use?

Knowledge Check

Priya needs fast SQL queries over cleaned, validated sales data organised in a star schema with fact and dimension tables. Which analytical store pattern is this?

Next up: Microsoft Fabric & Azure Databricks — the cloud platforms that bring analytics to life.