Analytical Data Stores: Data Lakes, Warehouses & Lakehouses
Data lakes store everything raw. Data warehouses structure it for queries. Lakehouses combine both. Learn the three approaches to storing analytical data.
Where does analytical data live?
Think of three ways to organise a library.
A data lake is a giant storage room where every book, magazine, photo, and scrap of paper goes in — completely unsorted, but nothing is thrown away. A data warehouse is a curated bookshop — every book is catalogued, shelved by topic, and easy to find. A lakehouse combines both ideas — it keeps everything in the storage room, but adds a catalogue system so you can search and browse it like a bookshop.
The key choice: do you want to organise data before storing it (warehouse), keep it raw and organise later (lake), or do both at once (lakehouse)?
Data lakes
A data lake stores raw data in its original format — structured, semi-structured, and unstructured — with no upfront schema.
How it works:
- Data arrives as-is (CSV, JSON, Parquet, images, logs)
- Stored in a distributed file system (Azure Data Lake Storage Gen2)
- Schema is applied at read time (“schema-on-read”), not at write time
- Processing engines (Spark, Fabric) query the data directly
Strengths: Cheap storage, any data type, preserves raw data, flexible Weaknesses: No built-in query optimisation, can become a “data swamp” without governance
In Azure: Azure Data Lake Storage Gen2 (ADLS Gen2) — Blob Storage with hierarchical namespace enabled.
Data warehouses
A data warehouse stores clean, structured, historical data optimised for analytical queries.
How it works:
- Data is transformed and validated BEFORE loading (schema-on-write)
- Organised in a star schema or snowflake schema (fact tables + dimension tables)
- Optimised for SQL aggregate queries (SUM, COUNT, AVG across millions of rows)
- Serves as the “single source of truth” for business reporting
Strengths: Fast queries, strong data quality, familiar SQL interface Weaknesses: Rigid schema (changes require migration), only structured data, higher cost
In Azure: Fabric Data Warehouse, or dedicated SQL pools.
Lakehouses
A lakehouse combines the flexibility of a data lake with the query performance of a data warehouse.
How it works:
- Raw data stored in open formats (Parquet, Delta Lake) in a data lake
- A metadata/transaction layer (Delta Lake format) adds ACID transactions, versioning, and schema enforcement
- SQL query engine sits on top for warehouse-style queries
- Same storage serves both data engineering (raw processing) and data analytics (SQL queries)
Strengths: Best of both worlds — raw data preservation plus fast structured queries Weaknesses: Newer technology, requires understanding of both paradigms
In Azure: Microsoft Fabric Lakehouse (built on OneLake + Delta Lake format).
| Feature | Data Lake | Data Warehouse | Lakehouse |
|---|---|---|---|
| Data format | Any (raw files) | Structured (tables) | Both (raw + structured) |
| Schema | On read | On write | Flexible (supports both) |
| Query engine | Spark, external tools | Built-in SQL engine | SQL + Spark |
| Data quality | Varies (raw data) | High (pre-validated) | Managed by Delta Lake |
| Best for | Raw data storage, data science | Business reporting, SQL analytics | Modern analytics (all-in-one) |
| Azure service | ADLS Gen2 | Fabric Warehouse | Fabric Lakehouse |
Star schema — the warehouse pattern
Data warehouses organise data in a star schema:
- Fact table (centre): Stores measurable events — sales transactions, deliveries, page views. Each row has numeric measures (revenue, quantity) and foreign keys to dimension tables.
- Dimension tables (points of the star): Stores descriptive context — dates, products, stores, customers.
Priya’s star schema at FreshMart:
- Fact: Sales (date_key, store_key, product_key, quantity, revenue)
- Dimensions: Date, Store, Product, Customer
Why star? Because queries join the fact table (centre) to dimension tables (points), forming a star shape. It’s optimised for “total sales by store by month” queries.
Exam tip: choosing the analytical store
- “Store raw data from many sources, any format” → Data lake
- “Structured data for SQL reporting and dashboards” → Data warehouse
- “Need both raw storage and SQL queries in one platform” → Lakehouse
- “Star schema with fact and dimension tables” → Data warehouse (or lakehouse)
- “Schema-on-read” → Data lake
- “Schema-on-write” → Data warehouse
Flashcards
Knowledge check
FreshMart receives data from 50 store POS systems (structured), supplier JSON feeds (semi-structured), and customer feedback emails (unstructured). They want to store all of it in one place before processing. What should they use?
Priya needs fast SQL queries over cleaned, validated sales data organised in a star schema with fact and dimension tables. Which analytical store pattern is this?
🎬 Video coming soon
Next up: Microsoft Fabric & Azure Databricks — the cloud platforms that bring analytics to life.