Lakehouses: Your Data Foundation
The lakehouse combines the flexibility of a data lake with the structure of a warehouse. Learn how Delta tables, SQL endpoints, and Spark notebooks work together.
What is a lakehouse?
Think of a lakehouse as a smart filing cabinet.
A traditional data lake is like dumping all your papers into a big box: flexible and cheap, but impossible to search. A data warehouse is like a perfectly organised filing cabinet: structured and fast to query, but rigid and expensive to change.
A lakehouse is a filing cabinet that organises itself. You can dump raw files in (like a lake), and the system automatically catalogues them into structured tables (like a warehouse). You get the best of both worlds: flexible storage with structured querying.
In Fabric, the lakehouse stores your data as Delta tables in OneLake. You can query them with both Spark (for big transformations) and SQL (for quick analysis).
Why lakehouses matter in Fabric
The lakehouse is the most common starting point for DP-600 scenarios. When data enters Fabric, it almost always lands in a lakehouse first.
Two endpoints, one storage
Every lakehouse gives you two ways to access the same data:
| Endpoint | Engine | Read/Write | Best For |
|---|---|---|---|
| Spark endpoint | Apache Spark | Read + Write | Heavy transformations, Python/PySpark notebooks, Spark jobs |
| SQL analytics endpoint | Fabric SQL | Read only | Quick SQL queries, Power BI connections, ad-hoc analysis |
The SQL analytics endpoint is generated automatically; you do not need to create it. When you add a Delta table to your lakehouse, a corresponding table appears in the SQL analytics endpoint within minutes.
Scenario: Anita's daily sales pipeline
Anita at FreshCart receives daily CSV files from 2,000 stores, about 15 million rows per day. She loads them into a lakehouse using a Fabric pipeline, transforms them with a PySpark notebook (cleaning nulls, standardising product codes), and writes the results as Delta tables.
Her Power BI report connects to the SQL analytics endpoint; it does not need Spark to run. The finance team queries the same data using SQL. Both read the same Delta files in OneLake.
Delta tables: The heart of the lakehouse
Delta tables are not just Parquet files. They add critical capabilities:
ACID transactions
Every write to a Delta table is atomic: either the entire write succeeds or nothing changes. No more half-loaded tables after a pipeline failure.
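Under the hood, Delta commits a write by adding one entry to its transaction log in a single step. A rough plain-Python analogy (a toy sketch, not the real Delta protocol) is staging the data to a temporary file and then renaming it over the target, so readers see either the old file or the new one, never a half-written state:

```python
import json
import os
import tempfile

def atomic_write(path, rows):
    """Write rows to path all-or-nothing: stage to a temporary file in
    the same directory, then atomically rename it over the target.
    If anything fails before the rename, the original file is untouched,
    a rough analogy for how Delta commits a write via its log."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(rows, f)
        os.replace(tmp_path, path)  # atomic rename on POSIX and Windows
    except BaseException:
        os.remove(tmp_path)  # clean up the staged file; target unchanged
        raise
```

The same pattern is why a failed pipeline run leaves the previous table version intact instead of a half-loaded one.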
Schema enforcement
Delta tables reject writes that do not match the table's schema. If your pipeline tries to insert a string into an integer column, the write fails, protecting data quality.
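A minimal sketch of the idea in plain Python (a toy validator with a made-up schema, not the actual Delta engine):

```python
# Toy schema: column name -> expected Python type (illustrative only).
SCHEMA = {"product_code": str, "quantity": int}

def enforce_schema(row, schema=SCHEMA):
    """Reject a row whose values do not match the declared column types,
    mirroring how Delta fails a write that violates the table schema."""
    for column, expected in schema.items():
        if not isinstance(row[column], expected):
            raise TypeError(
                f"Column '{column}' expects {expected.__name__}, "
                f"got {type(row[column]).__name__}"
            )
    return row
```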
Schema evolution
You can add new columns to a Delta table without rewriting existing data. Old rows get null for the new column; new rows include the value.
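The same behaviour can be sketched in plain Python: project each stored row onto the current column list, and rows written before a column existed come back as None (a toy model of the read path, not real Delta file handling):

```python
def read_row(columns, row):
    """Project a stored row onto the current column list. Columns added
    after the row was written come back as None, the same way Delta
    serves null for evolved columns without rewriting old files."""
    return {col: row.get(col) for col in columns}

columns = ["store", "amount"]
old_row = {"store": 7, "amount": 19.5}   # written before evolution
columns.append("currency")               # schema evolution: add a column
new_row = {"store": 7, "amount": 19.5, "currency": "GBP"}
```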
Time travel
Delta tables keep a transaction log of every change. You can query a table as it looked at a specific point in time:
```sql
-- What did the sales table look like yesterday?
SELECT * FROM sales TIMESTAMP AS OF '2026-04-20T08:00:00Z'
```
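The lookup rule behind that query can be sketched with a toy transaction log in plain Python (real Delta logs per-commit actions and file references rather than whole snapshots, but the as-of rule, the latest commit at or before the timestamp, is the same; dates and rows below are made up):

```python
from datetime import datetime

# Toy transaction log: each commit records a timestamp and a snapshot.
log = [
    (datetime(2026, 4, 19, 8, 0), [{"store": 1, "amount": 100.0}]),
    (datetime(2026, 4, 20, 8, 0), [{"store": 1, "amount": 100.0},
                                   {"store": 2, "amount": 250.0}]),
]

def as_of(log, ts):
    """Return the table as it looked at timestamp ts: the snapshot of
    the latest commit at or before ts, or None if nothing existed yet."""
    snapshot = None
    for commit_ts, rows in log:
        if commit_ts <= ts:
            snapshot = rows
    return snapshot
```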
Optimisation commands
| Command | What It Does |
|---|---|
| OPTIMIZE | Compacts small files into larger ones for faster reads |
| VACUUM | Removes old files that are no longer referenced by the transaction log |
| Z-ORDER BY | Co-locates related data on disk for faster filtered queries |
Exam tip: OPTIMIZE vs VACUUM
The exam tests whether you know the difference. OPTIMIZE reorganises files for performance: it creates new, larger files but keeps the old ones for time travel. VACUUM deletes old files that are outside the retention period (default 7 days). Running VACUUM too aggressively breaks time travel for the deleted versions.
Common trap: "Which command improves query performance?" The answer is OPTIMIZE, not VACUUM; VACUUM saves storage space.
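A toy sketch of the two behaviours in plain Python (made-up file sizes and a 128 MB compaction target; real OPTIMIZE and VACUUM operate on Parquet files and the transaction log):

```python
def optimize(file_sizes_mb, target_mb=128):
    """OPTIMIZE-style compaction: merge many small files into fewer files
    of roughly target_mb so reads open far fewer files. The old small
    files stay on disk, preserving time travel."""
    compacted, current = [], 0
    for size in file_sizes_mb:
        current += size
        if current >= target_mb:
            compacted.append(current)
            current = 0
    if current:
        compacted.append(current)
    return compacted

def vacuum(file_ages_days, retention_days=7):
    """VACUUM-style cleanup: drop unreferenced files older than the
    retention period. Frees storage but does not speed up queries, and
    removes the history needed for time travel to those versions."""
    return [age for age in file_ages_days if age <= retention_days]
```

For Anita's thousands of 1-5 MB files, compaction is what restores read performance; vacuuming only reclaims space.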
The medallion architecture
The medallion architecture (Bronze → Silver → Gold) is the recommended pattern for organising lakehouse data:
| Layer | Purpose | Data Quality | Example (FreshCart) |
|---|---|---|---|
| Bronze | Raw ingestion: land data exactly as received | Raw, untransformed | Raw CSV from POS systems, supplier feeds |
| Silver | Cleaned and standardised: deduplicated, typed, validated | Cleaned, conformed | Deduplicated transactions, standardised product codes |
| Gold | Business-ready: aggregated, modelled for reporting | Curated, trusted | Daily sales by store, product category summaries |
You can implement all three layers in a single lakehouse (using naming conventions like bronze_sales, silver_sales, gold_sales) or across multiple lakehouses in the same workspace.
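The three layers can be sketched end to end in plain Python (toy rows and made-up table shapes; a real Fabric implementation would use PySpark notebooks writing Delta tables, but the responsibility of each layer is the same):

```python
# Bronze: raw rows exactly as received, duplicates and bad values included.
bronze = [
    {"store": "S1", "date": "2026-04-20", "amount": 10.0},
    {"store": "S1", "date": "2026-04-20", "amount": 10.0},  # duplicate
    {"store": "S2", "date": "2026-04-20", "amount": None},  # bad row
    {"store": "S2", "date": "2026-04-20", "amount": 25.0},
]

def to_silver(rows):
    """Silver: deduplicate and drop rows that fail validation."""
    seen, clean = set(), []
    for r in rows:
        key = (r["store"], r["date"], r["amount"])
        if r["amount"] is not None and key not in seen:
            seen.add(key)
            clean.append(r)
    return clean

def to_gold(rows):
    """Gold: aggregate business-ready daily sales per store and date."""
    totals = {}
    for r in rows:
        k = (r["store"], r["date"])
        totals[k] = totals.get(k, 0.0) + r["amount"]
    return totals
```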
Scenario: Anita's medallion layers
Anita's FreshCart lakehouse uses three schemas:
- Bronze: raw_pos_transactions (15M rows/day, raw CSV loaded as-is)
- Silver: clean_transactions (deduplicated, nulls handled, product codes standardised, dates parsed)
- Gold: daily_store_sales (aggregated by store and date), product_performance (aggregated by product and week)
Her Power BI semantic model connects to Gold tables only. Analysts who need raw data can query Silver. Nobody touches Bronze directly; it exists for auditability and reprocessing.
Lakehouse vs the SQL analytics endpoint
A common point of confusion: the lakehouse's SQL analytics endpoint is NOT the same as a Fabric Warehouse.
| Feature | Lakehouse SQL Endpoint | Fabric Warehouse |
|---|---|---|
| Write access | Read-only: you cannot INSERT/UPDATE via SQL | Full read-write: INSERT, UPDATE, DELETE, MERGE |
| How tables are created | Auto-generated from Delta tables in the lakehouse | Created via SQL DDL (CREATE TABLE) |
| Primary use case | Quick SQL queries, Power BI connections | Complex SQL workloads, stored procedures, full DML |
| Engine | Fabric SQL engine (read-only mode) | Fabric SQL engine (full mode) |
| Spark access | Yes: the underlying Delta tables are fully accessible via Spark | Limited: warehouse tables are stored as Delta in OneLake and can be read by Spark, but writes go through T-SQL |
| Best for | Data engineers who transform with Spark, then serve via SQL | SQL-first teams who prefer writing SQL for everything |
Anita at FreshCart loads 15 million rows of daily POS data into a lakehouse using a PySpark notebook. Her finance team wants to query this data using SQL. What should she tell them?
After running daily ingestion for a month, Anita notices her lakehouse queries are slowing down. Investigation shows thousands of small Parquet files (1-5 MB each) in her Delta tables. Which command should she run first?
Dr. Sarah at Pacific Health Network discovers that a bug in her pipeline corrupted yesterday's patient data load. She needs to restore the Delta table to its state before the bad load. Which Delta feature enables this?
Next up: Warehouses in Fabric, for when SQL-first teams need full read-write access with T-SQL stored procedures.