Lakehouses: Your Data Foundation
The lakehouse combines the flexibility of a data lake with the structure of a warehouse. Learn how Delta tables, SQL endpoints, and Spark notebooks work together.
What is a lakehouse?
Think of a lakehouse as a smart filing cabinet.
A traditional data lake is like dumping all your papers into a big box: flexible and cheap, but impossible to search. A data warehouse is like a perfectly organised filing cabinet: structured and fast to query, but rigid and expensive to change.
A lakehouse is a filing cabinet that organises itself. You can dump raw files in (like a lake), and the system automatically catalogues them into structured tables (like a warehouse). You get the best of both worlds: flexible storage with structured querying.
In Fabric, the lakehouse stores your data as Delta tables in OneLake. You can query them with both Spark (for big transformations) and SQL (for quick analysis).
Why lakehouses matter in Fabric
The lakehouse is the most common starting point for DP-600 scenarios. When data enters Fabric, it almost always lands in a lakehouse first.
Two endpoints, one storage
Every lakehouse gives you two ways to access the same data:
| Endpoint | Engine | Read/Write | Best For |
|---|---|---|---|
| Spark endpoint | Apache Spark | Read + Write | Heavy transformations, Python/PySpark notebooks, Spark jobs |
| SQL analytics endpoint | Fabric SQL | Read only | Quick SQL queries, Power BI connections, ad-hoc analysis |
The SQL analytics endpoint is generated automatically; you do not need to create it. When you add a Delta table to your lakehouse, a corresponding table appears in the SQL analytics endpoint within minutes.
Scenario: Anita's daily sales pipeline
Anita at FreshCart receives daily CSV files from 2,000 stores, about 15 million rows per day. She loads them into a lakehouse using a Fabric pipeline, transforms them with a PySpark notebook (cleaning nulls, standardising product codes), and writes the results as Delta tables.
Her Power BI report connects to the SQL analytics endpoint; it does not need Spark to run. The finance team queries the same data using SQL. Both read the same Delta files in OneLake.
Delta tables: The heart of the lakehouse
Delta tables are not just Parquet files. They add critical capabilities:
ACID transactions
Every write to a Delta table is atomic: either the entire write succeeds or nothing changes. No more half-loaded tables after a pipeline failure.
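Under the hood, Delta commits a write by adding one entry to its transaction log in a single step. A rough plain-Python analogy (a toy sketch, not the real Delta protocol) is staging the data to a temporary file and then renaming it over the target, so readers see either the old file or the new one, never a half-written state:

```python
import json
import os
import tempfile

def atomic_write(path, rows):
    """Write rows to path all-or-nothing: stage to a temporary file in
    the same directory, then atomically rename it over the target.
    If anything fails before the rename, the original file is untouched,
    a rough analogy for how Delta commits a write via its log."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(rows, f)
        os.replace(tmp_path, path)  # atomic rename on POSIX and Windows
    except BaseException:
        os.remove(tmp_path)  # clean up the staged file; target unchanged
        raise
```

The same pattern is why a failed pipeline run leaves the previous table version intact instead of a half-loaded one.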
Schema enforcement
Delta tables reject writes that do not match the table's schema. If your pipeline tries to insert a string into an integer column, the write fails, protecting data quality.
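A minimal sketch of the idea in plain Python (a toy validator with a made-up schema, not the actual Delta engine):

```python
# Toy schema: column name -> expected Python type (illustrative only).
SCHEMA = {"product_code": str, "quantity": int}

def enforce_schema(row, schema=SCHEMA):
    """Reject a row whose values do not match the declared column types,
    mirroring how Delta fails a write that violates the table schema."""
    for column, expected in schema.items():
        if not isinstance(row[column], expected):
            raise TypeError(
                f"Column '{column}' expects {expected.__name__}, "
                f"got {type(row[column]).__name__}"
            )
    return row
```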
Schema evolution
You can add new columns to a Delta table without rewriting existing data. Old rows get null for the new column; new rows include the value.
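The same behaviour can be sketched in plain Python: project each stored row onto the current column list, and rows written before a column existed come back as None (a toy model of the read path, not real Delta file handling):

```python
def read_row(columns, row):
    """Project a stored row onto the current column list. Columns added
    after the row was written come back as None, the same way Delta
    serves null for evolved columns without rewriting old files."""
    return {col: row.get(col) for col in columns}

columns = ["store", "amount"]
old_row = {"store": 7, "amount": 19.5}   # written before evolution
columns.append("currency")               # schema evolution: add a column
new_row = {"store": 7, "amount": 19.5, "currency": "GBP"}
```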
Time travel
Delta tables keep a transaction log of every change. You can query a table as it looked at a specific point in time:
```sql
-- What did the sales table look like yesterday?
SELECT * FROM sales TIMESTAMP AS OF '2026-04-20T08:00:00Z'
```
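The lookup rule behind that query can be sketched with a toy transaction log in plain Python (real Delta logs per-commit actions and file references rather than whole snapshots, but the as-of rule, the latest commit at or before the timestamp, is the same; dates and rows below are made up):

```python
from datetime import datetime

# Toy transaction log: each commit records a timestamp and a snapshot.
log = [
    (datetime(2026, 4, 19, 8, 0), [{"store": 1, "amount": 100.0}]),
    (datetime(2026, 4, 20, 8, 0), [{"store": 1, "amount": 100.0},
                                   {"store": 2, "amount": 250.0}]),
]

def as_of(log, ts):
    """Return the table as it looked at timestamp ts: the snapshot of
    the latest commit at or before ts, or None if nothing existed yet."""
    snapshot = None
    for commit_ts, rows in log:
        if commit_ts <= ts:
            snapshot = rows
    return snapshot
```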
Optimisation commands
| Command | What It Does |
|---|---|
| OPTIMIZE | Compacts small files into larger ones for faster reads |
| VACUUM | Removes old files that are no longer referenced by the transaction log |
| Z-ORDER BY | Co-locates related data on disk for faster filtered queries |
Exam tip: OPTIMIZE vs VACUUM
The exam tests whether you know the difference. OPTIMIZE reorganises files for performance: it creates new, larger files but keeps the old ones for time travel. VACUUM deletes old files that are outside the retention period (default 7 days). Running VACUUM too aggressively breaks time travel for the deleted versions.
Common trap: "Which command improves query performance?" The answer is OPTIMIZE, not VACUUM; VACUUM saves storage space.
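A toy sketch of the two behaviours in plain Python (made-up file sizes and a 128 MB compaction target; real OPTIMIZE and VACUUM operate on Parquet files and the transaction log):

```python
def optimize(file_sizes_mb, target_mb=128):
    """OPTIMIZE-style compaction: merge many small files into fewer files
    of roughly target_mb so reads open far fewer files. The old small
    files stay on disk, preserving time travel."""
    compacted, current = [], 0
    for size in file_sizes_mb:
        current += size
        if current >= target_mb:
            compacted.append(current)
            current = 0
    if current:
        compacted.append(current)
    return compacted

def vacuum(file_ages_days, retention_days=7):
    """VACUUM-style cleanup: drop unreferenced files older than the
    retention period. Frees storage but does not speed up queries, and
    removes the history needed for time travel to those versions."""
    return [age for age in file_ages_days if age <= retention_days]
```

For Anita's thousands of 1-5 MB files, compaction is what restores read performance; vacuuming only reclaims space.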
The medallion architecture
The medallion architecture (Bronze → Silver → Gold) is the recommended pattern for organising lakehouse data:
| Layer | Purpose | Data Quality | Example (FreshCart) |
|---|---|---|---|
| Bronze | Raw ingestion: land data exactly as received | Raw, untransformed | Raw CSV from POS systems, supplier feeds |
| Silver | Cleaned and standardised: deduplicated, typed, validated | Cleaned, conformed | Deduplicated transactions, standardised product codes |
| Gold | Business-ready: aggregated, modelled for reporting | Curated, trusted | Daily sales by store, product category summaries |
You can implement all three layers in a single lakehouse (using naming conventions like bronze_sales, silver_sales, gold_sales) or across multiple lakehouses in the same workspace.
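The three layers can be sketched end to end in plain Python (toy rows and made-up table shapes; a real Fabric implementation would use PySpark notebooks writing Delta tables, but the responsibility of each layer is the same):

```python
# Bronze: raw rows exactly as received, duplicates and bad values included.
bronze = [
    {"store": "S1", "date": "2026-04-20", "amount": 10.0},
    {"store": "S1", "date": "2026-04-20", "amount": 10.0},  # duplicate
    {"store": "S2", "date": "2026-04-20", "amount": None},  # bad row
    {"store": "S2", "date": "2026-04-20", "amount": 25.0},
]

def to_silver(rows):
    """Silver: deduplicate and drop rows that fail validation."""
    seen, clean = set(), []
    for r in rows:
        key = (r["store"], r["date"], r["amount"])
        if r["amount"] is not None and key not in seen:
            seen.add(key)
            clean.append(r)
    return clean

def to_gold(rows):
    """Gold: aggregate business-ready daily sales per store and date."""
    totals = {}
    for r in rows:
        k = (r["store"], r["date"])
        totals[k] = totals.get(k, 0.0) + r["amount"]
    return totals
```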
Scenario: Anita's medallion layers
Anita's FreshCart lakehouse uses three schemas:
- Bronze: raw_pos_transactions (15M rows/day, raw CSV loaded as-is)
- Silver: clean_transactions (deduplicated, nulls handled, product codes standardised, dates parsed)
- Gold: daily_store_sales (aggregated by store and date), product_performance (aggregated by product and week)
Her Power BI semantic model connects to Gold tables only. Analysts who need raw data can query Silver. Nobody touches Bronze directly; it exists for auditability and reprocessing.
Lakehouse vs the SQL analytics endpoint
A common point of confusion: the lakehouse's SQL analytics endpoint is NOT the same as a Fabric Warehouse.
| Feature | Lakehouse SQL Endpoint | Fabric Warehouse |
|---|---|---|
| Write access | Read-only: you cannot INSERT/UPDATE via SQL | Full read-write: INSERT, UPDATE, DELETE, MERGE |
| How tables are created | Auto-generated from Delta tables in the lakehouse | Created via SQL DDL (CREATE TABLE) |
| Primary use case | Quick SQL queries, Power BI connections | Complex SQL workloads, stored procedures, full DML |
| Engine | Fabric SQL engine (read-only mode) | Fabric SQL engine (full mode) |
| Spark access | Yes: the underlying Delta tables are fully accessible via Spark | Limited: warehouse tables are stored as Delta in OneLake and can be read by Spark, but writes go through T-SQL |
| Best for | Data engineers who transform with Spark, then serve via SQL | SQL-first teams who prefer writing SQL for everything |
Anita at FreshCart loads 15 million rows of daily POS data into a lakehouse using a PySpark notebook. Her finance team wants to query this data using SQL. What should she tell them?
After running daily ingestion for a month, Anita notices her lakehouse queries are slowing down. Investigation shows thousands of small Parquet files (1-5 MB each) in her Delta tables. Which command should she run first?
Dr. Sarah at Pacific Health Network discovers that a bug in her pipeline corrupted yesterday's patient data load. She needs to restore the Delta table to its state before the bad load. Which Delta feature enables this?
Next up: Warehouses in Fabric, for when SQL-first teams need full read-write access with T-SQL stored procedures.