Delta Lake: The Heart of Fabric
Understand Delta Lake — the open-source storage layer that powers every lakehouse table in Fabric. ACID transactions, time travel, schema enforcement, and the transaction log explained.
What is Delta Lake?
Think of a notebook with a built-in undo button, a table of contents, and a lock on the cover.
Old data lakes were just folders full of files — like loose papers on a desk. If two people edited the same file, you got corruption. If you needed last Tuesday’s version, tough luck. If someone wrote bad data, it mixed in with the good.
Delta Lake fixes all of this. It’s a storage format that wraps Parquet files with a transaction log. That log acts as the undo button (time travel), the table of contents (which files make up the current version), and the lock (writes are coordinated through the log, so concurrent jobs can’t corrupt the table).
In Microsoft Fabric, every lakehouse table is a Delta table. You don’t choose Delta — it’s the default. Understanding Delta is understanding how your data is stored, versioned, and protected.
The transaction log
The transaction log is what makes Delta Lake different from plain Parquet. It’s a folder called _delta_log/ inside every Delta table, containing JSON files numbered sequentially.
How it works
| Version | Log Entry | Effect |
|---|---|---|
| 0 | 00000000000000000000.json — Add part-0001.parquet, part-0002.parquet | Initial table creation with 2 data files |
| 1 | 00000000000000000001.json — Add part-0003.parquet | New data appended |
| 2 | 00000000000000000002.json — Remove part-0001.parquet, Add part-0004.parquet | Update operation: old file replaced with new |
Every 10 commits, Delta creates a checkpoint file (Parquet format) that summarises the current state — so readers don’t need to replay all JSON logs from the beginning.
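The replay idea can be sketched in a few lines of plain Python. This is a simplified model of the log, not Delta's actual implementation, and the file names are illustrative:

```python
# Simplified model of a Delta transaction log: each commit lists
# add/remove actions, like the JSON entries in _delta_log/.
commits = [
    [{"add": "part-0001.parquet"}, {"add": "part-0002.parquet"}],    # version 0
    [{"add": "part-0003.parquet"}],                                   # version 1
    [{"remove": "part-0001.parquet"}, {"add": "part-0004.parquet"}],  # version 2
]

def current_files(commits, as_of=None):
    """Replay the log up to a version to find the live data files."""
    live = set()
    for version, actions in enumerate(commits):
        if as_of is not None and version > as_of:
            break
        for action in actions:
            if "add" in action:
                live.add(action["add"])
            else:
                live.discard(action["remove"])
    return sorted(live)

print(current_files(commits))           # latest: parts 0002, 0003, 0004
print(current_files(commits, as_of=0))  # time travel: parts 0001, 0002
```

A checkpoint plays the same role as precomputing `live` at some version, so readers start from that snapshot instead of replaying from version 0.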
Scenario: Carlos discovers the transaction log
Carlos at Precision Manufacturing investigates why a lakehouse table is using more storage than expected. He browses the OneLake file structure:
```
Tables/FactProduction/
├── _delta_log/
│   ├── 00000000000000000000.json
│   ├── 00000000000000000001.json
│   ├── ...
│   ├── 00000000000000000040.checkpoint.parquet
│   └── 00000000000000000042.json
├── part-00000-...snappy.parquet
├── part-00001-...snappy.parquet
└── ... (hundreds of Parquet files)
```

He realises that old versions of data files still exist on disk — Delta marks them as “removed” in the log but doesn’t physically delete them (that’s what VACUUM does — covered in the optimisation module).
ACID transactions
Delta Lake provides full ACID guarantees — something traditional data lakes couldn’t offer.
| Property | What It Means | Example |
|---|---|---|
| Atomicity | A write either fully succeeds or fully fails — no partial writes | A notebook crashes mid-write → the table stays in its previous valid state |
| Consistency | Schema enforcement prevents bad data from entering | A CSV with a missing column is rejected before any rows are written |
| Isolation | Concurrent readers see a consistent snapshot while writers are active | A BI report queries the table while an ETL job is writing — the report sees the previous complete version |
| Durability | Once committed, data survives system failures | A power outage after commit → data is safe in OneLake |
Why ACID matters for data engineering
Without ACID, a failed write can leave your table in a half-updated state. Imagine loading 10 million rows — if the job crashes at row 5 million, you have a table with incomplete data. Readers might see inconsistent results. You’d need to manually clean up and retry.
With Delta Lake’s ACID transactions, the partially written files exist on disk but are NOT registered in the transaction log. The table stays at its previous version. When you retry the load, it starts fresh.
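This two-step write path can be modelled as a toy sketch. It captures the commit protocol conceptually; it is not Delta's real code, and the names are made up:

```python
# Toy model of Delta's write path: data files land on disk first, but a
# new table version exists only once its log entry is committed.
log = [{"version": 0, "files": ["part-0001.parquet"]}]  # committed versions
disk = {"part-0001.parquet"}                            # physical storage

def visible_files():
    """Readers resolve the table from the last committed log entry."""
    return set(log[-1]["files"])

def write_batch(new_files, crash_before_commit=False):
    disk.update(new_files)              # step 1: write the Parquet files
    if crash_before_commit:
        return                          # crash: no commit, table unchanged
    log.append({"version": log[-1]["version"] + 1,
                "files": sorted(disk)}) # step 2: commit the log entry

write_batch(["part-0002.parquet"], crash_before_commit=True)
print(visible_files())         # still only part-0001: version 0 intact
print(disk - visible_files())  # part-0002 is an orphan awaiting VACUUM
```

The orphaned file is exactly what Carlos saw in the storage browser: physically present, but invisible to every reader because no committed log entry references it.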
Time travel
Every commit creates a new version. You can query any previous version by number or timestamp.
Query by version
```sql
-- Spark SQL: query the table as of a specific version
SELECT * FROM lakehouse.FactSales VERSION AS OF 5;

-- The path-based form works too
SELECT * FROM delta.`/Tables/FactSales` VERSION AS OF 5;
```
Note: VERSION AS OF is supported in Spark SQL. The lakehouse SQL analytics endpoint does not support time travel queries.
Query by timestamp
```python
# PySpark: query the table as of a specific timestamp
df = spark.read.format("delta") \
    .option("timestampAsOf", "2026-04-20T10:00:00") \
    .load("/Tables/FactSales")
```
Common time travel use cases
| Use Case | How |
|---|---|
| Audit — “What did the table look like last Tuesday?” | Query by timestamp |
| Debug — “What changed in the last load?” | Compare version N vs N-1 |
| Rollback — “The last load was bad, restore the previous version” | RESTORE TABLE FactSales TO VERSION AS OF 41 |
| Reproducibility — “Rerun yesterday’s report with yesterday’s data” | Query by timestamp matching the report date |
Scenario: Anika rolls back a bad load
At ShopStream, Anika’s overnight pipeline loads order data. On Wednesday morning, the analytics team reports that revenue numbers are doubled. Anika investigates:
- She queries the table’s history: `DESCRIBE HISTORY lakehouse.FactOrders` — version 87 was the bad load
- She checks what version 86 looked like: `SELECT COUNT(*) FROM lakehouse.FactOrders VERSION AS OF 86` — correct count
- She restores: `RESTORE TABLE lakehouse.FactOrders TO VERSION AS OF 86`
- Revenue numbers return to normal immediately
Total downtime: 4 minutes. Without Delta time travel, she’d need to rebuild the table from raw files — potentially hours.
Schema enforcement and evolution
Schema enforcement (write-side)
Delta Lake rejects writes that don’t match the existing table schema. If your table has columns (id INT, name STRING, amount DECIMAL) and you try to write data with an extra discount column, the write fails.
This prevents schema drift — the silent corruption where unexpected columns sneak into your data.
Schema evolution (controlled changes)
When you intentionally want to add columns, you enable mergeSchema:
```python
# Append with schema evolution: new columns are merged into the table
df.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("/Tables/FactOrders")
```
Existing rows get NULL for the new column. Future reads include it.
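Both behaviours can be modelled with a toy schema check in plain Python. This is not Delta's implementation; the column names simply echo the example above:

```python
# Toy write-side schema check: mimic enforcement (reject extra columns)
# and evolution (merge extras only when explicitly allowed).
table_schema = {"id", "name", "amount"}

def validate_write(batch_columns, merge_schema=False):
    extra = set(batch_columns) - table_schema
    if extra and not merge_schema:
        raise ValueError(f"Schema mismatch: unexpected columns {sorted(extra)}")
    return table_schema | set(batch_columns)  # evolved (or unchanged) schema

validate_write({"id", "name", "amount"})  # matching write: accepted

try:  # enforcement: the extra 'discount' column is rejected
    validate_write({"id", "name", "amount", "discount"})
except ValueError as err:
    print(err)

# evolution: the same write succeeds once mergeSchema is enabled
print(sorted(validate_write({"id", "name", "amount", "discount"},
                            merge_schema=True)))
```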
| Feature | Schema Enforcement | Schema Evolution |
|---|---|---|
| Default behaviour | ON — rejects mismatched schemas | OFF — must explicitly enable |
| Purpose | Prevent accidental schema changes | Allow intentional schema changes |
| How to override | Cannot disable (by design) | Set mergeSchema=true on the write operation |
| When it applies | Every write operation | Only when mergeSchema is enabled |
| Exam pattern | 'Write fails with schema mismatch' — expected behaviour | 'New column needs to be added to existing table' — use mergeSchema |
Delta MERGE (upsert)
MERGE is one of the most important Delta Lake operations for data engineering. It combines INSERT, UPDATE, and DELETE in a single atomic operation.
```sql
MERGE INTO target_table AS target
USING source_data AS source
ON target.id = source.id
WHEN MATCHED THEN
  UPDATE SET target.amount = source.amount
WHEN NOT MATCHED THEN
  INSERT (id, name, amount) VALUES (source.id, source.name, source.amount)
```
Use cases: incremental data loads, slowly changing dimensions (SCD), deduplication.
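The matched/not-matched branching can be illustrated with a toy in-memory upsert. This is plain Python, not Delta's engine, and the row shapes are invented:

```python
# Toy upsert mirroring MERGE semantics: matched keys are updated,
# unmatched keys are inserted, and the result lands in one step.
target = {
    1: {"name": "Widget", "amount": 10.0},
    2: {"name": "Gadget", "amount": 25.0},
}
source = [
    {"id": 2, "name": "Gadget", "amount": 30.0},  # existing id: update
    {"id": 3, "name": "Gizmo", "amount": 5.0},    # new id: insert
]

def merge_upsert(target, source):
    result = {k: dict(v) for k, v in target.items()}  # copy: all-or-nothing
    for row in source:
        key = row["id"]
        if key in result:                             # WHEN MATCHED THEN UPDATE
            result[key]["amount"] = row["amount"]
        else:                                         # WHEN NOT MATCHED THEN INSERT
            result[key] = {"name": row["name"], "amount": row["amount"]}
    return result

merged = merge_upsert(target, source)
print(merged[2]["amount"])  # 30.0 (updated)
print(merged[3]["name"])    # Gizmo (inserted)
```

The copy-then-swap step mirrors why MERGE is atomic: readers see either the old table or the fully merged one, never a half-applied mixture.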
Exam tip: MERGE vs INSERT OVERWRITE
The exam frequently tests the choice between MERGE and other write patterns:
- Full load (replace everything): `INSERT OVERWRITE` or `.mode("overwrite")`
- Append new rows only: `.mode("append")`
- Update existing + insert new (upsert): `MERGE`
- Overwrite changed values in place, keeping no history (SCD Type 1): `MERGE` with `WHEN MATCHED THEN UPDATE`
- Insert new version + expire old (SCD Type 2): `MERGE` with additional logic
If the scenario mentions “update existing records and add new ones,” the answer is almost always MERGE.
A PySpark notebook crashes halfway through writing 10 million rows to a Delta table. What happens to the table?
Anika needs to add a 'discount_percentage' column to an existing Delta table that has 50 million rows. She writes a DataFrame with the new column. The write fails with a schema mismatch error. What should she do?
Carlos receives daily production data. Some records are updates to existing rows (changed status), others are new rows. He needs to update existing records and insert new ones in a single operation. Which Delta operation should he use?
🎬 Video coming soon
Next up: Loading Patterns: Full, Incremental & Streaming — design the right data loading strategy for every scenario.