Delta Lake: The Heart of Fabric
Understand Delta Lake — the open-source storage layer that powers every lakehouse table in Fabric. ACID transactions, time travel, schema enforcement, and the transaction log explained.
What is Delta Lake?
Think of a notebook with a built-in undo button, a table of contents, and a lock on the cover.
Old data lakes were just folders full of files — like loose papers on a desk. If two people edited the same file, you got corruption. If you needed last Tuesday’s version, tough luck. If someone wrote bad data, it mixed in with the good.
Delta Lake fixes all of this. It’s a storage format that wraps Parquet files with a transaction log. That log acts as the undo button (time travel), the table of contents (which files make up the current version), and the lock (writes are coordinated through the log, so concurrent jobs can’t corrupt the table).
In Microsoft Fabric, every lakehouse table is a Delta table. You don’t choose Delta — it’s the default. Understanding Delta is understanding how your data is stored, versioned, and protected.
The transaction log
The transaction log is what makes Delta Lake different from plain Parquet. It’s a folder called _delta_log/ inside every Delta table, containing JSON files numbered sequentially.
How it works
| Version | Log Entry | Effect |
|---|---|---|
| 0 | 00000000000000000000.json — Add part-0001.parquet, part-0002.parquet | Initial table creation with 2 data files |
| 1 | 00000000000000000001.json — Add part-0003.parquet | New data appended |
| 2 | 00000000000000000002.json — Remove part-0001.parquet, Add part-0004.parquet | Update operation: old file replaced with new |
Every 10 commits, Delta creates a checkpoint file (Parquet format) that summarises the current state — so readers don’t need to replay all JSON logs from the beginning.
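The replay idea can be sketched in a few lines of plain Python. This is a simplified model of the log, not Delta's actual implementation, and the file names are illustrative:

```python
# Simplified model of a Delta transaction log: each commit lists
# add/remove actions, like the JSON entries in _delta_log/.
commits = [
    [{"add": "part-0001.parquet"}, {"add": "part-0002.parquet"}],    # version 0
    [{"add": "part-0003.parquet"}],                                   # version 1
    [{"remove": "part-0001.parquet"}, {"add": "part-0004.parquet"}],  # version 2
]

def current_files(commits, as_of=None):
    """Replay the log up to a version to find the live data files."""
    live = set()
    for version, actions in enumerate(commits):
        if as_of is not None and version > as_of:
            break
        for action in actions:
            if "add" in action:
                live.add(action["add"])
            else:
                live.discard(action["remove"])
    return sorted(live)

print(current_files(commits))           # latest: parts 0002, 0003, 0004
print(current_files(commits, as_of=0))  # time travel: parts 0001, 0002
```

A checkpoint plays the same role as precomputing `live` at some version, so readers start from that snapshot instead of replaying from version 0.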
Scenario: Carlos discovers the transaction log
Carlos at Precision Manufacturing investigates why a lakehouse table is using more storage than expected. He browses the OneLake file structure:
```
Tables/FactProduction/
├── _delta_log/
│   ├── 00000000000000000000.json
│   ├── 00000000000000000001.json
│   ├── ...
│   ├── 00000000000000000040.checkpoint.parquet
│   └── 00000000000000000042.json
├── part-00000-...snappy.parquet
├── part-00001-...snappy.parquet
└── ... (hundreds of Parquet files)
```

He realises that old versions of data files still exist on disk — Delta marks them as “removed” in the log but doesn’t physically delete them (that’s what VACUUM does — covered in the optimisation module).
ACID transactions
Delta Lake provides full ACID guarantees — something traditional data lakes couldn’t offer.
| Property | What It Means | Example |
|---|---|---|
| Atomicity | A write either fully succeeds or fully fails — no partial writes | A notebook crashes mid-write → the table stays in its previous valid state |
| Consistency | Schema enforcement prevents bad data from entering | A CSV with a missing column is rejected before any rows are written |
| Isolation | Concurrent readers see a consistent snapshot while writers are active | A BI report queries the table while an ETL job is writing — the report sees the previous complete version |
| Durability | Once committed, data survives system failures | A power outage after commit → data is safe in OneLake |
Why ACID matters for data engineering
Without ACID, a failed write can leave your table in a half-updated state. Imagine loading 10 million rows — if the job crashes at row 5 million, you have a table with incomplete data. Readers might see inconsistent results. You’d need to manually clean up and retry.
With Delta Lake’s ACID transactions, the partially written files exist on disk but are NOT registered in the transaction log. The table stays at its previous version. When you retry the load, it starts fresh.
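This two-step write path can be modelled as a toy sketch. It captures the commit protocol conceptually; it is not Delta's real code, and the names are made up:

```python
# Toy model of Delta's write path: data files land on disk first, but a
# new table version exists only once its log entry is committed.
log = [{"version": 0, "files": ["part-0001.parquet"]}]  # committed versions
disk = {"part-0001.parquet"}                            # physical storage

def visible_files():
    """Readers resolve the table from the last committed log entry."""
    return set(log[-1]["files"])

def write_batch(new_files, crash_before_commit=False):
    disk.update(new_files)              # step 1: write the Parquet files
    if crash_before_commit:
        return                          # crash: no commit, table unchanged
    log.append({"version": log[-1]["version"] + 1,
                "files": sorted(disk)}) # step 2: commit the log entry

write_batch(["part-0002.parquet"], crash_before_commit=True)
print(visible_files())         # still only part-0001: version 0 intact
print(disk - visible_files())  # part-0002 is an orphan awaiting VACUUM
```

The orphaned file is exactly what Carlos saw in the storage browser: physically present, but invisible to every reader because no committed log entry references it.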
Time travel
Every commit creates a new version. You can query any previous version by number or timestamp.
Query by version
```sql
-- Spark SQL: query the table as of a specific version
SELECT * FROM lakehouse.FactSales VERSION AS OF 5;

-- The path-based form works too
SELECT * FROM delta.`/Tables/FactSales` VERSION AS OF 5;
```
Note: VERSION AS OF is supported in Spark SQL. The lakehouse SQL analytics endpoint does not support time travel queries.
Query by timestamp
```python
# PySpark: query the table as of a specific timestamp
df = spark.read.format("delta") \
    .option("timestampAsOf", "2026-04-20T10:00:00") \
    .load("/Tables/FactSales")
```
Common time travel use cases
| Use Case | How |
|---|---|
| Audit — “What did the table look like last Tuesday?” | Query by timestamp |
| Debug — “What changed in the last load?” | Compare version N vs N-1 |
| Rollback — “The last load was bad, restore the previous version” | RESTORE TABLE FactSales TO VERSION AS OF 41 |
| Reproducibility — “Rerun yesterday’s report with yesterday’s data” | Query by timestamp matching the report date |
Scenario: Anika rolls back a bad load
At ShopStream, Anika’s overnight pipeline loads order data. On Wednesday morning, the analytics team reports that revenue numbers are doubled. Anika investigates:
- She queries the table’s history: `DESCRIBE HISTORY lakehouse.FactOrders` — version 87 was the bad load
- She checks what version 86 looked like: `SELECT COUNT(*) FROM lakehouse.FactOrders VERSION AS OF 86` — correct count
- She restores: `RESTORE TABLE lakehouse.FactOrders TO VERSION AS OF 86`
- Revenue numbers return to normal immediately
Total downtime: 4 minutes. Without Delta time travel, she’d need to rebuild the table from raw files — potentially hours.
Schema enforcement and evolution
Schema enforcement (write-side)
Delta Lake rejects writes that don’t match the existing table schema. If your table has columns (id INT, name STRING, amount DECIMAL) and you try to write data with an extra discount column, the write fails.
This prevents schema drift — the silent corruption where unexpected columns sneak into your data.
Schema evolution (controlled changes)
When you intentionally want to add columns, you enable mergeSchema:
```python
# Append with schema evolution: new columns are merged into the table
df.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("/Tables/FactOrders")
```
Existing rows get NULL for the new column. Future reads include it.
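Both behaviours can be modelled with a toy schema check in plain Python. This is not Delta's implementation; the column names simply echo the example above:

```python
# Toy write-side schema check: mimic enforcement (reject extra columns)
# and evolution (merge extras only when explicitly allowed).
table_schema = {"id", "name", "amount"}

def validate_write(batch_columns, merge_schema=False):
    extra = set(batch_columns) - table_schema
    if extra and not merge_schema:
        raise ValueError(f"Schema mismatch: unexpected columns {sorted(extra)}")
    return table_schema | set(batch_columns)  # evolved (or unchanged) schema

validate_write({"id", "name", "amount"})  # matching write: accepted

try:  # enforcement: the extra 'discount' column is rejected
    validate_write({"id", "name", "amount", "discount"})
except ValueError as err:
    print(err)

# evolution: the same write succeeds once mergeSchema is enabled
print(sorted(validate_write({"id", "name", "amount", "discount"},
                            merge_schema=True)))
```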
| Feature | Schema Enforcement | Schema Evolution |
|---|---|---|
| Default behaviour | ON — rejects mismatched schemas | OFF — must explicitly enable |
| Purpose | Prevent accidental schema changes | Allow intentional schema changes |
| How to override | Cannot disable (by design) | Set mergeSchema=true on the write operation |
| When it applies | Every write operation | Only when mergeSchema is enabled |
| Exam pattern | 'Write fails with schema mismatch' — expected behaviour | 'New column needs to be added to existing table' — use mergeSchema |
Delta MERGE (upsert)
MERGE is one of the most important Delta Lake operations for data engineering. It combines INSERT, UPDATE, and DELETE in a single atomic operation.
```sql
MERGE INTO target_table AS target
USING source_data AS source
ON target.id = source.id
WHEN MATCHED THEN
  UPDATE SET target.amount = source.amount
WHEN NOT MATCHED THEN
  INSERT (id, name, amount) VALUES (source.id, source.name, source.amount)
```
Use cases: incremental data loads, slowly changing dimensions (SCD), deduplication.
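The matched/not-matched branching can be illustrated with a toy in-memory upsert. This is plain Python, not Delta's engine, and the row shapes are invented:

```python
# Toy upsert mirroring MERGE semantics: matched keys are updated,
# unmatched keys are inserted, and the result lands in one step.
target = {
    1: {"name": "Widget", "amount": 10.0},
    2: {"name": "Gadget", "amount": 25.0},
}
source = [
    {"id": 2, "name": "Gadget", "amount": 30.0},  # existing id: update
    {"id": 3, "name": "Gizmo", "amount": 5.0},    # new id: insert
]

def merge_upsert(target, source):
    result = {k: dict(v) for k, v in target.items()}  # copy: all-or-nothing
    for row in source:
        key = row["id"]
        if key in result:                             # WHEN MATCHED THEN UPDATE
            result[key]["amount"] = row["amount"]
        else:                                         # WHEN NOT MATCHED THEN INSERT
            result[key] = {"name": row["name"], "amount": row["amount"]}
    return result

merged = merge_upsert(target, source)
print(merged[2]["amount"])  # 30.0 (updated)
print(merged[3]["name"])    # Gizmo (inserted)
```

The copy-then-swap step mirrors why MERGE is atomic: readers see either the old table or the fully merged one, never a half-applied mixture.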
Exam tip: MERGE vs INSERT OVERWRITE
The exam frequently tests the choice between MERGE and other write patterns:
- Full load (replace everything): `INSERT OVERWRITE` or `.mode("overwrite")`
- Append new rows only: `.mode("append")`
- Update existing + insert new (upsert): `MERGE`
- Overwrite changed values in place, keeping no history (SCD Type 1): `MERGE` with `WHEN MATCHED THEN UPDATE`
- Insert new version + expire old (SCD Type 2): `MERGE` with additional logic
If the scenario mentions “update existing records and add new ones,” the answer is almost always MERGE.
A PySpark notebook crashes halfway through writing 10 million rows to a Delta table. What happens to the table?
Anika needs to add a 'discount_percentage' column to an existing Delta table that has 50 million rows. She writes a DataFrame with the new column. The write fails with a schema mismatch error. What should she do?
Carlos receives daily production data. Some records are updates to existing rows (changed status), others are new rows. He needs to update existing records and insert new ones in a single operation. Which Delta operation should he use?
🎬 Video coming soon
Next up: Loading Patterns: Full, Incremental & Streaming — design the right data loading strategy for every scenario.