Loading Patterns: Full, Incremental & Streaming
Design the right data loading strategy — full loads for small datasets, incremental for large ones, and streaming for real-time. Handle late-arriving data gracefully.
Three loading patterns
Think of three ways to update a photo album.
Full load — throw away all photos and reprint the entire album. Simple but wasteful. Fine if you have 50 photos.
Incremental load — only print new and changed photos, then slot them into the right places. Efficient but you need to know what changed.
Streaming — photos appear in the album the moment they’re taken, one by one, in real time. No waiting for a batch.
The right pattern depends on data volume, freshness requirements, and source system capabilities.
Full load (complete refresh)
Replace the entire target table with a complete copy of the source data. Every run processes ALL rows.
When to use full load
| Scenario | Why Full Load Works |
|---|---|
| Small reference tables (countries, product categories) | Volume is tiny — no benefit from incremental |
| Source system has no change tracking | You can’t identify what changed, so you load everything |
| Data quality requires a clean slate | Accumulated incremental errors are wiped each load |
| Initial load for a new table | First load is always full |
Implementation
```python
# PySpark — full load (overwrite)
df_source = spark.read.format("jdbc") \
    .option("url", source_connection) \
    .option("dbtable", "DimProduct") \
    .load()

df_source.write.format("delta") \
    .mode("overwrite") \
    .save("Tables/DimProduct")
```
Pros: simple, self-healing (bad data is replaced on the next run), no change tracking needed. Cons: slow for large tables, wastes compute and network, and each run replaces the table wholesale rather than preserving an incremental record of what changed.
Incremental load (changes only)
Load only rows that are new or changed since the last run. This requires a mechanism to detect changes.
Change detection methods
| Method | How It Works | Best For |
|---|---|---|
| Watermark column | Filter on a timestamp or ID column: WHERE modified_date > @lastLoadDate | Sources with reliable modified_date or auto-increment ID |
| Change Data Capture (CDC) / Change Feed | Source system publishes inserts, updates, deletes as a change feed | Azure SQL (CDC), Cosmos DB (change feed) — databases with built-in change tracking |
| Delta change data feed | Delta table's own change log — tracks row-level changes between versions | When the source is already a Delta table in Fabric |
| File arrival date | Process files that arrived since the last run (based on file modified timestamp) | File-based sources (CSV, JSON drops in blob storage) |
| Mirroring | Fabric mirrors the source database automatically in near real-time | When you want zero-code incremental replication (covered in a later module) |
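The file-arrival method from the table can be sketched in plain Python, without Spark: select only the files whose modified timestamp is newer than the last successful run. The directory layout and timings here are made up for illustration.

```python
import os
import tempfile
import time

def files_since(directory: str, last_run_epoch: float) -> list[str]:
    """Return files whose modified time is after the last successful run."""
    return sorted(
        entry.name
        for entry in os.scandir(directory)
        if entry.is_file() and entry.stat().st_mtime > last_run_epoch
    )

# Demo: two files that arrived before the last run, one after it
with tempfile.TemporaryDirectory() as d:
    for name, age_seconds in [("old1.csv", 3600), ("old2.csv", 7200), ("new.csv", 0)]:
        path = os.path.join(d, name)
        open(path, "w").close()
        mtime = time.time() - age_seconds
        os.utime(path, (mtime, mtime))    # backdate the file's modified time
    last_run = time.time() - 1800         # last pipeline run: 30 minutes ago
    new_files = files_since(d, last_run)
    print(new_files)                      # only the file that arrived since then
```

In a real pipeline the same comparison is done against blob-storage metadata rather than a local directory, but the selection logic is identical.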
Watermark pattern (most common)
```python
# PySpark — incremental load with watermark
last_load_date = spark.sql("""
    SELECT MAX(load_timestamp) FROM lakehouse.load_history
    WHERE table_name = 'FactOrders'
""").collect()[0][0]

# First run: no history row exists yet, so fall back to loading everything
if last_load_date is None:
    last_load_date = "1900-01-01"

df_new = spark.read.format("jdbc") \
    .option("url", source_connection) \
    .option("query", f"SELECT * FROM Orders WHERE modified_date > '{last_load_date}'") \
    .load()

# MERGE — update existing rows, insert new ones
df_new.createOrReplaceTempView("source_orders")
spark.sql("""
    MERGE INTO lakehouse.FactOrders AS target
    USING source_orders AS source
    ON target.order_id = source.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```
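The MERGE semantics can be shown in miniature with a plain-Python sketch (no Spark required): rows whose key matches an existing target row are updated, everything else is inserted. The table contents here are invented for illustration.

```python
def merge_upsert(target: dict, source: dict) -> dict:
    """Upsert: matched keys are updated, unmatched keys are inserted."""
    merged = dict(target)
    for key, row in source.items():
        merged[key] = row  # UPDATE if the key exists, INSERT otherwise
    return merged

fact_orders = {1: {"amount": 100}, 2: {"amount": 250}}
changed = {2: {"amount": 275}, 3: {"amount": 50}}  # order 2 updated, order 3 new

fact_orders = merge_upsert(fact_orders, changed)
print(fact_orders)
# order 1 untouched, order 2 now 275, order 3 inserted
```

Because the operation is keyed, feeding the same source row in twice is harmless, which is what makes MERGE a natural partner for overlapping watermark windows later in this unit.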
Scenario: Carlos's incremental pipeline
Precision Manufacturing’s FactProduction table has 500 million rows. A full load takes 3 hours and consumes significant compute.
Carlos switches to incremental loading using a watermark column (last_modified):
- Pipeline reads the last successful load timestamp from a control table
- Queries SAP for rows where `last_modified > @lastLoadTimestamp`
- Uses MERGE to upsert into the Delta table
- Updates the control table with the new timestamp
Result: daily loads drop from 3 hours to 12 minutes (processing ~200,000 changed rows instead of 500 million).
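The four steps of Carlos's pipeline can be sketched end to end in plain Python, with the control table, source, and target as in-memory stand-ins (all names here are illustrative, not real Fabric APIs):

```python
from datetime import datetime

control = {"FactProduction": datetime(2024, 1, 1)}   # control table: last load per table
source = [                                            # stand-in for the SAP source
    {"id": 1, "last_modified": datetime(2023, 12, 30), "qty": 5},
    {"id": 2, "last_modified": datetime(2024, 1, 2), "qty": 7},
]
target = {1: {"qty": 5}}                              # Delta table stand-in

def run_incremental(table: str, now: datetime) -> int:
    watermark = control[table]                        # 1. read last load timestamp
    changed = [r for r in source if r["last_modified"] > watermark]  # 2. filter changes
    for r in changed:                                 # 3. MERGE (upsert) into target
        target[r["id"]] = {"qty": r["qty"]}
    control[table] = now                              # 4. advance the watermark
    return len(changed)

n = run_incremental("FactProduction", datetime(2024, 1, 3))
print(n, target)  # only 1 changed row processed, not the whole table
```

Step 4 runs only after the merge succeeds; if the run fails midway, the watermark stays put and the next run safely retries the same window.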
Streaming load (continuous)
Data flows into Fabric as it’s generated — no waiting for a scheduled batch.
Streaming options in Fabric
| Tool | Latency | Best For |
|---|---|---|
| Eventstreams | Seconds | Event Hub, Kafka, IoT Hub sources → KQL database, lakehouse, custom endpoints |
| Spark Structured Streaming | Seconds to minutes | Complex transformations on streaming data in notebooks |
| Mirroring | Near real-time (minutes) | Database replication without custom code |
Streaming to a lakehouse
Spark Structured Streaming writes to Delta tables in a lakehouse using micro-batches:
```python
# PySpark — read from Event Hub, write to Delta table
from pyspark.sql.functions import col, from_json

# Assumes eh_config (Event Hubs connection options) and schema
# (the expected JSON structure) are defined earlier in the notebook
stream_df = spark.readStream \
    .format("eventhubs") \
    .options(**eh_config) \
    .load()

# Parse the JSON payload carried in each event's body
parsed = stream_df.select(
    from_json(col("body").cast("string"), schema).alias("data")
).select("data.*")

# Write as a streaming Delta table — the checkpoint tracks progress
parsed.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/checkpoints/orders") \
    .start("Tables/FactOrders_Streaming")
```
Exam tip: Checkpoint location
Every streaming write needs a checkpoint location — a folder where Spark tracks which data has been processed. Without it, restarting the stream would reprocess everything from the beginning.
Exam pattern: “Streaming notebook restarts and reprocesses all data” → missing or incorrect checkpoint location.
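What the checkpoint buys you can be shown with a toy file-based offset tracker. This is a plain-Python analogy, not the actual Spark checkpoint format: persist how far you got, and a restart resumes from there instead of from zero.

```python
import os
import tempfile

def process(events: list[str], checkpoint_path: str) -> list[str]:
    """Process only events past the checkpointed offset, then advance it."""
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            offset = int(f.read())         # resume from persisted progress
    new = events[offset:]
    with open(checkpoint_path, "w") as f:
        f.write(str(len(events)))          # persist progress for the next run
    return new

with tempfile.TemporaryDirectory() as d:
    cp = os.path.join(d, "offset")
    first = process(["e1", "e2", "e3"], cp)         # fresh start: everything is new
    second = process(["e1", "e2", "e3", "e4"], cp)  # "restart": picks up at the checkpoint
    print(first, second)
```

Delete the checkpoint file and the second call would reprocess all four events, which is exactly the duplicate-data failure mode the exam pattern describes.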
Handling late-arriving data
In real-world systems, data doesn’t always arrive on time. An order placed at 11:55 PM might not reach the data platform until 1:30 AM the next day.
| Strategy | How It Works | Trade-off |
|---|---|---|
| Watermark with buffer | Load data where modified_date > @lastLoad - 2 hours (overlap window) | Simple; may reprocess some rows (MERGE handles deduplication) |
| Event time vs processing time | Use the event’s business timestamp, not when it arrived | Requires the source to include a reliable business timestamp |
| Streaming watermark | Spark watermark drops data arriving more than N minutes late | Balances completeness vs memory usage |
| Reprocessing window | Re-run the pipeline for the previous 2 days every night | Catches late arrivals but increases compute cost |
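The buffer strategy from the table can be sketched in plain Python: widen the load window by two hours past the watermark, and let key-based dedup absorb any rows seen twice. The timestamps and row shapes below are invented for illustration.

```python
from datetime import datetime, timedelta

rows = [
    {"order_id": 1, "modified": datetime(2024, 1, 2, 20, 0)},   # already loaded yesterday
    {"order_id": 2, "modified": datetime(2024, 1, 2, 23, 55)},  # late arriver near midnight
    {"order_id": 3, "modified": datetime(2024, 1, 3, 9, 0)},    # today's row
]
last_load = datetime(2024, 1, 3, 0, 0)
cutoff = last_load - timedelta(hours=2)   # watermark minus a 2-hour buffer

batch = [r for r in rows if r["modified"] > cutoff]
# MERGE-style dedup: keyed on order_id, so re-reading a row is harmless
target: dict[int, dict] = {2: {"modified": datetime(2024, 1, 2, 23, 55)}}
for r in batch:
    target[r["order_id"]] = {"modified": r["modified"]}
print(sorted(target))
```

Order 2 falls inside the overlap window and is read a second time, but the keyed upsert overwrites rather than duplicates it, so the target stays clean.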
Scenario: Anika's late-arriving orders
ShopStream processes orders from 6 payment gateways. Some gateways batch-send confirmations with a 2-hour delay. Anika’s solution:
- Incremental load uses a watermark with a 2-hour buffer: `WHERE order_date > @lastLoad - INTERVAL 2 HOURS`
- MERGE handles deduplication — if an order was already loaded, it updates instead of inserting a duplicate
- For the streaming pipeline, she sets a Spark watermark: `.withWatermark("event_time", "2 hours")` — events arriving more than 2 hours late are dropped
The buffer window means some rows are processed twice, but MERGE ensures the table stays clean.
Choosing the right pattern
| Factor | Full | Incremental | Streaming |
|---|---|---|---|
| Data volume | Small (under 1M rows) | Large (millions+) | Continuous flow |
| Freshness requirement | Daily/weekly is fine | Daily/hourly | Seconds to minutes |
| Source supports change tracking? | Doesn’t matter | Strongly preferred | Needs an event stream or change feed |
| Complexity | Low | Medium | High |
| Compute cost | High per run (but simple) | Low per run | Continuous (always running) |
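The decision table above can be condensed into a toy rule-of-thumb function. The thresholds are illustrative, not official guidance: real decisions also weigh source capabilities, team skills, and cost.

```python
def choose_pattern(rows: int, freshness_minutes: int, has_change_tracking: bool) -> str:
    """Rough heuristic mirroring the factors in the table above."""
    if freshness_minutes < 60:
        return "streaming"        # seconds-to-minutes freshness needs a stream
    if rows > 1_000_000 and has_change_tracking:
        return "incremental"      # large volume plus detectable changes
    return "full"                 # small or untrackable: just reload everything

print(choose_pattern(5_000, 24 * 60, False))        # small dimension table
print(choose_pattern(500_000_000, 24 * 60, True))   # Carlos's fact table
print(choose_pattern(10_000, 1, True))              # real-time order feed
```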
Precision Manufacturing's DimProduct table has 5,000 rows. It changes infrequently (a few products added per week). Carlos needs to keep the lakehouse copy in sync. Which loading pattern is most appropriate?
A streaming pipeline writes orders to a Delta table. After a Spark cluster restart, the pipeline reprocesses all orders from the beginning, creating duplicates. What is the most likely cause?
Anika's daily incremental pipeline loads orders using a watermark. She notices that orders placed at 11:50 PM are sometimes missing from the next day's load (which filters for modified_date > yesterday 12:00 AM). What should she change?
🎬 Video coming soon
Next up: Dimensional Modeling: Prep for Analytics — design star schemas and slowly changing dimensions for your lakehouse.