Loading Patterns: Full, Incremental & Streaming
Design the right data loading strategy — full loads for small datasets, incremental for large ones, and streaming for real-time. Handle late-arriving data gracefully.
Three loading patterns
Think of three ways to update a photo album.
Full load — throw away all photos and reprint the entire album. Simple but wasteful. Fine if you have 50 photos.
Incremental load — only print new and changed photos, then slot them into the right places. Efficient but you need to know what changed.
Streaming — photos appear in the album the moment they’re taken, one by one, in real time. No waiting for a batch.
The right pattern depends on data volume, freshness requirements, and source system capabilities.
Full load (complete refresh)
Replace the entire target table with a complete copy of the source data. Every run processes ALL rows.
When to use full load
| Scenario | Why Full Load Works |
|---|---|
| Small reference tables (countries, product categories) | Volume is tiny — no benefit from incremental |
| Source system has no change tracking | You can’t identify what changed, so you load everything |
| Data quality requires a clean slate | Accumulated incremental errors are wiped each load |
| Initial load for a new table | First load is always full |
Implementation
```python
# PySpark — full load (overwrite)
df_source = spark.read.format("jdbc") \
    .option("url", source_connection) \
    .option("dbtable", "DimProduct") \
    .load()

df_source.write.format("delta") \
    .mode("overwrite") \
    .save("Tables/DimProduct")
```
Pros: simple, self-healing (bad data is replaced on the next run), no change tracking needed. Cons: slow for large tables, wastes compute and network, and each run replaces the table wholesale rather than preserving an incremental record of what changed.
Incremental load (changes only)
Load only rows that are new or changed since the last run. This requires a mechanism to detect changes.
Change detection methods
| Method | How It Works | Best For |
|---|---|---|
| Watermark column | Filter on a timestamp or ID column: WHERE modified_date > @lastLoadDate | Sources with reliable modified_date or auto-increment ID |
| Change Data Capture (CDC) / Change Feed | Source system publishes inserts, updates, deletes as a change feed | Azure SQL (CDC), Cosmos DB (change feed) — databases with built-in change tracking |
| Delta change data feed | Delta table's own change log — tracks row-level changes between versions | When the source is already a Delta table in Fabric |
| File arrival date | Process files that arrived since the last run (based on file modified timestamp) | File-based sources (CSV, JSON drops in blob storage) |
| Mirroring | Fabric mirrors the source database automatically in near real-time | When you want zero-code incremental replication (covered in a later module) |
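The file-arrival method from the table can be sketched in plain Python, without Spark: select only the files whose modified timestamp is newer than the last successful run. The directory layout and timings here are made up for illustration.

```python
import os
import tempfile
import time

def files_since(directory: str, last_run_epoch: float) -> list[str]:
    """Return files whose modified time is after the last successful run."""
    return sorted(
        entry.name
        for entry in os.scandir(directory)
        if entry.is_file() and entry.stat().st_mtime > last_run_epoch
    )

# Demo: two files that arrived before the last run, one after it
with tempfile.TemporaryDirectory() as d:
    for name, age_seconds in [("old1.csv", 3600), ("old2.csv", 7200), ("new.csv", 0)]:
        path = os.path.join(d, name)
        open(path, "w").close()
        mtime = time.time() - age_seconds
        os.utime(path, (mtime, mtime))    # backdate the file's modified time
    last_run = time.time() - 1800         # last pipeline run: 30 minutes ago
    new_files = files_since(d, last_run)
    print(new_files)                      # only the file that arrived since then
```

In a real pipeline the same comparison is done against blob-storage metadata rather than a local directory, but the selection logic is identical.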
Watermark pattern (most common)
```python
# PySpark — incremental load with watermark
last_load_date = spark.sql("""
    SELECT MAX(load_timestamp) FROM lakehouse.load_history
    WHERE table_name = 'FactOrders'
""").collect()[0][0]

# First run: no history row exists yet, so fall back to loading everything
if last_load_date is None:
    last_load_date = "1900-01-01"

df_new = spark.read.format("jdbc") \
    .option("url", source_connection) \
    .option("query", f"SELECT * FROM Orders WHERE modified_date > '{last_load_date}'") \
    .load()

# MERGE — update existing rows, insert new ones
df_new.createOrReplaceTempView("source_orders")
spark.sql("""
    MERGE INTO lakehouse.FactOrders AS target
    USING source_orders AS source
    ON target.order_id = source.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```
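The MERGE semantics can be shown in miniature with a plain-Python sketch (no Spark required): rows whose key matches an existing target row are updated, everything else is inserted. The table contents here are invented for illustration.

```python
def merge_upsert(target: dict, source: dict) -> dict:
    """Upsert: matched keys are updated, unmatched keys are inserted."""
    merged = dict(target)
    for key, row in source.items():
        merged[key] = row  # UPDATE if the key exists, INSERT otherwise
    return merged

fact_orders = {1: {"amount": 100}, 2: {"amount": 250}}
changed = {2: {"amount": 275}, 3: {"amount": 50}}  # order 2 updated, order 3 new

fact_orders = merge_upsert(fact_orders, changed)
print(fact_orders)
# order 1 untouched, order 2 now 275, order 3 inserted
```

Because the operation is keyed, feeding the same source row in twice is harmless, which is what makes MERGE a natural partner for overlapping watermark windows later in this unit.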
Scenario: Carlos's incremental pipeline
Precision Manufacturing’s FactProduction table has 500 million rows. A full load takes 3 hours and consumes significant compute.
Carlos switches to incremental loading using a watermark column (last_modified):
- Pipeline reads the last successful load timestamp from a control table
- Queries SAP for rows where `last_modified > @lastLoadTimestamp`
- Uses MERGE to upsert into the Delta table
- Updates the control table with the new timestamp
Result: daily loads drop from 3 hours to 12 minutes (processing ~200,000 changed rows instead of 500 million).
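The four steps of Carlos's pipeline can be sketched end to end in plain Python, with the control table, source, and target as in-memory stand-ins (all names here are illustrative, not real Fabric APIs):

```python
from datetime import datetime

control = {"FactProduction": datetime(2024, 1, 1)}   # control table: last load per table
source = [                                            # stand-in for the SAP source
    {"id": 1, "last_modified": datetime(2023, 12, 30), "qty": 5},
    {"id": 2, "last_modified": datetime(2024, 1, 2), "qty": 7},
]
target = {1: {"qty": 5}}                              # Delta table stand-in

def run_incremental(table: str, now: datetime) -> int:
    watermark = control[table]                        # 1. read last load timestamp
    changed = [r for r in source if r["last_modified"] > watermark]  # 2. filter changes
    for r in changed:                                 # 3. MERGE (upsert) into target
        target[r["id"]] = {"qty": r["qty"]}
    control[table] = now                              # 4. advance the watermark
    return len(changed)

n = run_incremental("FactProduction", datetime(2024, 1, 3))
print(n, target)  # only 1 changed row processed, not the whole table
```

Step 4 runs only after the merge succeeds; if the run fails midway, the watermark stays put and the next run safely retries the same window.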
Streaming load (continuous)
Data flows into Fabric as it’s generated — no waiting for a scheduled batch.
Streaming options in Fabric
| Tool | Latency | Best For |
|---|---|---|
| Eventstreams | Seconds | Event Hub, Kafka, IoT Hub sources → KQL database, lakehouse, custom endpoints |
| Spark Structured Streaming | Seconds to minutes | Complex transformations on streaming data in notebooks |
| Mirroring | Near real-time (minutes) | Database replication without custom code |
Streaming to a lakehouse
Spark Structured Streaming writes to Delta tables in a lakehouse using micro-batches:
```python
# PySpark — read from Event Hub, write to Delta table
from pyspark.sql.functions import col, from_json

# Assumes eh_config (Event Hubs connection options) and schema
# (the expected JSON structure) are defined earlier in the notebook
stream_df = spark.readStream \
    .format("eventhubs") \
    .options(**eh_config) \
    .load()

# Parse the JSON payload carried in each event's body
parsed = stream_df.select(
    from_json(col("body").cast("string"), schema).alias("data")
).select("data.*")

# Write as a streaming Delta table — the checkpoint tracks progress
parsed.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/checkpoints/orders") \
    .start("Tables/FactOrders_Streaming")
```
Exam tip: Checkpoint location
Every streaming write needs a checkpoint location — a folder where Spark tracks which data has been processed. Without it, restarting the stream would reprocess everything from the beginning.
Exam pattern: “Streaming notebook restarts and reprocesses all data” → missing or incorrect checkpoint location.
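What the checkpoint buys you can be shown with a toy file-based offset tracker. This is a plain-Python analogy, not the actual Spark checkpoint format: persist how far you got, and a restart resumes from there instead of from zero.

```python
import os
import tempfile

def process(events: list[str], checkpoint_path: str) -> list[str]:
    """Process only events past the checkpointed offset, then advance it."""
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            offset = int(f.read())         # resume from persisted progress
    new = events[offset:]
    with open(checkpoint_path, "w") as f:
        f.write(str(len(events)))          # persist progress for the next run
    return new

with tempfile.TemporaryDirectory() as d:
    cp = os.path.join(d, "offset")
    first = process(["e1", "e2", "e3"], cp)         # fresh start: everything is new
    second = process(["e1", "e2", "e3", "e4"], cp)  # "restart": picks up at the checkpoint
    print(first, second)
```

Delete the checkpoint file and the second call would reprocess all four events, which is exactly the duplicate-data failure mode the exam pattern describes.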
Handling late-arriving data
In real-world systems, data doesn’t always arrive on time. An order placed at 11:55 PM might not reach the data platform until 1:30 AM the next day.
| Strategy | How It Works | Trade-off |
|---|---|---|
| Watermark with buffer | Load data where modified_date > @lastLoad - 2 hours (overlap window) | Simple; may reprocess some rows (MERGE handles deduplication) |
| Event time vs processing time | Use the event’s business timestamp, not when it arrived | Requires the source to include a reliable business timestamp |
| Streaming watermark | Spark watermark drops data arriving more than N minutes late | Balances completeness vs memory usage |
| Reprocessing window | Re-run the pipeline for the previous 2 days every night | Catches late arrivals but increases compute cost |
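The buffer strategy from the table can be sketched in plain Python: widen the load window by two hours past the watermark, and let key-based dedup absorb any rows seen twice. The timestamps and row shapes below are invented for illustration.

```python
from datetime import datetime, timedelta

rows = [
    {"order_id": 1, "modified": datetime(2024, 1, 2, 20, 0)},   # already loaded yesterday
    {"order_id": 2, "modified": datetime(2024, 1, 2, 23, 55)},  # late arriver near midnight
    {"order_id": 3, "modified": datetime(2024, 1, 3, 9, 0)},    # today's row
]
last_load = datetime(2024, 1, 3, 0, 0)
cutoff = last_load - timedelta(hours=2)   # watermark minus a 2-hour buffer

batch = [r for r in rows if r["modified"] > cutoff]
# MERGE-style dedup: keyed on order_id, so re-reading a row is harmless
target: dict[int, dict] = {2: {"modified": datetime(2024, 1, 2, 23, 55)}}
for r in batch:
    target[r["order_id"]] = {"modified": r["modified"]}
print(sorted(target))
```

Order 2 falls inside the overlap window and is read a second time, but the keyed upsert overwrites rather than duplicates it, so the target stays clean.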
Scenario: Anika's late-arriving orders
ShopStream processes orders from 6 payment gateways. Some gateways batch-send confirmations with a 2-hour delay. Anika’s solution:
- Incremental load uses a watermark with a 2-hour buffer: `WHERE order_date > @lastLoad - INTERVAL 2 HOURS`
- MERGE handles deduplication — if an order was already loaded, it updates instead of inserting a duplicate
- For the streaming pipeline, she sets a Spark watermark: `.withWatermark("event_time", "2 hours")` — events arriving more than 2 hours late are dropped
The buffer window means some rows are processed twice, but MERGE ensures the table stays clean.
Choosing the right pattern
| Factor | Full | Incremental | Streaming |
|---|---|---|---|
| Data volume | Small (under 1M rows) | Large (millions+) | Continuous flow |
| Freshness requirement | Daily/weekly is fine | Daily/hourly | Seconds to minutes |
| Source supports change tracking? | Doesn’t matter | Strongly preferred | Needs an event stream or change feed |
| Complexity | Low | Medium | High |
| Compute cost | High per run (but simple) | Low per run | Continuous (always running) |
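The decision table above can be condensed into a toy rule-of-thumb function. The thresholds are illustrative, not official guidance: real decisions also weigh source capabilities, team skills, and cost.

```python
def choose_pattern(rows: int, freshness_minutes: int, has_change_tracking: bool) -> str:
    """Rough heuristic mirroring the factors in the table above."""
    if freshness_minutes < 60:
        return "streaming"        # seconds-to-minutes freshness needs a stream
    if rows > 1_000_000 and has_change_tracking:
        return "incremental"      # large volume plus detectable changes
    return "full"                 # small or untrackable: just reload everything

print(choose_pattern(5_000, 24 * 60, False))        # small dimension table
print(choose_pattern(500_000_000, 24 * 60, True))   # Carlos's fact table
print(choose_pattern(10_000, 1, True))              # real-time order feed
```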
Precision Manufacturing's DimProduct table has 5,000 rows. It changes infrequently (a few products added per week). Carlos needs to keep the lakehouse copy in sync. Which loading pattern is most appropriate?
A streaming pipeline writes orders to a Delta table. After a Spark cluster restart, the pipeline reprocesses all orders from the beginning, creating duplicates. What is the most likely cause?
Anika's daily incremental pipeline loads orders using a watermark. She notices that orders placed at 11:50 PM are sometimes missing from the next day's load (which filters for modified_date > yesterday 12:00 AM). What should she change?
🎬 Video coming soon
Next up: Dimensional Modeling: Prep for Analytics — design star schemas and slowly changing dimensions for your lakehouse.