
DP-700 Study Guide

Domain 1: Implement and Manage an Analytics Solution

  • Workspace Settings: Your Fabric Foundation
  • Version Control: Git in Fabric
  • Deployment Pipelines: Dev to Production
  • Access Controls: Who Gets In
  • Data Security: Control Who Sees What
  • Governance: Labels, Endorsement & Audit
  • Orchestration: Pick the Right Tool
  • Pipeline Patterns: Parameters & Expressions

Domain 2: Ingest and Transform Data

  • Delta Lake: The Heart of Fabric Free
  • Loading Patterns: Full, Incremental & Streaming Free
  • Dimensional Modeling: Prep for Analytics Free
  • Data Stores & Tools: Make the Right Choice Free
  • OneLake Shortcuts: Data Without Duplication
  • Mirroring: Real-Time Database Replication
  • PySpark Transformations: Code Your Pipeline
  • Transform Data with SQL & KQL
  • Eventstreams & Spark Streaming: Real-Time Ingestion
  • Real-Time Intelligence: KQL & Windowing

Domain 3: Monitor and Optimize an Analytics Solution

  • Monitoring & Alerts: Catch Problems Early
  • Troubleshoot Pipelines & Dataflows
  • Troubleshoot Notebooks & SQL
  • Troubleshoot Streaming & Shortcuts
  • Optimize Lakehouse Tables: Delta Tuning
  • Optimize Spark: Speed Up Your Code
  • Optimize Pipelines & Warehouses
  • Optimize Streaming: Real-Time Performance

Domain 2: Ingest and Transform Data · Free · ⏱ ~14 min read

Loading Patterns: Full, Incremental & Streaming

Design the right data loading strategy — full loads for small datasets, incremental for large ones, and streaming for real-time. Handle late-arriving data gracefully.

Three loading patterns

☕ Simple explanation

Think of three ways to update a photo album.

Full load — throw away all photos and reprint the entire album. Simple but wasteful. Fine if you have 50 photos.

Incremental load — only print new and changed photos, then slot them into the right places. Efficient but you need to know what changed.

Streaming — photos appear in the album the moment they’re taken, one by one, in real time. No waiting for a batch.

The right pattern depends on data volume, freshness requirements, and source system capabilities.

Data loading patterns determine how data moves from source systems into Fabric lakehouses and warehouses. The three primary patterns — full load, incremental load, and streaming load — each have different trade-offs in complexity, latency, resource usage, and data freshness.

The DP-700 exam tests your ability to select the right pattern for a given scenario, considering factors like data volume, change frequency, source capabilities (CDC, watermarks, timestamps), and business latency requirements.
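The scale of the trade-off between the two batch patterns can be quantified with a toy sketch in plain Python (the row counts are illustrative, not Fabric-specific):

```python
# Toy comparison of rows processed per run by each batch pattern.

def rows_processed_full(total_rows: int) -> int:
    """A full load rereads every source row on every run."""
    return total_rows

def rows_processed_incremental(changed_rows: int) -> int:
    """An incremental load touches only rows changed since the last run."""
    return changed_rows

total, changed = 500_000_000, 200_000  # large fact table, small daily delta
print(rows_processed_full(total) // rows_processed_incremental(changed))  # 2500
```

For a table this size, incremental processing touches 2,500 times fewer rows per run, which is why the pattern choice dominates cost once volumes grow.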

Full load (complete refresh)

Replace the entire target table with a complete copy of the source data. Every run processes ALL rows.

When to use full load

| Scenario | Why Full Load Works |
| --- | --- |
| Small reference tables (countries, product categories) | Volume is tiny — no benefit from incremental |
| Source system has no change tracking | You can't identify what changed, so you load everything |
| Data quality requires a clean slate | Accumulated incremental errors are wiped each load |
| Initial load for a new table | First load is always full |

Implementation

# PySpark — full load (overwrite)
# source_connection (the JDBC URL) is assumed to be defined earlier
df_source = spark.read.format("jdbc") \
    .option("url", source_connection) \
    .option("dbtable", "DimProduct") \
    .load()

df_source.write.format("delta") \
    .mode("overwrite") \
    .save("Tables/DimProduct")

Pros: Simple, self-healing (bad data gets replaced), no change tracking needed. Cons: Slow for large tables, wastes compute/network, loses table history (overwrite).

Incremental load (changes only)

Load only rows that are new or changed since the last run. This requires a mechanism to detect changes.

Change detection methods

Choose the detection method based on what the source system supports:

| Method | How It Works | Best For |
| --- | --- | --- |
| Watermark column | Filter on a timestamp or ID column: WHERE modified_date > @lastLoadDate | Sources with a reliable modified_date or auto-increment ID |
| Change Data Capture (CDC) / change feed | Source system publishes inserts, updates, and deletes as a change feed | Azure SQL (CDC), Cosmos DB (change feed) — databases with built-in change tracking |
| Delta change data feed | Delta table's own change log — tracks row-level changes between versions | When the source is already a Delta table in Fabric |
| File arrival date | Process files that arrived since the last run (based on file modified timestamp) | File-based sources (CSV, JSON drops in blob storage) |
| Mirroring | Fabric mirrors the source database automatically in near real time | Zero-code incremental replication (covered in a later module) |
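To see what a change feed buys you, here is a minimal pure-Python sketch (not a Fabric or Delta API) that applies a CDC-style stream of insert, update, and delete events to a target keyed by id:

```python
# Minimal CDC-style apply loop. Illustrative only: real change feeds come
# from the source system (e.g. Azure SQL CDC or a Delta change data feed).

def apply_change_feed(target: dict, changes: list) -> dict:
    """Apply (operation, key, row) events in order to a keyed target table."""
    for op, key, row in changes:
        if op in ("insert", "update"):
            target[key] = row        # upsert semantics
        elif op == "delete":
            target.pop(key, None)    # tolerate deletes of already-missing keys
    return target

target = {1: {"status": "open"}}
changes = [
    ("update", 1, {"status": "shipped"}),
    ("insert", 2, {"status": "open"}),
    ("delete", 1, None),
]
print(apply_change_feed(target, changes))  # {2: {'status': 'open'}}
```

The key point: because the feed carries the operation type, the loader never has to diff the source against the target, it just replays the changes.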

Watermark pattern (most common)

# PySpark — incremental load with watermark
last_load_date = spark.sql("""
    SELECT MAX(load_timestamp) FROM lakehouse.load_history 
    WHERE table_name = 'FactOrders'
""").collect()[0][0]

df_new = spark.read.format("jdbc") \
    .option("url", source_connection) \
    .option("query", f"SELECT * FROM Orders WHERE modified_date > '{last_load_date}'") \
    .load()

# MERGE — update existing, insert new
df_new.createOrReplaceTempView("source_orders")
spark.sql("""
    MERGE INTO lakehouse.FactOrders AS target
    USING source_orders AS source
    ON target.order_id = source.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
💡 Scenario: Carlos's incremental pipeline

Precision Manufacturing’s FactProduction table has 500 million rows. A full load takes 3 hours and burns significant compute.

Carlos switches to incremental loading using a watermark column (last_modified):

  1. Pipeline reads the last successful load timestamp from a control table
  2. Queries SAP for rows where last_modified > @lastLoadTimestamp
  3. Uses MERGE to upsert into the Delta table
  4. Updates the control table with the new timestamp

Result: daily loads drop from 3 hours to 12 minutes (processing ~200,000 changed rows instead of 500 million).
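Stripped of Spark and SAP, Carlos's four steps reduce to a small control-table loop. A pure-Python sketch (the names `load_incremental` and `control` are invented for illustration):

```python
# Watermark-driven incremental load, simulated in plain Python.
# Source rows carry a last_modified value; `control` stores the
# high-water mark from the previous successful run.

def load_incremental(source, target, control, table, now):
    last = control.get(table, 0)                                 # 1. read last watermark
    new_rows = [r for r in source if r["last_modified"] > last]  # 2. filter the source
    for r in new_rows:                                           # 3. MERGE-style upsert by key
        target[r["id"]] = r
    control[table] = now                                         # 4. advance the watermark
    return len(new_rows)

source = [
    {"id": 1, "last_modified": 5},   # unchanged since the last run
    {"id": 2, "last_modified": 12},  # modified after the watermark
]
target, control = {}, {"FactOrders": 10}
print(load_incremental(source, target, control, "FactOrders", now=15))  # 1
```

Only the row modified after the stored watermark is processed; the control table is updated last, so a failed run leaves the watermark untouched and the next run safely retries the same window.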

Streaming load (continuous)

Data flows into Fabric as it’s generated — no waiting for a scheduled batch.

Streaming options in Fabric

| Tool | Latency | Best For |
| --- | --- | --- |
| Eventstreams | Seconds | Event Hub, Kafka, IoT Hub sources → KQL database, lakehouse, custom endpoints |
| Spark Structured Streaming | Seconds to minutes | Complex transformations on streaming data in notebooks |
| Mirroring | Near real-time (minutes) | Database replication without custom code |

Streaming to a lakehouse

Spark Structured Streaming writes to Delta tables in a lakehouse using micro-batches:

# PySpark — read from Event Hub, write to Delta table
from pyspark.sql.functions import from_json, col

# eh_config (Event Hubs connection settings) and schema (the event JSON
# structure) are assumed to be defined earlier in the notebook
stream_df = spark.readStream \
    .format("eventhubs") \
    .options(**eh_config) \
    .load()

# Parse and transform
parsed = stream_df.select(
    from_json(col("body").cast("string"), schema).alias("data")
).select("data.*")

# Write as streaming Delta table
parsed.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/checkpoints/orders") \
    .start("Tables/FactOrders_Streaming")
💡 Exam tip: Checkpoint location

Every streaming write needs a checkpoint location — a folder where Spark tracks which data has been processed. Without it, restarting the stream would reprocess everything from the beginning.

Exam pattern: “Streaming notebook restarts and reprocesses all data” → missing or incorrect checkpoint location.
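The mechanism can be mimicked without Spark: persist the last processed offset, and on restart resume from it instead of from zero. A sketch of the idea (not the actual Spark checkpoint format):

```python
# Why checkpoints matter: without a saved offset a restarted consumer
# replays the whole stream; with one, it resumes where it left off.

class Checkpoint:
    def __init__(self):
        self.offset = 0            # in Spark this state lives in the checkpoint folder

    def save(self, offset: int):
        self.offset = offset

def consume(events, checkpoint):
    """Process only events after the checkpointed offset, then advance it."""
    new = events[checkpoint.offset:]
    checkpoint.save(len(events))
    return new

events = ["e1", "e2", "e3"]
cp = Checkpoint()
consume(events, cp)                # first run processes e1..e3
events.append("e4")
print(consume(events, cp))         # after a "restart": ['e4'], no replay
```

Delete the checkpoint (or point the stream at a fresh location) and the consumer starts from offset zero again, which is exactly the duplicate-reprocessing symptom the exam pattern describes.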

Handling late-arriving data

In real-world systems, data doesn’t always arrive on time. An order placed at 11:55 PM might not reach the data platform until 1:30 AM the next day.

| Strategy | How It Works | Trade-off |
| --- | --- | --- |
| Watermark with buffer | Load data where modified_date > @lastLoad - 2 hours (overlap window) | Simple; may reprocess some rows (MERGE handles deduplication) |
| Event time vs processing time | Use the event's business timestamp, not when it arrived | Requires the source to include a reliable business timestamp |
| Streaming watermark | Spark watermark drops data arriving more than N minutes late | Balances completeness vs memory usage |
| Reprocessing window | Re-run the pipeline for the previous 2 days every night | Catches late arrivals but increases compute cost |
ℹ️ Scenario: Anika's late-arriving orders

ShopStream processes orders from 6 payment gateways. Some gateways batch-send confirmations with a 2-hour delay. Anika’s solution:

  1. Incremental load uses a watermark with a 2-hour buffer: WHERE order_date > @lastLoad - INTERVAL 2 HOURS
  2. MERGE handles deduplication — if an order was already loaded, it updates instead of inserting a duplicate
  3. For the streaming pipeline, she sets a Spark watermark: .withWatermark("event_time", "2 hours") — events arriving more than 2 hours late are dropped

The buffer window means some rows are processed twice, but MERGE ensures the table stays clean.
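Anika's buffer-plus-MERGE combination can be sketched in plain Python: the overlapping window re-reads some rows, and keying the target by order_id keeps the reprocessing harmless (names and hour-based timestamps are illustrative, not ShopStream's actual code):

```python
# Watermark buffer for late arrivals: re-read a 2-hour overlap window and
# rely on keyed upserts (MERGE-style) to deduplicate the overlap.

BUFFER = 2  # hours of overlap before the last watermark

def load_with_buffer(source, target, last_load_hour):
    window_start = last_load_hour - BUFFER
    picked = [r for r in source if r["modified_hour"] > window_start]
    for r in picked:
        target[r["order_id"]] = r   # upsert: duplicates overwrite, never append
    return len(picked), len(target)

source = [
    {"order_id": "A", "modified_hour": 21},  # already loaded in the last run
    {"order_id": "B", "modified_hour": 23},  # late arrival caught by the buffer
]
target = {"A": source[0]}
picked, total = load_with_buffer(source, target, last_load_hour=22)
print(picked, total)  # 2 2 — A is reprocessed, yet the table still has 2 rows
```

Row A falls inside the overlap and gets processed twice, but because the write is a keyed upsert rather than an append, the target never accumulates duplicates.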

Choosing the right pattern

| Factor | Full | Incremental | Streaming |
| --- | --- | --- | --- |
| Data volume | Small (under 1M rows) | Large (millions+) | Continuous flow |
| Freshness requirement | Daily/weekly is fine | Daily/hourly | Seconds to minutes |
| Source supports CDC? | Doesn't matter | Strongly preferred | Required (event/change feed) |
| Complexity | Low | Medium | High |
| Compute cost | High per run (but simple) | Low per run | Continuous (always running) |
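The table above can be read as a rough decision rule. One hedged encoding of it (the thresholds, 1M rows and minute-level freshness, are illustrative defaults, not exam-official cutoffs):

```python
# Heuristic pattern chooser based on the factors in the table above.

def choose_pattern(row_count: int, freshness_minutes: int,
                   has_change_tracking: bool) -> str:
    if freshness_minutes < 60:
        return "streaming"                  # needs seconds-to-minutes latency
    if row_count < 1_000_000 or not has_change_tracking:
        return "full"                       # small table, or no way to detect changes
    return "incremental"                    # large table with a usable watermark/CDC

print(choose_pattern(5_000, 60 * 24 * 7, False))    # full        (small reference table)
print(choose_pattern(500_000_000, 60 * 24, True))   # incremental (large fact table)
```

Real scenarios add nuance (source load tolerance, cost ceilings, mirroring availability), but this captures the order in which the exam expects you to weigh the factors: freshness first, then volume and change tracking.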

Question

What is a watermark column in incremental loading?


Answer

A column (usually a timestamp like modified_date or an auto-increment ID) used to identify which rows have changed since the last load. The pipeline filters: WHERE modified_date > @lastLoadDate.


Question

Why does streaming to Delta require a checkpoint location?


Answer

The checkpoint tracks which data has been processed. Without it, restarting the stream would reprocess everything from the beginning. The checkpoint folder stores offsets, committed batch IDs, and watermark state.


Question

How does a watermark buffer handle late-arriving data?


Answer

Instead of loading WHERE date > @lastLoad, you use WHERE date > @lastLoad - 2 hours. This overlapping window catches data that arrived late. MERGE prevents duplicates — if a row was already loaded, it updates instead of inserting again.



Knowledge Check

Precision Manufacturing's DimProduct table has 5,000 rows. It changes infrequently (a few products added per week). Carlos needs to keep the lakehouse copy in sync. Which loading pattern is most appropriate?

Knowledge Check

A streaming pipeline writes orders to a Delta table. After a Spark cluster restart, the pipeline reprocesses all orders from the beginning, creating duplicates. What is the most likely cause?

Knowledge Check

Anika's daily incremental pipeline loads orders using a watermark. She notices that orders placed at 11:50 PM are sometimes missing from the next day's load (which filters for modified_date > yesterday 12:00 AM). What should she change?


Next up: Dimensional Modeling: Prep for Analytics — design star schemas and slowly changing dimensions for your lakehouse.



© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.