
DP-700 Study Guide

Domain 1: Implement and Manage an Analytics Solution

  • Workspace Settings: Your Fabric Foundation
  • Version Control: Git in Fabric
  • Deployment Pipelines: Dev to Production
  • Access Controls: Who Gets In
  • Data Security: Control Who Sees What
  • Governance: Labels, Endorsement & Audit
  • Orchestration: Pick the Right Tool
  • Pipeline Patterns: Parameters & Expressions

Domain 2: Ingest and Transform Data

  • Delta Lake: The Heart of Fabric
  • Loading Patterns: Full, Incremental & Streaming
  • Dimensional Modeling: Prep for Analytics
  • Data Stores & Tools: Make the Right Choice
  • OneLake Shortcuts: Data Without Duplication
  • Mirroring: Real-Time Database Replication
  • PySpark Transformations: Code Your Pipeline
  • Transform Data with SQL & KQL
  • Eventstreams & Spark Streaming: Real-Time Ingestion
  • Real-Time Intelligence: KQL & Windowing

Domain 3: Monitor and Optimize an Analytics Solution

  • Monitoring & Alerts: Catch Problems Early
  • Troubleshoot Pipelines & Dataflows
  • Troubleshoot Notebooks & SQL
  • Troubleshoot Streaming & Shortcuts
  • Optimize Lakehouse Tables: Delta Tuning
  • Optimize Spark: Speed Up Your Code
  • Optimize Pipelines & Warehouses
  • Optimize Streaming: Real-Time Performance

Domain 2: Ingest and Transform Data (~14 min read)

Eventstreams & Spark Streaming: Real-Time Ingestion

Process streaming data as it arrives using Fabric Eventstreams and Spark Structured Streaming. Choose the right streaming engine for your workload.

Streaming in Fabric

☕ Simple explanation

Think of batch loading as a postal service — you get a delivery once a day. Streaming is a phone call — you hear every word the moment it’s spoken.

Fabric offers two streaming tools: Eventstreams (a managed service that captures events from sources like Event Hubs and routes them to destinations) and Spark Structured Streaming (code in a notebook that processes events with PySpark transformations).

Use Eventstreams when you want a visual, no-code pipeline for routing events. Use Spark Structured Streaming when you need complex transformations on the stream.

Eventstreams in Fabric Real-Time Intelligence is a managed event processing service that ingests data from event sources (Azure Event Hubs, IoT Hub, custom apps, CDC feeds), applies optional transformations, and routes events to destinations (KQL databases, lakehouses, custom endpoints, derived streams). It uses a visual canvas with drag-and-drop operators.

Spark Structured Streaming processes streaming data in PySpark notebooks using micro-batch or continuous processing modes. It reads from Event Hubs, Kafka, or file-based streams, applies DataFrame transformations, and writes to Delta Lake tables with exactly-once semantics via checkpointing.
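Checkpointing is what underpins the exactly-once write guarantee. A minimal pure-Python sketch of the idea, using a toy offset log rather than Spark's real checkpoint format: each micro-batch is read from the last committed offset, written to the sink, and only then is the offset committed, so a crash before the commit replays the batch instead of skipping it.

```python
# Toy sketch of exactly-once via checkpointing (a simplified offset log,
# NOT Spark's actual checkpoint format).

def run_stream(log, checkpoint, sink, batch_size=3):
    offset = checkpoint.get("offset", 0)
    while offset < len(log):
        batch = log[offset:offset + batch_size]   # read one micro-batch
        sink.extend(batch)                        # write results to the sink
        offset += len(batch)
        checkpoint["offset"] = offset             # commit only after the write

log = list(range(7))           # 7 events waiting in the source
checkpoint, sink = {}, []
run_stream(log, checkpoint, sink)
# sink now holds all 7 events; checkpoint["offset"] == 7
```

Restarting with the same checkpoint dict resumes at offset 7 and processes nothing twice, which is the behaviour the `checkpointLocation` option provides for real streams.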

Choosing a streaming engine

Eventstreams for routing; Spark Structured Streaming for complex transforms
| Factor | Eventstreams | Spark Structured Streaming |
|---|---|---|
| Interface | Visual canvas (drag-and-drop) | Code (PySpark notebook) |
| Coding required? | No (built-in operators) | Yes (Python/Spark) |
| Transformations | Filter, project, aggregate, join (built-in operators) | Any PySpark transformation (joins, ML, complex logic) |
| Latency | Sub-second to seconds | Seconds to minutes (micro-batch) |
| Destinations | KQL database, lakehouse, derived streams, custom endpoints | Lakehouse (Delta tables) |
| Best for | Routing events, simple transforms, multi-destination fanout | Complex transformations, ML scoring, stateful processing |
| Monitoring | Built-in metrics and health dashboard | Spark UI + custom logging |
| Exactly-once? | At-least-once (dedup at destination) | Exactly-once with checkpointing |
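Because Eventstreams delivery is at-least-once, the same event can arrive twice and the destination must make repeated writes harmless. A small illustrative sketch of deduplicating on a stable event key (`event_id` is an assumed field name here, not a Fabric-defined one):

```python
# Dedup at the destination: redelivered events become no-ops.
# "event_id" is an assumed stable key on each event, for illustration.

def upsert_dedup(events, store):
    for e in events:
        store.setdefault(e["event_id"], e)   # first delivery wins; repeats ignored

store = {}
upsert_dedup([
    {"event_id": "a", "amount": 10},
    {"event_id": "a", "amount": 10},   # duplicate delivery of the same event
    {"event_id": "b", "amount": 25},
], store)
# store holds 2 unique events despite 3 deliveries
```

The same idea appears in practice as an idempotent upsert or `MERGE` keyed on the event identifier at the KQL or Delta destination.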

Eventstreams

How Eventstreams work

Source (Event Hub, IoT Hub, CDC)
  → Eventstream (filter, transform, route)
    → Destination 1: KQL Database (real-time queries)
    → Destination 2: Lakehouse (Delta table for batch analytics)
    → Destination 3: Derived stream (feed another Eventstream)

Built-in operators

| Operator | What It Does | Example |
|---|---|---|
| Filter | Keep only events matching a condition | Keep only event_type == "purchase" |
| Manage fields | Select, rename, or remove columns | Drop PII columns before routing to analytics |
| Group by | Aggregate events in time windows | Count events per minute |
| Union | Combine multiple streams | Merge orders from 3 regional Event Hubs |
| Expand | Flatten nested JSON arrays | Expand items[] array into individual rows |
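To make the Filter and Expand operators concrete, here is what they do expressed as plain Python over JSON-like events (illustrative only; in Fabric these are configured on the visual canvas, not written as code):

```python
# Plain-Python sketch of two Eventstream operators.

events = [
    {"event_type": "purchase",  "items": [{"sku": "A"}, {"sku": "B"}]},
    {"event_type": "heartbeat", "items": []},
]

# Filter: keep only events matching a condition
purchases = [e for e in events if e["event_type"] == "purchase"]

# Expand: flatten the nested items[] array into one row per element
rows = [{"event_type": e["event_type"], "sku": item["sku"]}
        for e in purchases
        for item in e["items"]]
# rows -> one row per item: sku "A" and sku "B"
```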
💡 Scenario: Zoe's clickstream pipeline

WaveMedia generates 2 million playback events per minute. Zoe builds an Eventstream:

  1. Source: Azure Event Hub receiving playback events
  2. Filter: Keep only event_type IN ("play", "pause", "complete") — drop heartbeat noise
  3. Manage fields: Remove user_ip (PII) before analytics
  4. Destinations:
    • KQL Database → real-time dashboard (what’s being watched right now?)
    • Lakehouse → Delta table for daily batch analysis (weekly trends)

Two destinations from one stream. The KQL database answers “what’s hot right now?” while the lakehouse answers “what was popular this week?”

Spark Structured Streaming

Reading from a stream

# Read from an Event Hub (requires the Azure Event Hubs Spark connector;
# connection_string holds the hub's connection string, e.g. from a secret store)
eh_config = {
    "eventhubs.connectionString": sc._jvm.org.apache.spark
        .eventhubs.EventHubsUtils.encrypt(connection_string),
    "eventhubs.consumerGroup": "$Default"
}

stream_df = spark.readStream \
    .format("eventhubs") \
    .options(**eh_config) \
    .load()

Transforming streaming data

from pyspark.sql.functions import from_json, col, window, sum as spark_sum
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

# Parse JSON payload
schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("timestamp", TimestampType()),
    StructField("region", StringType())
])

parsed = stream_df.select(
    from_json(                                     # Parse the JSON body
        col("body").cast("string"), schema
    ).alias("data")
).select("data.*")                                 # Flatten into columns

# Windowed aggregation: revenue per region per 5-minute window
windowed = (
    parsed
    .withWatermark("timestamp", "10 minutes")      # Late data tolerance
    .groupBy(
        window("timestamp", "5 minutes"),          # 5-min tumbling window
        "region"
    )
    .agg(spark_sum("amount").alias("window_revenue"))
)
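Conceptually, `window("timestamp", "5 minutes")` assigns each event to a bucket by flooring its event time to a 300-second boundary. A pure-Python sketch of that bucketing (not Spark's implementation):

```python
# Tumbling-window assignment: floor the event time to the window boundary.

WINDOW_SECONDS = 300   # 5 minutes

def window_start(event_epoch_seconds):
    return event_epoch_seconds - (event_epoch_seconds % WINDOW_SECONDS)

# Events at t=10s and t=299s land in the [0, 300) window;
# an event at t=300s starts the next window.
starts = [window_start(t) for t in (10, 299, 300)]
# starts -> [0, 0, 300]
```

Because windows tumble (no overlap), every event belongs to exactly one window, which is what makes per-window aggregates like `window_revenue` additive.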

Writing to Delta Lake

# Write streaming results to a Delta table
query = windowed.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/checkpoints/order_revenue") \
    .start("Tables/OrderRevenue5Min")

query.awaitTermination()   # Keep the stream running

ℹ️ What's happening: Streaming watermarks

The .withWatermark("timestamp", "10 minutes") line tells Spark: “Accept events up to 10 minutes late. After that, drop them.”

Why it matters: In streaming, events don’t always arrive in order. A 10-minute watermark means Spark keeps state for the window until 10 minutes past the window’s end time. This balances completeness (catching late events) with memory usage (not keeping state forever).

Exam pattern: “Streaming aggregation is missing late events” → increase the watermark duration. “Streaming notebook is running out of memory” → the watermark might be too large.
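The dropping rule can be simulated in a few lines of plain Python. This is a toy model, assuming the watermark is simply "max event time seen so far minus the allowed delay"; it captures only the lateness check, not Spark's windowing or state cleanup:

```python
# Toy watermark simulation: events further behind the newest event time
# than delay_seconds are dropped.

def filter_late(event_times, delay_seconds):
    max_seen, kept = float("-inf"), []
    for ts in event_times:                     # ts = event time, epoch seconds
        max_seen = max(max_seen, ts)
        if ts >= max_seen - delay_seconds:     # within allowed lateness
            kept.append(ts)
    return kept

# With a 10-minute (600s) watermark, an event 900s behind the newest is dropped:
kept = filter_late([1000, 1600, 700], delay_seconds=600)
# kept -> [1000, 1600]; the event at 700 arrived too late
```

Raising `delay_seconds` keeps more late events at the cost of holding window state longer, which is exactly the trade-off in the exam pattern above.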

Output modes

| Mode | Behaviour | Use With |
|---|---|---|
| Append | Only new rows are written to the sink | Non-aggregated streams, Delta tables |
| Complete | Entire result table is written on every trigger | Small aggregation results (not for large state) |
| Update | Only changed rows are written | Aggregations where you want just the updated values |
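The difference between update and complete is easiest to see on a running count. An illustrative sketch (not Spark's engine): update emits only the keys whose aggregate changed this trigger, while complete re-emits the whole result table.

```python
# Running count per key, emitting in "update" style.

state = {}

def process_trigger(batch):
    changed = {}
    for key in batch:
        state[key] = state.get(key, 0) + 1
        changed[key] = state[key]
    return changed                     # "update" mode: just the changed rows

first = process_trigger(["east", "east", "west"])
second = process_trigger(["east"])
# first  -> {'east': 2, 'west': 1}
# second -> {'east': 3}   ("west" unchanged, so not emitted)
# "complete" mode would re-emit all of state each trigger: {'east': 3, 'west': 1}
```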

Question

What is the key difference between Eventstreams and Spark Structured Streaming?


Answer

Eventstreams: visual, no-code, sub-second latency, routes events to multiple destinations (KQL, lakehouse). Spark Structured Streaming: code-based (PySpark), seconds-to-minutes latency, complex transformations and ML scoring, writes to Delta tables.


Question

What is a streaming watermark in Spark?


Answer

A watermark defines how late an event can arrive and still be included in a windowed aggregation. withWatermark('timestamp', '10 minutes') means events arriving more than 10 minutes late are dropped. Balances completeness vs memory usage.


Question

What is the difference between append and update output modes in streaming?


Answer

Append: only new rows are emitted (good for non-aggregated streams). Update: only changed rows are emitted (good for aggregations where you want incremental updates). Complete: the entire result table is rewritten on every trigger.



Knowledge Check

Zoe needs to route clickstream events to both a KQL database (real-time dashboard) and a lakehouse (weekly analytics) with simple filtering but no complex transformations. Which streaming tool should she use?

Knowledge Check

A Spark Structured Streaming job calculates 5-minute revenue windows. Late-arriving events (up to 15 minutes late) must be included. The current watermark is set to 5 minutes, and some late events are being dropped. What should the engineer change?


Next up: Real-Time Intelligence: KQL & Windowing — query streaming data with KQL, choose between native tables and shortcuts, and build windowing functions.



© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.