Eventstreams & Spark Streaming: Real-Time Ingestion
Process streaming data as it arrives using Fabric Eventstreams and Spark Structured Streaming. Choose the right streaming engine for your workload.
Streaming in Fabric
Think of batch loading as a postal service — you get a delivery once a day. Streaming is a phone call — you hear every word the moment it’s spoken.
Fabric offers two streaming tools: Eventstreams (a managed service that captures events from sources like Event Hubs and routes them to destinations) and Spark Structured Streaming (code in a notebook that processes events with PySpark transformations).
Use Eventstreams when you want a visual, no-code pipeline for routing events. Use Spark Structured Streaming when you need complex transformations on the stream.
Choosing a streaming engine
| Factor | Eventstreams | Spark Structured Streaming |
|---|---|---|
| Interface | Visual canvas (drag-and-drop) | Code (PySpark notebook) |
| Coding required? | No (built-in operators) | Yes (Python/Spark) |
| Transformations | Filter, project, aggregate, join (built-in operators) | Any PySpark transformation (joins, ML, complex logic) |
| Latency | Sub-second to seconds | Seconds to minutes (micro-batch) |
| Destinations | KQL database, lakehouse, derived streams, custom endpoints | Lakehouse (Delta tables) |
| Best for | Routing events, simple transforms, multi-destination fanout | Complex transformations, ML scoring, stateful processing |
| Monitoring | Built-in metrics and health dashboard | Spark UI + custom logging |
| Exactly-once? | At-least-once (dedup at destination) | Exactly-once with checkpointing |
Eventstreams
How Eventstreams work
Source (Event Hub, IoT Hub, CDC)
→ Eventstream (filter, transform, route)
→ Destination 1: KQL Database (real-time queries)
→ Destination 2: Lakehouse (Delta table for batch analytics)
→ Destination 3: Derived stream (feed another Eventstream)
Built-in operators
| Operator | What It Does | Example |
|---|---|---|
| Filter | Keep only events matching a condition | Keep only event_type == "purchase" |
| Manage fields | Select, rename, or remove columns | Drop PII columns before routing to analytics |
| Group by | Aggregate events in time windows | Count events per minute |
| Union | Combine multiple streams | Merge orders from 3 regional Event Hubs |
| Expand | Flatten nested JSON arrays | Expand items[] array into individual rows |
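To make the Expand operator concrete, here is a plain-Python sketch of its effect — one output row per element of a nested array. This is an illustration of the behaviour, not how Eventstreams implements it; the function name and sample fields are hypothetical.

```python
def expand_items(event):
    """Mimic the Expand operator: emit one row per element of items[]."""
    base = {k: v for k, v in event.items() if k != "items"}
    return [{**base, "item": item} for item in event.get("items", [])]

order = {"order_id": "A1", "items": [{"sku": "X"}, {"sku": "Y"}]}
rows = expand_items(order)
# Each row carries the parent fields (order_id) plus one array element
```

Every row inherits the parent event's columns, so downstream operators can aggregate per item without losing the order context.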
Scenario: Zoe's clickstream pipeline
WaveMedia generates 2 million playback events per minute. Zoe builds an Eventstream:
- Source: Azure Event Hub receiving playback events
- Filter: keep only event_type IN ("play", "pause", "complete"), dropping heartbeat noise
- Manage fields: remove user_ip (PII) before routing to analytics
- Destinations:
- KQL Database → real-time dashboard (what’s being watched right now?)
- Lakehouse → Delta table for daily batch analysis (weekly trends)
Two destinations from one stream. The KQL database answers “what’s hot right now?” while the lakehouse answers “what was popular this week?”
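The per-event logic of Zoe's Filter and Manage fields operators can be sketched in plain Python — a conceptual model of what the no-code operators do to each event, with a hypothetical function name, not Eventstreams internals.

```python
ALLOWED_EVENTS = {"play", "pause", "complete"}

def route_event(event):
    """Filter to allowed event types, then drop the PII field."""
    if event.get("event_type") not in ALLOWED_EVENTS:
        return None  # heartbeats and other noise are dropped
    return {k: v for k, v in event.items() if k != "user_ip"}

route_event({"event_type": "heartbeat"})  # dropped
route_event({"event_type": "play", "user_ip": "10.0.0.1", "title": "S1E1"})
# passes through with user_ip removed
```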
Spark Structured Streaming
Reading from a stream
# Read from Event Hub (requires the azure-event-hubs-spark connector)
eh_config = {
    # The connector expects the connection string in encrypted form
    "eventhubs.connectionString": sc._jvm.org.apache.spark
        .eventhubs.EventHubsUtils.encrypt(connection_string),
    "eventhubs.consumerGroup": "$Default",
}

stream_df = spark.readStream \
    .format("eventhubs") \
    .options(**eh_config) \
    .load()
Transforming streaming data
from pyspark.sql.functions import from_json, col, window, sum as sum_
from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, TimestampType
)

# Parse JSON payload
schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("timestamp", TimestampType()),
    StructField("region", StringType()),
])

parsed = stream_df.select(
    from_json(                      # Parse the JSON body
        col("body").cast("string"), schema
    ).alias("data")
).select("data.*")                  # Flatten into columns

# Windowed aggregation: revenue per region per 5-minute window
windowed = (
    parsed
    .withWatermark("timestamp", "10 minutes")  # Late-data tolerance
    .groupBy(
        window("timestamp", "5 minutes"),      # 5-min tumbling window
        "region",
    )
    .agg(sum_("amount").alias("window_revenue"))
)
Writing to Delta Lake
# Write streaming results to a Delta table
query = windowed.writeStream \
.format("delta") \
.outputMode("append") \
.option("checkpointLocation", "/checkpoints/order_revenue") \
.start("Tables/OrderRevenue5Min")
query.awaitTermination() # Keep the stream running
What's happening: Streaming watermarks
The .withWatermark("timestamp", "10 minutes") line tells Spark: “Accept events up to 10 minutes late. After that, drop them.”
Why it matters: In streaming, events don’t always arrive in order. A 10-minute watermark means Spark keeps state for the window until 10 minutes past the window’s end time. This balances completeness (catching late events) with memory usage (not keeping state forever).
Exam pattern: “Streaming aggregation is missing late events” → increase the watermark duration. “Streaming notebook is running out of memory” → the watermark might be too large.
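The watermark mechanics can be simulated in a few lines of plain Python: the watermark trails the maximum event time seen so far, and events older than it are dropped rather than added to window state. This is a simplified model under stated assumptions (tumbling windows, per-event watermark updates), not Spark's actual implementation.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)
WATERMARK = timedelta(minutes=10)

def tumbling_window_start(ts, width=WINDOW):
    """Floor an event time to the start of its tumbling window."""
    epoch = datetime(1970, 1, 1)
    return ts - ((ts - epoch) % width)

def process(events):
    """events: (event_time, region, amount) tuples in arrival order."""
    max_event_time = datetime.min
    state = {}      # (window_start, region) -> revenue
    dropped = []
    for ts, region, amount in events:
        max_event_time = max(max_event_time, ts)
        watermark = max_event_time - WATERMARK   # trails max event time
        if ts < watermark:
            dropped.append((ts, region, amount)) # too late, state discarded
            continue
        key = (tumbling_window_start(ts), region)
        state[key] = state.get(key, 0.0) + amount
    return state, dropped
```

Feeding in an event 16 minutes behind the newest event time shows it landing in `dropped` — exactly the "missing late events" symptom the exam pattern describes.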
Output modes
| Mode | Behaviour | Use With |
|---|---|---|
| Append | Only new rows are written to the sink | Non-aggregated streams, Delta tables |
| Complete | Entire result table is written on every trigger | Small aggregation results (not for large state) |
| Update | Only changed rows are written | Aggregations where you want just the updated values |
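The difference between the three modes can be sketched as a plain-Python function over the result table before and after one trigger. This is a simplified model (in particular, real Append mode waits for the watermark to finalize a window before emitting it), with a hypothetical function name.

```python
def emit(prev_state, new_state, mode):
    """Rows written to the sink on one trigger, per output mode (simplified)."""
    if mode == "complete":
        # Whole result table, every trigger
        return dict(new_state)
    if mode == "update":
        # Only rows whose value changed (or that are new)
        return {k: v for k, v in new_state.items()
                if prev_state.get(k) != v}
    if mode == "append":
        # Simplification: keys not seen before stand in for newly
        # finalized rows; real Append waits for the watermark
        return {k: v for k, v in new_state.items() if k not in prev_state}
    raise ValueError(f"unknown mode: {mode}")
```

With prev = {"eu": 15.0, "us": 7.0} and new = {"eu": 18.0, "us": 7.0, "apac": 2.0}: Complete writes all three rows, Update writes only eu and apac, and Append writes only apac.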
Zoe needs to route clickstream events to both a KQL database (real-time dashboard) and a lakehouse (weekly analytics) with simple filtering but no complex transformations. Which streaming tool should she use?
A Spark Structured Streaming job calculates 5-minute revenue windows. Late-arriving events (up to 15 minutes late) must be included. The current watermark is set to 5 minutes, and some late events are being dropped. What should the engineer change?
🎬 Video coming soon
Next up: Real-Time Intelligence: KQL & Windowing — query streaming data with KQL, choose between native tables and shortcuts, and build windowing functions.