
DP-750 Study Guide

Domain 1: Set Up and Configure an Azure Databricks Environment

  • Azure Databricks: Your Lakehouse Platform Free
  • Choosing the Right Compute Free
  • Configuring Compute for Performance Free
  • Unity Catalog: The Three-Level Namespace Free
  • Tables, Views & External Catalogs Free

Domain 2: Secure and Govern Unity Catalog Objects

  • Securing Unity Catalog: Who Gets What
  • Secrets & Authentication
  • Data Discovery & Attribute-Based Access
  • Row Filters, Column Masks & Retention
  • Lineage, Audit Logs & Delta Sharing

Domain 3: Prepare and Process Data

  • Data Modeling: Ingestion Design Free
  • SCD, Granularity & Temporal Tables
  • Partitioning, Clustering & Table Optimization
  • Ingesting Data: Lakeflow Connect & Notebooks
  • Ingesting Data: SQL Methods & CDC
  • Streaming Ingestion: Structured Streaming & Event Hubs
  • Auto Loader & Declarative Pipelines
  • Cleansing & Profiling Data Free
  • Transforming & Loading Data
  • Data Quality & Schema Enforcement

Domain 4: Deploy and Maintain Data Pipelines and Workloads

  • Building Data Pipelines Free
  • Lakeflow Jobs: Create & Configure
  • Lakeflow Jobs: Schedule, Alerts & Recovery
  • Git & Version Control
  • Testing & Databricks Asset Bundles
  • Monitoring Clusters & Troubleshooting
  • Spark Performance: DAG & Query Profile
  • Optimizing Delta Tables & Azure Monitor

Domain 3: Prepare and Process Data Premium ⏱ ~14 min read

Streaming Ingestion: Structured Streaming & Event Hubs

Process real-time data with Spark Structured Streaming and ingest from Azure Event Hubs — the streaming foundations for DP-750.

Spark Structured Streaming

☕ Simple explanation

Structured Streaming treats a data stream as an ever-growing table.

Imagine a notebook on your desk that keeps getting new pages added to the bottom. Each time you look, you process only the new pages. That’s Structured Streaming — it remembers where it left off (checkpoint) and only processes new data.

You write streaming code almost identically to batch code. The magic is in readStream and writeStream — Spark handles the continuous processing loop for you.

Spark Structured Streaming processes data incrementally as it arrives, treating the stream as an unbounded table. It provides exactly-once processing guarantees through checkpointing and write-ahead logs. The API mirrors batch DataFrames — you use the same transformations (filter, join, aggregate) but with readStream/writeStream.

Core streaming pattern

from pyspark.sql.functions import current_timestamp

# Read from a streaming source
stream_df = (spark.readStream
    .format("delta")
    .table("bronze.raw_transactions"))

# Transform (same API as batch)
clean_df = (stream_df
    .filter("amount > 0")
    .withColumn("processed_at", current_timestamp()))

# Write to a streaming sink
query = (clean_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/checkpoints/silver_txn")
    .trigger(availableNow=True)  # process all available data, then stop
    .toTable("silver.valid_transactions"))

Trigger modes

Trigger                     | Behaviour                                | Use case
----------------------------|------------------------------------------|-------------------------------------------------
Default (no trigger)        | Processes micro-batches continuously     | True real-time processing
processingTime="10 seconds" | Triggers every 10 seconds                | Near-real-time with controlled resource use
availableNow=True           | Processes all available data, then stops | Scheduled incremental batch
once=True                   | Processes one micro-batch, then stops    | One-shot catch-up (deprecated; use availableNow)

Tomás uses default continuous for NovaPay’s fraud detection (every millisecond counts). Ravi uses availableNow for DataPulse’s hourly incremental loads (batch-like but with streaming guarantees).
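These trigger options are keyword arguments to `.trigger()`. As a minimal sketch (the helper name and use-case labels are hypothetical, purely for illustration), you could map workload styles to the kwargs you would pass:

```python
def trigger_kwargs(use_case: str) -> dict:
    """Map a workload style to Structured Streaming trigger kwargs.
    Pass the result as writeStream.trigger(**trigger_kwargs(...))."""
    mapping = {
        "real_time": {},                                   # default: continuous micro-batches
        "near_real_time": {"processingTime": "10 seconds"},
        "scheduled_incremental": {"availableNow": True},
    }
    return mapping[use_case]

# Ravi's hourly incremental load:
print(trigger_kwargs("scheduled_incremental"))  # {'availableNow': True}
```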

Checkpoints

The checkpoint location is critical — it tracks which data has been processed:

  • Stores offsets (where in the stream you are)
  • Stores committed batch IDs
  • Enables exactly-once processing — if the job crashes, it resumes from the last checkpoint
  • Never delete checkpoints unless you want to reprocess all data
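To see why the checkpoint matters, here is a toy, pure-Python sketch of the idea (not Databricks internals): an offset is committed to a file, and a "restarted" job resumes from it instead of reprocessing everything.

```python
import json
import os
import tempfile

def process_incrementally(records, checkpoint_path):
    """Process only records beyond the last committed offset, then
    commit the new offset -- a toy model of a streaming checkpoint."""
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            offset = json.load(f)["offset"]
    new_records = records[offset:]          # only unseen data
    with open(checkpoint_path, "w") as f:
        json.dump({"offset": len(records)}, f)
    return new_records

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
print(process_incrementally([1, 2, 3], ckpt))         # [1, 2, 3] on the first run
print(process_incrementally([1, 2, 3, 4, 5], ckpt))   # [4, 5] after a "restart"
```

Deleting `ckpt` here is the analogue of deleting the checkpoint location: the next run starts from offset zero and reprocesses everything.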

Ingesting from Azure Event Hubs

Azure Event Hubs is a managed streaming platform. Tomás uses it at NovaPay to ingest real-time transaction events:

# Connection configuration
# Note: recent versions of the azure-event-hubs-spark connector expect the
# connection string to be encrypted with EventHubsUtils.encrypt before use
eh_conf = {
    "eventhubs.connectionString":
        dbutils.secrets.get("kv-prod", "eventhub-connection-string"),
    "eventhubs.consumerGroup": "fraud-detection-group",
    "maxEventsPerTrigger": 10000
}

# Read from Event Hubs
eh_stream = (spark.readStream
    .format("eventhubs")
    .options(**eh_conf)
    .load())

# Event Hubs data arrives as binary — decode it
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

txn_schema = StructType() \
    .add("txn_id", StringType()) \
    .add("amount", DoubleType()) \
    .add("merchant", StringType()) \
    .add("timestamp", TimestampType())

decoded = (eh_stream
    .select(from_json(col("body").cast("string"), txn_schema).alias("data"))
    .select("data.*"))

# Write to Delta table
query = (decoded.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/checkpoints/eh_transactions")
    .toTable("bronze.realtime_transactions"))

ℹ️ Event Hubs key concepts

Concept        | Purpose
---------------|-------------------------------------------------------------
Namespace      | Container for Event Hubs (like a server)
Event Hub      | The actual stream topic
Consumer group | Allows multiple readers of the same stream
Partition      | Enables parallel reading (more partitions = more throughput)
Body           | The message payload (usually JSON, arrives as binary bytes)

Exam tip: Event Hubs data arrives as binary in the body column. You must CAST to string and parse with from_json. Forgetting this step is a common exam trap.
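The cast-and-parse step works the same way outside Spark. In plain Python, the equivalent of casting `body` to string and applying `from_json` is decoding the bytes and calling `json.loads` (the payload below is made up for illustration):

```python
import json

# A binary payload, as Event Hubs would deliver it in the 'body' column
raw_body = b'{"txn_id": "t-1", "amount": 42.5, "merchant": "NovaPay"}'

# Step 1: cast binary -> string; step 2: parse JSON -> structured data
decoded = json.loads(raw_body.decode("utf-8"))
print(decoded["amount"])  # 42.5
```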

Question

What is a checkpoint in Structured Streaming and why is it critical?


Answer

A checkpoint tracks which data has been processed (offsets + committed batches). It enables exactly-once processing — if the job crashes, it resumes from the checkpoint. Never delete checkpoints unless reprocessing is intended.


Question

What trigger mode should you use for scheduled incremental loads?


Answer

availableNow=True — processes all available data since the last checkpoint, then stops. Gives batch-like scheduling with streaming's exactly-once guarantees.


Question

How does Event Hubs data arrive in Spark and what must you do?


Answer

Event Hubs data arrives as binary bytes in the 'body' column. You must cast to string and parse with from_json() using the expected schema. Forgetting the cast/parse step is a common exam trap.



Knowledge Check

Ravi needs an hourly incremental load that processes only new data from a Delta source table, with exactly-once guarantees. The job should stop after processing, not run continuously. Which approach should he use?


Next up: Auto Loader & Declarative Pipelines — scalable file ingestion and Lakeflow Spark Declarative Pipelines.


Guided

I learn, I simplify, I share.


© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.