Monitoring & Alerts: Catch Problems Early
Use the Fabric Monitoring Hub to track data ingestion, transformations, and semantic model refreshes. Configure alerts to catch failures before users notice.
Why monitor?
Think of a factory floor with cameras and alarms.
Without cameras, you don't know a machine has stopped until the production line backs up. Without alarms, you find out about a spill when someone slips. Monitoring gives you visibility (cameras) and early warning (alarms).
Fabric's Monitoring Hub is those cameras: you see every pipeline run, notebook execution, dataflow refresh, and semantic model refresh. Alerts are the alarms: they notify you when something fails or takes too long.
The Monitoring Hub
| Item Type | Metrics Shown |
|---|---|
| Pipeline runs | Status (succeeded/failed/in progress), duration, start time, activity-level details |
| Notebook executions | Spark job status, duration, resource usage (vCores, memory) |
| Dataflow Gen2 refreshes | Refresh status, duration, rows processed, errors |
| Semantic model refresh | Refresh status, duration, tables refreshed, partition details |
| Spark jobs | Job status, stages, tasks, shuffle read/write, executor metrics |
| Eventstream health | Events ingested, processing lag, error rate |
Key monitoring areas
Ingestion: Watch pipeline durations, row counts (zero rows = source issue), Dataflow connector errors, Eventstream processing lag.
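The zero-rows check above is easy to automate. Here is a minimal sketch in plain Python; the `run` dict shape is a hypothetical stand-in for the run metadata Fabric exposes, not an actual API response:

```python
def check_ingestion_run(run: dict, max_duration_s: int = 900) -> list[str]:
    """Flag common ingestion problems from a pipeline run summary.

    `run` is a hypothetical dict of run metadata, e.g.
    {"status": "Succeeded", "rows_copied": 0, "duration_s": 120}.
    """
    issues = []
    if run["status"] != "Succeeded":
        issues.append(f"run ended with status {run['status']}")
    # Zero rows on a "successful" run usually means the source query
    # matched nothing -- a silent source issue worth alerting on.
    if run.get("rows_copied", 0) == 0:
        issues.append("0 rows copied: possible source issue")
    if run.get("duration_s", 0) > max_duration_s:
        issues.append(f"duration {run['duration_s']}s exceeds {max_duration_s}s limit")
    return issues
```

A run that "succeeds" with zero rows would come back flagged, which is exactly the case a status-only check misses.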
Transformation: Watch notebook execution time, Spark stage failures, memory usage (90%+ is danger zone), shuffle distribution (skewed = one executor overloaded).
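"Skewed = one executor overloaded" can be quantified from the executor metrics the Monitoring Hub shows. A simple sketch (the input list is assumed to be per-executor shuffle bytes you read off the Spark job details):

```python
def shuffle_skew_ratio(shuffle_bytes_per_executor: list[int]) -> float:
    """Ratio of the busiest executor's shuffle volume to the mean.

    A ratio near 1.0 means an even shuffle; a large ratio means one
    executor is doing most of the work (typically a skewed join or
    group-by key).
    """
    mean = sum(shuffle_bytes_per_executor) / len(shuffle_bytes_per_executor)
    return max(shuffle_bytes_per_executor) / mean
```

For `[100, 100, 100, 900]` the ratio is 3.0: one executor is handling three times the average load, which matches the "one executor overloaded" symptom.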
Semantic model refresh: Watch refresh duration vs schedule interval, partition refresh behaviour (full when incremental expected), memory limit errors.
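The "duration vs schedule interval" comparison is worth encoding as a rule of thumb. A sketch, assuming a hypothetical 80% safety factor (refreshes that consume most of their interval start to overlap or queue):

```python
def refresh_fits_schedule(duration_min: float, interval_min: float,
                          safety_factor: float = 0.8) -> bool:
    """True if a refresh comfortably fits its schedule interval.

    A refresh taking longer than ~80% of its interval is an early
    warning: the next scheduled run may start before this one ends.
    """
    return duration_min <= interval_min * safety_factor
```

A 12-minute refresh on an hourly schedule passes; a 55-minute refresh on the same schedule should trip an alert well before it actually overruns.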
Scenario: Zoe's monitoring routine
Zoe at WaveMedia checks the Monitoring Hub every morning:
- Eventstreams: Processing lag under 5 seconds? ✅
- Overnight notebooks: 3/4 succeeded, 1 failed at 2:47 AM (OOM error) → increases pool size
- Semantic model: Refreshed at 5 AM, duration 12 min (under 15 min SLA) ✅
Configuring alerts
| Alert Source | How It Works | Best For |
|---|---|---|
| Pipeline failure path | Add Teams/email activity on failure output | Immediate notification for ETL failures |
| Data Activator rules | Condition-based triggers on streaming data | Real-time SLA monitoring |
| Power BI alerts | Visual value crosses threshold | Business metric anomalies |
Scenario: Carlos's alert layers
Carlos configures three layers: (1) Pipeline failure → Teams channel post, (2) Eventstream lag > 60s → email on-call engineer, (3) Defect rate > 5% → alert quality manager via Power BI.
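Carlos's three layers can be sketched as a routing table in plain Python. The `event` dict and the channel names are illustrative placeholders, not real Fabric, Teams, or Power BI identifiers:

```python
def route_alerts(event: dict) -> list[str]:
    """Apply Carlos's three alert layers to a metrics snapshot.

    `event` is a hypothetical dict of current metrics; the returned
    strings are placeholder channel names for each layer.
    """
    notifications = []
    # Layer 1: any pipeline failure -> Teams channel post
    if event.get("pipeline_failed"):
        notifications.append("teams:data-alerts-channel")
    # Layer 2: eventstream lag over 60 seconds -> email the on-call engineer
    if event.get("eventstream_lag_s", 0) > 60:
        notifications.append("email:on-call-engineer")
    # Layer 3: defect rate over 5% -> Power BI alert to the quality manager
    if event.get("defect_rate", 0.0) > 0.05:
        notifications.append("powerbi-alert:quality-manager")
    return notifications
```

Keeping the layers independent means one bad night can fire all three, each to the audience that can act on it.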
Zoe's overnight notebook has been taking 45 min instead of 20 min. The Monitoring Hub shows high shuffle on one executor. What's the likely cause?
Carlos wants Teams notifications when pipelines fail. Where does he configure this?
🎬 Video coming soon
Next up: Troubleshoot Pipelines & Dataflows β identify and resolve the most common pipeline and Dataflow Gen2 errors.