Monitoring & Alerts: Catch Problems Early
Use the Fabric Monitoring Hub to track data ingestion, transformations, and semantic model refreshes. Configure alerts to catch failures before users notice.
Why monitor?
Think of a factory floor with cameras and alarms.
Without cameras, you don't know a machine has stopped until the production line backs up. Without alarms, you find out about a spill when someone slips. Monitoring gives you visibility (cameras) and early warning (alarms).
Fabric's Monitoring Hub is those cameras: you see every pipeline run, notebook execution, dataflow refresh, and semantic model refresh. Alerts are the alarms: they notify you when something fails or takes too long.
The Monitoring Hub
| Item Type | Metrics Shown |
|---|---|
| Pipeline runs | Status (succeeded/failed/in progress), duration, start time, activity-level details |
| Notebook executions | Spark job status, duration, resource usage (vCores, memory) |
| Dataflow Gen2 refreshes | Refresh status, duration, rows processed, errors |
| Semantic model refresh | Refresh status, duration, tables refreshed, partition details |
| Spark jobs | Job status, stages, tasks, shuffle read/write, executor metrics |
| Eventstream health | Events ingested, processing lag, error rate |
Key monitoring areas
Ingestion: Watch pipeline durations, row counts (zero rows = source issue), Dataflow connector errors, Eventstream processing lag.
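The zero-rows check above is easy to automate. Here is a minimal sketch in plain Python; the `run` dict shape is a hypothetical stand-in for the run metadata Fabric exposes, not an actual API response:

```python
def check_ingestion_run(run: dict, max_duration_s: int = 900) -> list[str]:
    """Flag common ingestion problems from a pipeline run summary.

    `run` is a hypothetical dict of run metadata, e.g.
    {"status": "Succeeded", "rows_copied": 0, "duration_s": 120}.
    """
    issues = []
    if run["status"] != "Succeeded":
        issues.append(f"run ended with status {run['status']}")
    # Zero rows on a "successful" run usually means the source query
    # matched nothing -- a silent source issue worth alerting on.
    if run.get("rows_copied", 0) == 0:
        issues.append("0 rows copied: possible source issue")
    if run.get("duration_s", 0) > max_duration_s:
        issues.append(f"duration {run['duration_s']}s exceeds {max_duration_s}s limit")
    return issues
```

A run that "succeeds" with zero rows would come back flagged, which is exactly the case a status-only check misses.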
Transformation: Watch notebook execution time, Spark stage failures, memory usage (90%+ is danger zone), shuffle distribution (skewed = one executor overloaded).
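"Skewed = one executor overloaded" can be quantified from the executor metrics the Monitoring Hub shows. A simple sketch (the input list is assumed to be per-executor shuffle bytes you read off the Spark job details):

```python
def shuffle_skew_ratio(shuffle_bytes_per_executor: list[int]) -> float:
    """Ratio of the busiest executor's shuffle volume to the mean.

    A ratio near 1.0 means an even shuffle; a large ratio means one
    executor is doing most of the work (typically a skewed join or
    group-by key).
    """
    mean = sum(shuffle_bytes_per_executor) / len(shuffle_bytes_per_executor)
    return max(shuffle_bytes_per_executor) / mean
```

For `[100, 100, 100, 900]` the ratio is 3.0: one executor is handling three times the average load, which matches the "one executor overloaded" symptom.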
Semantic model refresh: Watch refresh duration vs schedule interval, partition refresh behaviour (full when incremental expected), memory limit errors.
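The "duration vs schedule interval" comparison is worth encoding as a rule of thumb. A sketch, assuming a hypothetical 80% safety factor (refreshes that consume most of their interval start to overlap or queue):

```python
def refresh_fits_schedule(duration_min: float, interval_min: float,
                          safety_factor: float = 0.8) -> bool:
    """True if a refresh comfortably fits its schedule interval.

    A refresh taking longer than ~80% of its interval is an early
    warning: the next scheduled run may start before this one ends.
    """
    return duration_min <= interval_min * safety_factor
```

A 12-minute refresh on an hourly schedule passes; a 55-minute refresh on the same schedule should trip an alert well before it actually overruns.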
Scenario: Zoe's monitoring routine
Zoe at WaveMedia checks the Monitoring Hub every morning:
- Eventstreams: Processing lag under 5 seconds? ✅
- Overnight notebooks: 3/4 succeeded, 1 failed at 2:47 AM (OOM error) → increases pool size
- Semantic model: Refreshed at 5 AM, duration 12 min (under 15 min SLA) ✅
Configuring alerts
| Alert Source | How It Works | Best For |
|---|---|---|
| Pipeline failure path | Add Teams/email activity on failure output | Immediate notification for ETL failures |
| Data Activator rules | Condition-based triggers on streaming data | Real-time SLA monitoring |
| Power BI alerts | Visual value crosses threshold | Business metric anomalies |
Scenario: Carlos's alert layers
Carlos configures three layers: (1) Pipeline failure → Teams channel post, (2) Eventstream lag > 60s → email on-call engineer, (3) Defect rate > 5% → alert quality manager via Power BI.
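Carlos's three layers can be sketched as a routing table in plain Python. The `event` dict and the channel names are illustrative placeholders, not real Fabric, Teams, or Power BI identifiers:

```python
def route_alerts(event: dict) -> list[str]:
    """Apply Carlos's three alert layers to a metrics snapshot.

    `event` is a hypothetical dict of current metrics; the returned
    strings are placeholder channel names for each layer.
    """
    notifications = []
    # Layer 1: any pipeline failure -> Teams channel post
    if event.get("pipeline_failed"):
        notifications.append("teams:data-alerts-channel")
    # Layer 2: eventstream lag over 60 seconds -> email the on-call engineer
    if event.get("eventstream_lag_s", 0) > 60:
        notifications.append("email:on-call-engineer")
    # Layer 3: defect rate over 5% -> Power BI alert to the quality manager
    if event.get("defect_rate", 0.0) > 0.05:
        notifications.append("powerbi-alert:quality-manager")
    return notifications
```

Keeping the layers independent means one bad night can fire all three, each to the audience that can act on it.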
Zoe's overnight notebook has been taking 45 min instead of 20 min. The Monitoring Hub shows high shuffle on one executor. What's the likely cause?
Carlos wants Teams notifications when pipelines fail. Where does he configure this?
🎬 Video coming soon
Next up: Troubleshoot Pipelines & Dataflows β identify and resolve the most common pipeline and Dataflow Gen2 errors.