Monitoring Clusters & Troubleshooting
Monitor cluster consumption, troubleshoot Lakeflow Jobs, and diagnose Spark job failures: the operational skills that keep production running.
Monitoring cluster consumption
Monitoring is like reading the dashboard gauges while driving: speed (throughput), fuel level (cost), engine temperature (resource usage). If you don't check the gauges, you run out of fuel (budget) or overheat (OOM errors) without warning.
Key metrics to monitor
| Metric | Where to Find | What It Tells You |
|---|---|---|
| DBU consumption | Account console → Usage | Cost by workspace, cluster, job |
| CPU utilisation | Cluster UI → Metrics | Whether you're under- or over-provisioned |
| Memory usage | Cluster UI → Metrics | Risk of OOM errors |
| Spill to disk | Spark UI → Stages | Memory pressure (data doesn't fit in RAM) |
| Job duration trends | Job run history | Performance degradation over time |
| Cluster idle time | Compute UI | Wasted spend on idle clusters |
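The spill-to-disk metric is the clearest memory-pressure signal in the table. A minimal sketch of how you might reason about it, comparing spill volume to input volume (the function and thresholds are illustrative, not a Databricks API):

```python
def memory_pressure(input_bytes: int, spill_bytes: int) -> str:
    """Rough health check for a stage: how much data spilled to disk
    relative to what the stage read. Thresholds here are illustrative."""
    if input_bytes == 0 or spill_bytes == 0:
        return "healthy"
    ratio = spill_bytes / input_bytes
    if ratio < 0.1:
        return "minor spill"
    return "heavy spill: under-provisioned memory"

# A stage that read 10 GiB and spilled 4 GiB is under real memory pressure:
memory_pressure(10 * 2**30, 4 * 2**30)
```

In practice you would read the input and spill figures straight off the Spark UI → Stages page rather than compute them yourself.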
Cost optimisation actions
| Issue | Fix |
|---|---|
| High idle time | Reduce auto-termination timeout |
| Over-provisioned (low CPU) | Reduce worker count or node size |
| Under-provisioned (high spill) | Increase memory or worker count |
| Expensive always-on clusters | Switch to job compute or serverless |
| Dev clusters running overnight | Set auto-termination to 30 min |
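The fixes above are easier to prioritise with a number attached. A small sketch that splits a cluster's spend into productive and idle cost (DBU rate, price, and hours are made-up illustrations, not real pricing):

```python
def estimate_cost(dbu_per_hour: float, price_per_dbu: float,
                  uptime_hours: float, idle_hours: float) -> dict:
    """Break a cluster's spend into total vs. idle cost.
    All inputs are illustrative; real rates come from your billing data."""
    total = dbu_per_hour * price_per_dbu * uptime_hours
    idle = dbu_per_hour * price_per_dbu * idle_hours
    return {"total_cost": round(total, 2),
            "idle_cost": round(idle, 2),
            "idle_pct": round(100 * idle / total, 1) if total else 0.0}

# A 10-DBU/hr cluster at $0.55/DBU, up 8 hours with 3 of them idle:
report = estimate_cost(10, 0.55, 8, 3)
```

If more than a third of spend is idle time, tightening auto-termination is usually the first lever to pull.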
Troubleshooting Lakeflow Jobs
Repair runs
When a job fails, you don't have to re-run everything:
Job: nightly_etl (5 tasks)
✅ ingest_crm (completed)
✅ ingest_pos (completed)
✅ clean_data (completed)
❌ build_reports (FAILED: OOM error)
⏭️ notify_team (skipped)
A repair run re-runs only build_reports and notify_team; the three successful tasks are not repeated.
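The selection logic behind a repair run can be sketched in a few lines. This treats the job as a linear chain for simplicity and is an illustrative helper, not the Jobs API (which tracks the real dependency graph):

```python
def tasks_to_repair(chain: list[tuple[str, str]]) -> list[str]:
    """Given (task, status) pairs in dependency order, return what a
    repair run re-executes: the failed task and everything after it."""
    rerun = []
    failed_seen = False
    for task, status in chain:
        if status == "failed":
            failed_seen = True
        if failed_seen:
            rerun.append(task)  # failed task plus downstream (skipped) tasks
    return rerun

nightly_etl = [("ingest_crm", "completed"), ("ingest_pos", "completed"),
               ("clean_data", "completed"), ("build_reports", "failed"),
               ("notify_team", "skipped")]
repair_set = tasks_to_repair(nightly_etl)  # → ["build_reports", "notify_team"]
```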
Common job failures
| Symptom | Likely Cause | Fix |
|---|---|---|
| Task timeout | Query too slow, data too large | Increase timeout, optimize query, add nodes |
| OOM (Out of Memory) | Data doesnβt fit in memory | Increase node memory, reduce partition size, use disk-based operations |
| Cluster start failure | Quota exceeded, region capacity | Try a different node type or region |
| Source unavailable | Network/auth issue | Check connectivity, rotate expired credentials |
| Concurrent run conflict | Previous run still active | Set max concurrent runs to 1 |
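Transient failures like "source unavailable" often resolve on retry. Lakeflow task settings handle retries declaratively, but the same idea in code looks like this (the helper name and delays are illustrative):

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 1.0):
    """Call fn, retrying on failure with exponential backoff
    (base_delay, 2x, 4x, ...). Re-raises after the final attempt."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the real error
            time.sleep(base_delay * 2 ** i)
```

For failures that are not transient (expired credentials, OOM), retries only delay the inevitable; fix the root cause instead.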
Job operations
| Action | When to Use |
|---|---|
| Run | Start a new execution |
| Repair | Re-run only failed tasks from a failed run |
| Restart | Cancel current run and start fresh |
| Stop/Cancel | Stop a running execution |
Troubleshooting Spark jobs
Common Spark issues
| Issue | Symptom | Investigation |
|---|---|---|
| Slow stage | One stage takes much longer | Check Spark UI → Stages for skew |
| OOM error | Driver or executor out of memory | Reduce collect() calls, increase memory |
| Job hangs | Progress stops, no errors | Check for deadlocks, broadcast timeout |
| Data skew | One task processes much more data | Check Spark UI → Task metrics for uneven distribution |
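The skew check from the table can be reduced to a simple heuristic on per-task durations: one straggler far above the median is the signature. A sketch (the 5x ratio threshold is an illustrative choice, not a Spark default):

```python
from statistics import median

def detect_skew(task_durations_s: list[float], ratio: float = 5.0) -> bool:
    """Flag a stage as skewed when the slowest task takes far longer
    than the median task (the pattern visible in Spark UI task metrics)."""
    med = median(task_durations_s)
    return med > 0 and max(task_durations_s) / med >= ratio

# Nine ~10 s tasks and one 95 s straggler: classic skew signature.
detect_skew([10, 11, 9, 10, 12, 10, 9, 11, 10, 95])  # → True
```

The Spark UI shows the same thing visually: sort the stage's tasks by duration and look for a long tail on one or two tasks.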
Cluster restart for recovery
Sometimes the simplest fix is a cluster restart:
- When: persistent driver issues, memory leaks, corrupt state
- How: Stop and restart the cluster (or let auto-termination handle it)
- Caution: streaming jobs lose in-flight micro-batch state (checkpoints protect against data loss)
🎬 Video coming soon
Knowledge check
Ravi's nightly ETL job at DataPulse failed on task 4 of 5. Tasks 1-3 completed successfully and produced correct output. What is the most efficient way to recover?
Next up: Spark Performance: DAG & Query Profile, covering caching, skew, spilling, and shuffle issues.