Lakeflow Jobs: Schedule, Alerts & Recovery
Schedule jobs with cron expressions, configure failure alerts, and set up automatic restarts — making your pipelines self-healing.
Scheduling jobs
Scheduling is like setting an alarm clock for your pipeline.
You tell the job “run every night at 3 AM” or “run every 2 hours on weekdays.” Databricks describes schedules with Quartz cron expressions, a six-field variant of standard cron that adds a leading seconds field.
Common cron patterns
| Schedule | Cron Expression (Quartz) | Notes |
|---|---|---|
| Every day at 3 AM | 0 0 3 * * ? | Standard nightly ETL |
| Every 2 hours | 0 0 */2 * * ? | Frequent refresh |
| Weekdays at 8 AM | 0 0 8 ? * MON-FRI | Business-hours only |
| First of every month | 0 0 0 1 * ? | Monthly aggregation |
Timezone: Always set the timezone explicitly. “3 AM” means different things in UTC vs NZST.
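In the Jobs API, a schedule is a small block pairing a Quartz cron expression with an explicit timezone. A minimal sketch of the nightly-ETL schedule from the table above, using Jobs API field names:

```python
# Schedule block for a Databricks job (Jobs API style).
# Quartz cron has six fields: second minute hour day-of-month month day-of-week.
nightly_schedule = {
    "quartz_cron_expression": "0 0 3 * * ?",  # 3:00:00 AM every day
    "timezone_id": "Pacific/Auckland",        # explicit, so DST shifts are handled for you
    "pause_status": "UNPAUSED",               # "PAUSED" keeps the schedule but stops runs
}
```

Setting `timezone_id` explicitly is what makes “3 AM” unambiguous: the same expression under UTC would fire at a different local time, and would drift by an hour across daylight-saving changes.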
Alerts
Configure alerts to notify your team when jobs fail or take too long:
| Alert Type | When It Fires |
|---|---|
| On failure | Any task in the job fails |
| On success | Job completes successfully |
| On duration exceeded | Job runs longer than a threshold |
| On start | Job begins execution |
Alerts can notify via: email, Slack, PagerDuty, webhooks, or Microsoft Teams.
Ravi configures failure alerts at DataPulse so the on-call engineer gets a Slack message within seconds of any pipeline failure.
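A setup like Ravi's can be sketched as the notification settings on a job. The email address and Slack destination ID below are hypothetical placeholders; in practice you create a notification destination (Slack, PagerDuty, Teams, webhook) in workspace settings first and reference it by ID:

```python
# Notification settings for a job (Jobs API style field names).
notifications = {
    "email_notifications": {
        "on_failure": ["oncall@datapulse.example"],  # hypothetical on-call address
        "no_alert_for_skipped_runs": True,           # don't page for skipped runs
    },
    "webhook_notifications": {
        # Hypothetical ID of a Slack notification destination configured in the workspace.
        "on_failure": [{"id": "slack-oncall-destination-id"}],
    },
    # Duration alert: fire a health rule when the run exceeds 2 hours (7200 s).
    "health": {
        "rules": [
            {"metric": "RUN_DURATION_SECONDS", "op": "GREATER_THAN", "value": 7200}
        ]
    },
}
```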
Automatic restarts and retries
Task-level retries
Configure retries for individual tasks that may fail due to transient issues:
| Setting | Purpose |
|---|---|
| Max retries | Number of retry attempts (e.g., 3) |
| Min retry interval | Wait time between retries (e.g., 30 seconds) |
| Retry on timeout | Whether to retry when a task times out |
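The three settings above map onto task-level fields in the Jobs API. A sketch for a single task (the task name is hypothetical):

```python
# Task-level retry settings (Jobs API field names). Retries absorb transient
# failures; a persistent bug will exhaust them and the failure alert still fires.
ingest_task = {
    "task_key": "ingest",                 # hypothetical task name
    "max_retries": 3,                     # up to 3 retry attempts after the first failure
    "min_retry_interval_millis": 30_000,  # wait at least 30 seconds between attempts
    "retry_on_timeout": True,             # also retry if the task hits its timeout
}
```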
Pipeline automatic restarts
For Declarative Pipelines, configure continuous mode with automatic restart:
- Pipeline runs continuously
- If it fails, it automatically restarts after a configurable delay
- Ideal for streaming pipelines that must stay running
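In the pipeline settings, continuous mode is a single flag. A minimal sketch, assuming a hypothetical pipeline name:

```python
# Declarative Pipelines settings sketch: continuous mode keeps the pipeline
# running and the service restarts a failed update automatically, rather than
# waiting for the next scheduled trigger.
pipeline_settings = {
    "name": "events_stream",  # hypothetical pipeline name
    "continuous": True,       # run continuously instead of triggered
}
```

Triggered pipelines, by contrast, rely on the enclosing job's schedule and retry settings for recovery.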
Job-level recovery
| Feature | What It Does |
|---|---|
| Repair run | Re-run only the failed tasks (not the entire job) |
| Max concurrent runs | Prevent overlapping runs (default: 1) |
| Timeout | Kill the job if it exceeds a time limit |
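The last two rows correspond to job-level fields; repair runs are not configured up front, you trigger them from the failed run's page (or the repair-run API) and only the failed tasks re-execute. A sketch with a hypothetical job name:

```python
# Job-level guardrails (Jobs API field names).
job_settings = {
    "name": "nightly_etl",        # hypothetical job name
    "max_concurrent_runs": 1,     # a new scheduled run is skipped while one is still active
    "timeout_seconds": 4 * 3600,  # kill the whole job if it runs longer than 4 hours
}
```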
🎬 Video coming soon
Knowledge check
Ravi's nightly ETL job at DataPulse occasionally fails on the ingestion task due to a flaky source API. The API usually recovers within a minute. What configuration minimises manual intervention?
Next up: Git & Version Control — Git best practices, branching, and pull request workflows in Databricks.