Lakeflow Jobs: Schedule, Alerts & Recovery
Schedule jobs with cron expressions, configure failure alerts, and set up automatic restarts — making your pipelines self-healing.
Scheduling jobs
Scheduling is like setting an alarm clock for your pipeline.
You tell the job “run every night at 3 AM” or “run every 2 hours on weekdays.” Databricks describes schedules with Quartz cron expressions, a six-field variant of standard cron that adds a leading seconds field.
Common cron patterns
| Schedule | Cron Expression (Quartz) | Notes |
|---|---|---|
| Every day at 3 AM | 0 0 3 * * ? | Standard nightly ETL |
| Every 2 hours | 0 0 */2 * * ? | Frequent refresh |
| Weekdays at 8 AM | 0 0 8 ? * MON-FRI | Business-hours only |
| First of every month | 0 0 0 1 * ? | Monthly aggregation |
Timezone: Always set the timezone explicitly. “3 AM” means different things in UTC vs NZST.
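In the Jobs API, a schedule is a small block pairing a Quartz cron expression with an explicit timezone. A minimal sketch of the nightly-ETL schedule from the table above, using Jobs API field names:

```python
# Schedule block for a Databricks job (Jobs API style).
# Quartz cron has six fields: second minute hour day-of-month month day-of-week.
nightly_schedule = {
    "quartz_cron_expression": "0 0 3 * * ?",  # 3:00:00 AM every day
    "timezone_id": "Pacific/Auckland",        # explicit, so DST shifts are handled for you
    "pause_status": "UNPAUSED",               # "PAUSED" keeps the schedule but stops runs
}
```

Setting `timezone_id` explicitly is what makes “3 AM” unambiguous: the same expression under UTC would fire at a different local time, and would drift by an hour across daylight-saving changes.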
Alerts
Configure alerts to notify your team when jobs fail or take too long:
| Alert Type | When It Fires |
|---|---|
| On failure | Any task in the job fails |
| On success | Job completes successfully |
| On duration exceeded | Job runs longer than a threshold |
| On start | Job begins execution |
Alerts can notify via: email, Slack, PagerDuty, webhooks, or Microsoft Teams.
Ravi configures failure alerts at DataPulse so the on-call engineer gets a Slack message within seconds of any pipeline failure.
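A setup like Ravi's can be sketched as the notification settings on a job. The email address and Slack destination ID below are hypothetical placeholders; in practice you create a notification destination (Slack, PagerDuty, Teams, webhook) in workspace settings first and reference it by ID:

```python
# Notification settings for a job (Jobs API style field names).
notifications = {
    "email_notifications": {
        "on_failure": ["oncall@datapulse.example"],  # hypothetical on-call address
        "no_alert_for_skipped_runs": True,           # don't page for skipped runs
    },
    "webhook_notifications": {
        # Hypothetical ID of a Slack notification destination configured in the workspace.
        "on_failure": [{"id": "slack-oncall-destination-id"}],
    },
    # Duration alert: fire a health rule when the run exceeds 2 hours (7200 s).
    "health": {
        "rules": [
            {"metric": "RUN_DURATION_SECONDS", "op": "GREATER_THAN", "value": 7200}
        ]
    },
}
```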
Automatic restarts and retries
Task-level retries
Configure retries for individual tasks that may fail due to transient issues:
| Setting | Purpose |
|---|---|
| Max retries | Number of retry attempts (e.g., 3) |
| Min retry interval | Wait time between retries (e.g., 30 seconds) |
| Retry on timeout | Whether to retry when a task times out |
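The three settings above map onto task-level fields in the Jobs API. A sketch for a single task (the task name is hypothetical):

```python
# Task-level retry settings (Jobs API field names). Retries absorb transient
# failures; a persistent bug will exhaust them and the failure alert still fires.
ingest_task = {
    "task_key": "ingest",                 # hypothetical task name
    "max_retries": 3,                     # up to 3 retry attempts after the first failure
    "min_retry_interval_millis": 30_000,  # wait at least 30 seconds between attempts
    "retry_on_timeout": True,             # also retry if the task hits its timeout
}
```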
Pipeline automatic restarts
For Declarative Pipelines, configure continuous mode with automatic restart:
- Pipeline runs continuously
- If it fails, it automatically restarts after a configurable delay
- Ideal for streaming pipelines that must stay running
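In the pipeline settings, continuous mode is a single flag. A minimal sketch, assuming a hypothetical pipeline name:

```python
# Declarative Pipelines settings sketch: continuous mode keeps the pipeline
# running and the service restarts a failed update automatically, rather than
# waiting for the next scheduled trigger.
pipeline_settings = {
    "name": "events_stream",  # hypothetical pipeline name
    "continuous": True,       # run continuously instead of triggered
}
```

Triggered pipelines, by contrast, rely on the enclosing job's schedule and retry settings for recovery.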
Job-level recovery
| Feature | What It Does |
|---|---|
| Repair run | Re-run only the failed tasks (not the entire job) |
| Max concurrent runs | Prevent overlapping runs (default: 1) |
| Timeout | Kill the job if it exceeds a time limit |
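The last two rows correspond to job-level fields; repair runs are not configured up front, you trigger them from the failed run's page (or the repair-run API) and only the failed tasks re-execute. A sketch with a hypothetical job name:

```python
# Job-level guardrails (Jobs API field names).
job_settings = {
    "name": "nightly_etl",        # hypothetical job name
    "max_concurrent_runs": 1,     # a new scheduled run is skipped while one is still active
    "timeout_seconds": 4 * 3600,  # kill the whole job if it runs longer than 4 hours
}
```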
🎬 Video coming soon
Knowledge check
Ravi's nightly ETL job at DataPulse occasionally fails on the ingestion task due to a flaky source API. The API usually recovers within a minute. What configuration minimises manual intervention?
Next up: Git & Version Control — Git best practices, branching, and pull request workflows in Databricks.