Failed Job Burst (>5 failures in 1h), Databricks

Card class: Sensitivity • Category: Nerve Centre

At a glance

An alert that fires when more than 5 Databricks job runs reach a FAILED (or TIMEDOUT) terminal state within a rolling 1-hour window. This card is Databricks-distinctive: pipeline failures cascade fast. One broken upstream table or one expired token can fail every dependent job in minutes, so a single root cause can produce a burst of five, ten, or twenty failures almost simultaneously. The burst pattern is the signal that this is a systemic problem, not five unrelated one-off failures.


Data source	Databricks Jobs API, `GET /api/2.1/jobs/runs/list`, reading the terminal `result_state` of each completed run. Reconciled against the `system.lakeflow.job_run_timeline` system table for the historical record.
Metric basis	A count of distinct job runs whose `result_state` is `FAILED` or `TIMEDOUT` with an `end_time` inside the trailing 60 minutes. The alert is a burst detector, not a daily total.
Aggregation window	`1h` rolling, evaluated every minute.
Alert trigger	`>5 jobs FAILED in last 1h`. The sixth failure inside the hour escalates the card.
Why a burst, not a single failure	A lone failure is routine (a transient cloud error, a flaky external API). More than five in an hour almost always share a root cause: a schema change broke a shared table, a credential expired, a cluster pool ran out of capacity, or an upstream job failed and took its downstream dependents with it.
What counts	Any scheduled or triggered job run reaching `FAILED` or `TIMEDOUT`. Retries that ultimately fail count; retries that ultimately succeed do not.
What does NOT count	(1) Runs that succeeded; (2) `CANCELED` runs (a human stopped them deliberately); (3) `SKIPPED` runs; (4) individual tasks within a job: the count is at run level, not task level, to avoid double-counting a multi-task job.
Time zone	Workspace time zone for chart axes; UTC for cross-connector windowing.
Time window	`1h` rolling.
Roles	owner, platform engineering, data engineering on-call

Calculation

The engine maintains a rolling 60-minute window over completed job runs and counts terminal failures:

failed_burst = COUNT(run)
               WHERE run.result_state IN ('FAILED', 'TIMEDOUT')
               AND   run.end_time >= (now - 60 minutes)

FIRE when failed_burst > 5

TIMEDOUT is grouped with FAILED because, from a pipeline-health perspective, a job that blew past its timeout is just as broken as one that threw an exception, and timeouts often appear in bursts when a shared cluster is saturated. CANCELED is excluded: a cancellation is a deliberate human action, not a failure, and including it would fire the alert every time someone aborts a stuck run.

Counting happens at the run level, not the task level. A single workflow with twelve tasks that fails counts as 1, not 12. This is deliberate: the failure of one upstream task usually cascades to fail every downstream task in the same run, and counting tasks would make a single broken job look like a burst on its own. The burst signal is meant to catch many separate jobs failing, which is the fingerprint of a shared root cause.

The window is evaluated every minute against the Jobs API. Because the API reports a run’s result_state only once the run reaches a terminal state, the alert is necessarily reactive: it fires after the failures land, not before. For early warning of cost runaway on jobs that are still running, pair with Long-Running Jobs (>1h).
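The counting rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the engine's actual implementation: the `Run` record, `failed_burst` and `should_fire` are hypothetical names introduced here, and each run is assumed to already carry its final (post-retry) `result_state`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# CANCELED and SKIPPED are deliberately absent from this set.
FAILURE_STATES = {"FAILED", "TIMEDOUT"}
WINDOW = timedelta(minutes=60)
THRESHOLD = 5  # the alert fires on strictly more than 5


@dataclass(frozen=True)
class Run:
    """Hypothetical run-level record: one entry per job run, not per task."""
    run_id: int
    result_state: str
    end_time: datetime


def failed_burst(runs, now):
    """Count terminal failures whose end_time falls in the trailing 60 minutes."""
    cutoff = now - WINDOW
    return sum(
        1
        for run in runs
        if run.result_state in FAILURE_STATES and cutoff <= run.end_time <= now
    )


def should_fire(runs, now):
    """True only when the burst count exceeds the threshold (6 or more)."""
    return failed_burst(runs, now) > THRESHOLD
```

Because the comparison is strict, exactly five failures in the window returns `False` from `should_fire`; the sixth is what trips it.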

Worked example

A data engineering team runs a medallion architecture on Databricks feeding an ecommerce analytics layer: bronze ingestion jobs land raw order and product data, silver jobs clean and conform it, gold jobs build the marts that BI dashboards read. Snapshot taken on 14 Apr 26 at 03:40 BST, mid-nightly-batch.

Time	Job	Layer	result_state	Likely cause
03:12	bronze-orders-ingest	Bronze	FAILED	Source schema added a non-null column
03:14	silver-orders-clean	Silver	FAILED	Upstream bronze table empty
03:15	silver-order-items	Silver	FAILED	Upstream bronze table empty
03:18	gold-revenue-mart	Gold	FAILED	Upstream silver missing
03:19	gold-customer-360	Gold	FAILED	Upstream silver missing
03:22	gold-inventory-snapshot	Gold	TIMEDOUT	Waited on missing silver, hit timeout

By 03:22 the rolling 1-hour count has reached 6 failed runs and the card escalates with the headline 6 job failures in 1h, cascade from bronze-orders-ingest. The on-call data engineer reads the burst correctly in seconds: this is not six problems, it is one problem (the bronze ingest broke) cascading down the medallion. The triage playbook the burst enables:

Read the burst, not the individual failures. The timestamps cluster tightly (03:12 to 03:22) and the dependency chain is obvious: bronze failed first, everything downstream failed because its input was missing. The root cause is the 03:12 failure; the other five are collateral.
Find the trigger. The bronze job log shows a schema-evolution error: the source system added a loyalty_tier column declared NOT NULL, and the ingest job’s strict schema rejected it. A one-line fix (enable schema evolution or add the column to the target) unblocks the whole chain.
Decide on the rerun order. Fixing and rerunning bronze first, then triggering the downstream jobs in dependency order, is far cheaper than blindly rerunning all six and watching the downstream ones fail again on still-missing data.
Quantify the blast radius. Top 10 Failing Workflows (7d) confirms whether this is a first-time break or a recurring fragility, and Pipeline Lag (since last success) shows how stale the gold marts now are for the morning’s dashboards.
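The rerun-order step can be sketched with a topological sort. This is an illustrative sketch only: the `UPSTREAMS` map and the `rerun_order` helper are hypothetical, and the exact silver-to-gold edges are assumed for the example, since the snapshot only records that each gold job was missing its silver input.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: job -> the upstream jobs it reads from.
# The silver-to-gold edges below are assumptions for illustration.
UPSTREAMS = {
    "bronze-orders-ingest": set(),
    "silver-orders-clean": {"bronze-orders-ingest"},
    "silver-order-items": {"bronze-orders-ingest"},
    "gold-revenue-mart": {"silver-orders-clean", "silver-order-items"},
    "gold-customer-360": {"silver-orders-clean"},
    "gold-inventory-snapshot": {"silver-order-items"},
}


def rerun_order(failed_jobs, upstreams):
    """Order failed jobs so each one comes after its failed upstream jobs."""
    failed = set(failed_jobs)
    # Keep only edges between failed jobs; healthy upstreams need no rerun.
    graph = {job: upstreams.get(job, set()) & failed for job in failed}
    return list(TopologicalSorter(graph).static_order())
```

Calling `rerun_order(UPSTREAMS, UPSTREAMS)` for the six failures above places `bronze-orders-ingest` first, then the silver jobs, then the gold jobs; the relative order within a layer is not guaranteed.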

Why the burst framing saves time:
  - Naive reading: "6 jobs failed, open 6 tickets, debug 6 stack traces."
  - Burst reading: "1 root cause at 03:12, 5 downstream casualties.
    Fix 1, rerun in order, done."
  - The cascade structure is the diagnosis. A burst that all points
    back to one upstream job is a dependency cascade; a burst of
    unrelated jobs failing at once is usually shared infrastructure
    (a cluster pool, a credential, a metastore outage).

The reading that distinguishes the two burst shapes: if every failure traces to one upstream job, fix that job. If the failures are unrelated jobs that just happened to fail together, suspect shared infrastructure (an expired service principal token, a metastore hiccup, or a cluster pool that ran dry).
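One way to make that distinction mechanical is to look for a single failed job that sits upstream of all the others. The sketch below is a rough heuristic under stated assumptions, not how the card itself labels bursts: `classify_burst` and `transitive_upstreams` are hypothetical helpers, and the dependency map (job to its direct upstream jobs) is something you would have to supply.

```python
def transitive_upstreams(job, upstreams):
    """Return every job reachable upstream of `job` (direct or indirect)."""
    seen, stack = set(), list(upstreams.get(job, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(upstreams.get(parent, ()))
    return seen


def classify_burst(failed_jobs, upstreams):
    """Label a burst by how many independent failure roots it has."""
    failed = set(failed_jobs)
    if not failed:
        return "no burst", set()
    # A root is a failed job with no other failed job anywhere upstream of it.
    roots = {
        job for job in failed
        if not (transitive_upstreams(job, upstreams) & failed)
    }
    if len(roots) == 1:
        return "dependency cascade", roots
    return "suspect shared infrastructure", roots
```

A burst with one root points at that job; a burst where every failed job is its own root has no dependency explanation, which is the cue to check tokens, the metastore, and pool capacity.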

Sibling cards

Card	Why pair it with Failed Job Burst	What the combination tells you
Failed Jobs (24h)	The daily triage queue behind the burst alert.	A burst that lifts the 24h total sharply is a new incident; a flat 24h total with no burst is steady-state.
Job Success Rate (24h)	The proportional health view.	A burst that drops success rate below 95% is materially damaging the batch.
Top 10 Failing Workflows (7d)	Tells you if the bursting jobs are chronically fragile.	The same workflow topping the list every week means a structural fix is needed, not a rerun.
Pipeline Lag (since last success)	Measures the downstream staleness the burst caused.	High lag after a burst means dashboards are serving stale data right now.
Long-Running Jobs (>1h)	Catches the jobs heading toward a TIMEDOUT failure.	A long-running job that later times out is a future contributor to the burst.
DLT Pipeline Status Distribution	The DLT-pipeline equivalent of the burst.	Many DLT pipelines in Failed state alongside the job burst indicate a workspace-wide event.
Pipeline Lag vs Ecom Order Flow	The cross-channel impact of a stalled pipeline.	A burst that stalls ingestion while orders keep flowing means the business is generating data the pipeline cannot land.

Reconciling against the source

Where to look in Databricks:

Workflows → Job runs in the workspace UI, filtered to the last hour and to the Failed and Timed out statuses. The count there should match this card.
databricks jobs list-runs on the Databricks CLI, or GET /api/2.1/jobs/runs/list, filtered by result_state and end_time, to reproduce the count programmatically.
system.lakeflow.job_run_timeline system table for the authoritative historical record of every run’s terminal state, useful for post-incident reconstruction of the exact cascade order.
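To reproduce the count from an API response, a sketch along these lines may help. It assumes the response body is already parsed into a dict and that each entry in `runs` carries `state.result_state` and an `end_time` in epoch milliseconds; verify that field layout against your API version. The `count_failed_runs` helper is hypothetical, and fetching, authentication and pagination are left out.

```python
import time

FAILURE_STATES = {"FAILED", "TIMEDOUT"}
ONE_HOUR_MS = 60 * 60 * 1000


def count_failed_runs(payload, now_ms=None, window_ms=ONE_HOUR_MS):
    """Count runs in a parsed runs/list payload that failed inside the window.

    Assumed shape: {"runs": [{"state": {"result_state": ...}, "end_time": ms}]}
    """
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    cutoff = now_ms - window_ms
    count = 0
    for run in payload.get("runs", []):
        result = run.get("state", {}).get("result_state")
        end_time = run.get("end_time") or 0  # treated as "not finished yet"
        if result in FAILURE_STATES and cutoff <= end_time <= now_ms:
            count += 1
    return count
```

If the response is paginated, the function would need to be applied to every page and the results summed before comparing against the card.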

Why our count may legitimately differ from the Job runs page:

Reason	Direction	Why
Run-level vs task-level	Vortex IQ count lower	The card counts at run level; if you read the UI at task level a single multi-task failure can look like several.
TIMEDOUT inclusion	Vortex IQ count higher	We group `TIMEDOUT` with `FAILED`; if you filter the UI to Failed only, timeouts will be missing.
CANCELED exclusion	Vortex IQ count lower	Deliberately cancelled runs are excluded; the UI may show them under a combined filter.
Polling cadence	Brief lag	The card evaluates every minute; a failure in the last few seconds may not yet be counted.
Retry handling	Variable	A run that failed then succeeded on retry does not count; a run that exhausted retries counts once.

Cross-connector reconciliation:

Card	Expected relationship	What causes divergence
Pipeline Lag vs Ecom Order Flow	A burst that breaks ingestion shows up as rising pipeline lag while orders continue.	Lag flat despite a burst means the failed jobs were not on the critical ingestion path.
Shopify / BigCommerce / Adobe order feeds	Order volume keeps flowing regardless of the burst; the data just stops landing.	A growing gap between source order count and landed rows quantifies the business impact of the stalled pipeline.

Known limitations / FAQs

Five of my jobs failed but the alert did not fire. Why?
The trigger is strictly more than 5, so the sixth failure within the hour is what escalates the card. Exactly 5 failures sits just under the threshold. If your estate is small and you want earlier warning, lower the threshold in the Sensitivity tab; a 3-job burst is a reasonable setting for a workspace with only a handful of scheduled jobs.

One job with twelve tasks failed and I expected a burst. Why is the count 1?
The card counts at run level, not task level, on purpose. The failure of one task usually cascades to fail the remaining tasks in the same run, so counting tasks would make a single broken job masquerade as a burst. The burst signal is specifically designed to catch many separate jobs failing, which is the fingerprint of a shared root cause across your estate.

A job failed, retried, and then succeeded. Is it in the count?
No. Only runs whose final terminal state is FAILED or TIMEDOUT are counted. A run that recovered on retry is treated as a success. This keeps the burst signal focused on genuine, unrecovered failures rather than transient blips that the retry policy already absorbed.

Are cancelled runs counted?
No. CANCELED runs are excluded because cancellation is a deliberate human action, not a failure. If they were included, every time an engineer aborted a stuck run during an incident the alert would fire, adding noise during exactly the moment you want a clean signal.

The burst points at six unrelated jobs with no shared dependency. What does that mean?
That is the second burst shape and it usually points at shared infrastructure rather than a data cascade. Common causes: an expired service principal or PAT token breaking authentication across jobs, a metastore or Unity Catalog hiccup, a cluster pool that ran out of capacity so new clusters could not start, or a cloud-provider zone issue. Check token expiry and pool capacity first.

Does this include Delta Live Tables pipeline failures?
Not directly. This card reads the Jobs API runs. DLT pipelines have their own lifecycle and are tracked on DLT Pipeline Status Distribution. During a workspace-wide event you will often see both this card and the DLT card light up together; read them as one incident.

Tracked live in Vortex IQ Nerve Centre

Failed Job Burst (>5 failures in 1h) is one of hundreds of KPI pulses Vortex IQ tracks across Databricks and 70+ other ecommerce connectors. Nerve Centre runs the detection layer; Vortex Mind investigates the cause when something moves; Ask Viq lets you interrogate any number in plain English. Start for free or book a demo to see this metric running on your own data.
