At a glance
The percentage of Databricks job runs that completed successfully over the last 24 hours, shown as a gauge. This is the defining Databricks metric for pipeline health: if scheduled runs are failing, your data pipelines are broken and every downstream table, dashboard, and feature store is at risk of going stale. A single number gives the platform owner an instant read on whether the lakehouse is delivering its contracts. The alert fires below 95%, the floor at which the team should assume something systemic is wrong rather than a one-off flake.
| Field | Detail |
|---|---|
| Data source | Databricks Jobs API, GET /api/2.2/jobs/runs/list. Each completed run carries a terminal state.result_state; the card computes the ratio of SUCCESS runs to all completed runs in the window. |
| Formula | SUCCESS runs / (SUCCESS + FAILED + TIMEDOUT runs) × 100, over the rolling 24-hour window. |
| What counts in the denominator | All runs that reached a terminal state in the window: SUCCESS, FAILED, and TIMEDOUT. These are the runs that had a chance to succeed or fail. |
| What is excluded | CANCELED runs (intentional, not a quality signal) and runs still RUNNING / PENDING (no terminal state yet). Including cancellations would distort the rate with deliberate human actions. |
| Aggregation window | Rolling 24 hours, refreshed each polling cycle. The gauge reflects the live ratio across that window. |
| Time window | 24h (rolling 24 hours) |
| Alert trigger | < 95%. When success rate drops below 95% the card flags it; pair with Failed Jobs (24h) for the worklist of exactly which runs failed. |
| Roles | owner, platform engineering, data engineering, operations |
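The counting rules in the table can be sketched in a few lines of Python. This is a minimal illustration, not the card's actual implementation. It assumes each run is a dict shaped like an entry in the `runs` array of the Jobs API response, with `end_time` in epoch milliseconds and `state.result_state`. It also assumes a run belongs to the window when its `end_time` falls inside it. The function names are hypothetical, and HTTP calls and pagination are left out.

```python
from collections import Counter

# Result states that count toward the denominator, per the table above.
COUNTED_STATES = ("SUCCESS", "FAILED", "TIMEDOUT")
WINDOW_MS = 24 * 60 * 60 * 1000  # rolling 24 hours
ALERT_THRESHOLD_PCT = 95.0


def tally_window(runs, now_ms):
    """Count result states for runs that ended within the last 24 hours."""
    cutoff = now_ms - WINDOW_MS
    counts = Counter()
    for run in runs:
        result = run.get("state", {}).get("result_state")
        end_time = run.get("end_time") or 0
        # Skip runs with no terminal result yet (RUNNING / PENDING) and
        # runs that finished outside the window.
        if result is None or not (cutoff <= end_time <= now_ms):
            continue
        counts[result] += 1
    return counts


def success_rate(counts):
    """Return (rate_pct, denominator); rate_pct is None if nothing counted."""
    denominator = sum(counts[state] for state in COUNTED_STATES)
    if denominator == 0:
        return None, 0
    return 100.0 * counts["SUCCESS"] / denominator, denominator


def is_alerting(rate_pct):
    """True when a computed rate sits below the 95% floor."""
    return rate_pct is not None and rate_pct < ALERT_THRESHOLD_PCT
```

CANCELED runs are still tallied by `tally_window`, but they never reach the denominator because `success_rate` sums only the three counted states.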
Calculation
The success rate is the share of terminal runs that ended in SUCCESS:

Success rate = SUCCESS runs / (SUCCESS + FAILED + TIMEDOUT runs) × 100
- Cancellations are excluded so the rate measures quality, not activity. A run that a human or an upstream task cancelled is neither a success nor a failure of the pipeline; it is a deliberate act. Including cancellations would let a busy operations day drag the rate down even when nothing is genuinely broken. The denominator is only runs that actually attempted to complete.
- The rate is run-weighted, not job-weighted. A job that runs hourly contributes 24 runs to the window; a daily job contributes one. This is intentional: a flaky high-frequency job has more downstream impact (more stale refreshes) and should move the gauge more than a single daily job. To see per-job patterns instead, use Top 10 Failing Workflows (7d).
- Small denominators are volatile. With only 8 runs in the window, one failure is a 12.5-point drop. The gauge shows the absolute rate, but the card annotates the run count so that an 87.5% from 7-of-8 is read differently from 87.5% across 400 runs. Low-volume workspaces should weight the failed-run worklist over the headline percentage.
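The difference between run-weighting and job-weighting, and the value of annotating the run count, can be seen in a small sketch. The job names, the sample window, and the annotation format are hypothetical and chosen only for illustration. Each run is a `(job_name, succeeded)` pair that has already been filtered to counted terminal states, and the input is assumed to be non-empty.

```python
from collections import defaultdict


def run_weighted_rate(runs):
    """Share of counted runs that succeeded, as a percentage."""
    outcomes = [ok for _, ok in runs]
    return 100.0 * sum(outcomes) / len(outcomes)


def job_weighted_rate(runs):
    """Average of per-job success rates, each job counted once."""
    per_job = defaultdict(list)
    for job, ok in runs:
        per_job[job].append(ok)
    rates = [100.0 * sum(v) / len(v) for v in per_job.values()]
    return sum(rates) / len(rates)


def annotate(successes, total):
    """Format a rate together with its run count (hypothetical format)."""
    return f"{100.0 * successes / total:.1f}% ({successes} of {total} runs)"


# Hypothetical window: an hourly job with 6 failures out of 24 runs,
# plus a daily job that succeeded once.
sample_runs = [("hourly_job", i >= 6) for i in range(24)] + [("daily_job", True)]
```

On this sample the run-weighted rate is 76.0% (19 of 25 runs) while the job-weighted average is 87.5%, which is why the flaky hourly job moves the gauge more. `annotate(7, 8)` and `annotate(350, 400)` both start with 87.5% but carry very different run counts.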
Worked example
A data platform team runs a busy lakehouse: hourly micro-batches, several nightly ETL jobs, and a feature pipeline. Snapshot taken 18 Apr 26 at 08:00, covering the previous 24 hours.

| Result state | Runs |
|---|---|
| SUCCESS | 184 |
| FAILED | 9 |
| TIMEDOUT | 2 |
| CANCELED (excluded) | 5 |
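Applying the formula from the Calculation section to these counts, with the five CANCELED runs left out of the denominator:

$$
\text{Success rate} = \frac{184}{184 + 9 + 2} \times 100 = \frac{184}{195} \times 100 \approx 94.4\%
$$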
- 94.4% is below the floor, so this is not a normal day. In a healthy steady state this workspace sits at 98 to 99%. The drop of four to five points means a cluster of failures, not background flakiness. The first move is to open Failed Jobs (24h) and see whether the 11 failures share a root cause.
- The worklist shows 7 of the 11 failures are the same hourly job. prod_orders_hourly failed seven times overnight on the same schema-mismatch error. That is one broken pipeline expressing itself as seven failed runs, which is why a single bug dropped the rate so far: the high-frequency job dominates the denominator. Fixing that one job recovers most of the gap. Cross-check Failed Job Burst (>5 failures in 1h) to confirm whether the seven clustered in one hour (a cascade) or spread evenly (a deterministic per-run bug).
- The two timed-out runs are a separate, slower problem. A nightly feature build hit its timeout twice. That is a duration / data-volume issue, not a code error; it is worth a same-day look via Long-Running Jobs (>1h) but is not what dragged the gauge down.
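The two checks in this triage, finding the dominant failing job and telling a clustered cascade from evenly spread failures, can be sketched as follows. This is an illustrative approximation under stated assumptions, not the logic of the Failed Job Burst card. Failures are assumed to be `(job_name, end_time_seconds)` pairs, and the function names and the sliding-window approach are hypothetical.

```python
from collections import Counter

BURST_WINDOW_S = 3600  # one hour
BURST_THRESHOLD = 5    # a burst is more than 5 failures inside the window


def top_offender(failures):
    """Return (job_name, count) for the job with the most failed runs."""
    return Counter(job for job, _ in failures).most_common(1)[0]


def max_failures_in_window(timestamps, window_s=BURST_WINDOW_S):
    """Largest number of failures inside any sliding window of window_s seconds."""
    ts = sorted(timestamps)
    best = 0
    left = 0
    for right in range(len(ts)):
        # Advance the left edge until the span is shorter than the window.
        while ts[right] - ts[left] >= window_s:
            left += 1
        best = max(best, right - left + 1)
    return best


def is_burst(timestamps):
    """True when any one-hour window holds more than BURST_THRESHOLD failures."""
    return max_failures_in_window(timestamps) > BURST_THRESHOLD
```

Seven failures five minutes apart would register as a burst, while seven failures one hour apart would not. The second pattern points to a deterministic per-run bug rather than a cascade.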
Sibling cards to read alongside
| Card | Why pair it with Job Success Rate | What the combination tells you |
|---|---|---|
| Failed Jobs (24h) | The worklist behind the percentage. | The rate says how bad; the worklist says which runs to fix first. |
| Failed Job Burst (>5 failures in 1h) | Detects whether failures clustered into a cascade. | A burst plus a rate drop signals a dependency chain breaking, not isolated bugs. |
| Top 10 Failing Workflows (7d) | The weekly per-job pattern under the daily rate. | A chronic offender keeps the rate suppressed day after day. |
| Long-Running Jobs (>1h) | The leading signal for TIMEDOUT runs that hurt the rate. | Catching duration creep early prevents future timeouts. |
| Pipeline Lag (since last success) | The downstream consequence of a low success rate. | A falling rate plus rising lag quantifies data staleness. |
| Databricks Health Score | The composite that weights success rate heavily. | A sub-95% success rate is one of the largest drags on the overall score. |
| Pipeline Lag vs Ecom Order Flow | The cross-channel impact view. | A low success rate while orders keep flowing is the highest-urgency case. |
Reconciling against the source
Where to look in Databricks:
- Workflows → Job runs with the 24-hour filter shows every run and its result state; counting Succeeded against the total of Succeeded + Failed + Timed out reproduces the rate.
- System tables: system.lakeflow.job_run_timeline holds run-level terminal states for an exact SQL reconcile, which you can run in a Databricks SQL editor.
- Each job’s Runs tab shows the per-job success history, useful for confirming which job is dragging the rate.
| Reason | Direction | Why |
|---|---|---|
| Cancellation handling | Vortex IQ may read higher | The card excludes CANCELED from the denominator; a UI tally that includes cancellations as non-successes would show a lower rate. |
| TIMEDOUT in denominator | Vortex IQ may read lower | The card counts timed-out runs as failures in the denominator; a calculation counting only FAILED would read higher. |
| Window edge / time zone | Small drift | Rolling 24h from the current minute vs the UI’s calendar-day or last-N-hours buckets, plus workspace vs UTC alignment near midnight. |
| Run-weighting | Perception difference | The card weights by run, so a high-frequency job dominates. A per-job average in the UI will look different even on the same data. |
| System-table lag | Vortex IQ live, table delayed | The live Jobs API updates within seconds; system.lakeflow.* can trail by minutes. |
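The direction of the first two rows can be checked numerically. The sketch below reuses the counts from the worked example and compares the card's convention with two alternative tallies. The function and key names are hypothetical, and the alternatives are only illustrations of how a different count might be assembled.

```python
def rates_by_convention(counts):
    """Success rate (percent, one decimal) under three counting conventions."""
    success = counts["SUCCESS"]
    failed = counts["FAILED"]
    timed_out = counts["TIMEDOUT"]
    canceled = counts["CANCELED"]

    def pct(denominator):
        return round(100.0 * success / denominator, 1)

    return {
        # Card convention: cancellations excluded, timeouts count as failures.
        "card": pct(success + failed + timed_out),
        # Alternative: cancellations counted as non-successes.
        "cancellations_included": pct(success + failed + timed_out + canceled),
        # Alternative: only FAILED runs counted against the rate.
        "failed_only": pct(success + failed),
    }


worked_example = {"SUCCESS": 184, "FAILED": 9, "TIMEDOUT": 2, "CANCELED": 5}
result = rates_by_convention(worked_example)
```

On the worked-example counts this gives 94.4% for the card, 92.0% with cancellations included, and 95.3% when only FAILED runs are counted. The last figure sits above the 95% floor while the card's figure sits below it, so it helps to confirm the convention before concluding that the numbers disagree.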
| Card | Expected relationship | What causes divergence |
|---|---|---|
| Failed Jobs (24h) | Failed-run count and success rate move inversely over the same population. | If they disagree, check cancelled runs or window misalignment. |
| Pipeline Lag vs Ecom Order Flow | A sustained sub-95% rate should show as rising lag against steady order flow. | Lag flat despite a low rate means the failing jobs are non-load-bearing. |