At a glance
Top 10 Failing Workflows (7d) ranks the ten Databricks Jobs (workflows) with the most failed runs over the trailing 7 days, one row per Job. It is the triage shortlist for a platform or data-engineering team: rather than wading through every job, you see which scheduled pipelines fail most often, so you can fix the chronic offenders first.
| What it tracks | The ten Jobs with the highest count of runs ending in result_state = FAILED or TIMEDOUT over the last 7 days, one row per Job, ordered by failure count descending. Sourced from the Jobs API /api/2.1/jobs/runs/list, filtered to terminal failed states. |
| Time window | 7d (trailing 7 days, updated on the standard data refresh). |
| Alert trigger | None. This is a ranking table for triage, not a threshold alert. For the live failure signal use the Failed Job Burst and Failed Jobs (24h) cards. |
| Roles | engineering, operations |
What it tracks
Each row is a single Databricks Job (a named workflow that may chain several tasks) together with how many of its runs failed in the trailing 7 days. The engine reads run history from /api/2.1/jobs/runs/list, counts the runs whose result_state is FAILED or TIMEDOUT, groups by job_id, and returns the top ten by count. A workflow that runs hourly and fails twice a day will outrank a daily workflow that fails once, which is intentional: the card surfaces the volume of failure, which is what erodes trust in your data freshness. Read it weekly alongside Job Success Rate (24h) to separate one bad day from a structural problem, and use Top 10 Slowest SQL Queries when failures are actually timeouts in disguise.
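The counting step described above can be sketched in a few lines of Python. This is a minimal illustration, not the engine's actual code: it assumes each run is a dictionary shaped like an entry in the runs/list response (job_id, start_time in epoch milliseconds, and a nested state.result_state), and the function name top_failing_jobs, the use of start_time to place a run in the window, and the tie-break by job_id are all assumptions made for the example.

```python
from collections import Counter
from typing import Any, Iterable, Mapping

# Terminal states the card treats as failures (taken from the text above).
FAILED_STATES = {"FAILED", "TIMEDOUT"}


def top_failing_jobs(
    runs: Iterable[Mapping[str, Any]],
    window_start_ms: int,
    limit: int = 10,
) -> list[tuple[int, int]]:
    """Return (job_id, failed_run_count) pairs, highest count first.

    Assumption: a run belongs to the window if its start_time (epoch ms)
    is at or after window_start_ms. The real engine may use a different
    timestamp.
    """
    failures: Counter = Counter()
    for run in runs:
        result_state = run.get("state", {}).get("result_state")
        in_window = run.get("start_time", 0) >= window_start_ms
        if result_state in FAILED_STATES and in_window:
            failures[run["job_id"]] += 1
    # Sort by count descending, then job_id ascending (hypothetical tie-break).
    ranked = sorted(failures.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]
```

Because the ranking is by raw count rather than failure rate, a frequently scheduled Job accumulates failures faster than an infrequent one, which matches the hourly versus daily behaviour described above.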
Reconciling against the source
To verify in Databricks natively, open Workflows → Jobs and sort the job list by recent run status, or query the system.lakeflow.job_run_timeline system table (Unity Catalog) filtered to result_state IN ('FAILED','TIMEDOUT') over the last 7 days and grouped by job_id. Counts can differ slightly because Vortex IQ uses a 7-day rolling window in your reporting time zone while the Workflows UI defaults to a fixed run-count page in workspace time.
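As a starting point for the system-table check, the query can be assembled as below. Treat this as a sketch under stated assumptions: the helper name build_reconciliation_query is hypothetical, the run_id and period_end_time columns are assumed to exist on the table, and the state literals are copied from the text above, so confirm the exact spellings your workspace's table uses before relying on the counts.

```python
def build_reconciliation_query(days: int = 7, limit: int = 10) -> str:
    """Build a SQL string that counts failed runs per job over a trailing window.

    Assumptions (verify against your workspace): the table exposes run_id and
    period_end_time, and the result_state literals match the ones used here.
    """
    days, limit = int(days), int(limit)
    if days <= 0 or limit <= 0:
        raise ValueError("days and limit must be positive integers")
    return (
        "SELECT job_id, COUNT(DISTINCT run_id) AS failed_runs\n"
        "FROM system.lakeflow.job_run_timeline\n"
        "WHERE result_state IN ('FAILED', 'TIMEDOUT')\n"
        f"  AND period_end_time >= current_timestamp() - INTERVAL {days} DAYS\n"
        "GROUP BY job_id\n"
        "ORDER BY failed_runs DESC\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    # Print the query so it can be pasted into a SQL editor.
    print(build_reconciliation_query())
```

The window here is anchored to the moment the query runs, so small differences against the card are expected whenever its rolling window was last computed at a different time.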