At a glance
The average Databricks Units consumed per job run over the trailing 7 days, compared against the prior 7-day period. Where DBU Burned (24h) tells you the total spend, this card tells you the unit economics: how expensive a single job run has become. It is the efficiency metric FinOps and data engineering watch together, because a rising cost-per-run is the early warning that pipelines are growing heavier (bigger data, worse plans, larger clusters) before the total bill makes it obvious.
| Data source | Billable usage API joined to the Jobs Runs API (GET /api/2.1/jobs/runs/list). DBU consumption per run is taken from the billing-usage records; the run count comes from completed runs in the window. Where enabled, system.billing.usage plus system.workflow.jobs provide the same join for historical reconciliation. |
| Metric basis | total job-run DBU in window / number of completed job runs in window. A simple mean across runs, not weighted by run duration. |
| Aggregation window | Trailing 7 days, compared against the prior 7 days (7d vsP). |
| Comparison | Week-over-week. The card surfaces the percentage change against the matching prior period. |
| What counts | Completed job runs (success, failed, and timed-out) and the DBU billed against their compute. |
| What does NOT count | Interactive (all-purpose) cluster usage outside of jobs, SQL warehouse DBU (those have their own cards), and runs still in progress at the window edge. |
| Time window | 7d vsP (trailing 7 days vs prior 7 days) |
| Alert trigger | +25% WoW (cost per run rising week-over-week, investigate before the total bill follows) |
| Roles | FinOps, data engineering, platform engineering |
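The "What counts" and "What does NOT count" rows amount to a filter over run records. The sketch below shows one way to express that filter in Python. It assumes each run is a dict shaped like an item from the Jobs Runs API, with a `state.result_state` field. Mapping the card's "success, failed, and timed-out" onto the `SUCCESS`, `FAILED`, and `TIMEDOUT` result states is an assumption, and `counts_toward_card` is a hypothetical helper, not the engine's actual code.

```python
# Result states assumed to correspond to "success, failed, and timed-out".
COUNTED_RESULT_STATES = {"SUCCESS", "FAILED", "TIMEDOUT"}


def counts_toward_card(run: dict) -> bool:
    """Return True if a run record should be included in the run count.

    A run that is still in progress has no result_state yet, so it is excluded.
    """
    result_state = run.get("state", {}).get("result_state")
    return result_state in COUNTED_RESULT_STATES


# Illustrative records only; the field values are made up for this sketch.
runs = [
    {"run_id": 1, "state": {"life_cycle_state": "TERMINATED", "result_state": "SUCCESS"}},
    {"run_id": 2, "state": {"life_cycle_state": "TERMINATED", "result_state": "FAILED"}},
    {"run_id": 3, "state": {"life_cycle_state": "RUNNING"}},  # still in progress
]
completed = [r for r in runs if counts_toward_card(r)]
print([r["run_id"] for r in completed])
```

Only the first two runs pass the filter; the third is excluded because it has not finished.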
Calculation
The engine sums the DBU billed against job compute over the trailing 7 days and divides by the number of completed job runs in the same window:
Avg DBU per job run = total job-run DBU in window / completed job runs in window
The same calculation over the prior 7 days gives the baseline. Because both windows span a full week, the +25% WoW alert fires on genuine efficiency regressions rather than on the normal Monday-vs-Sunday shape of a workload. A breach means each run is consuming materially more DBU than it did last week: usually larger input data, a regressed query plan, a cluster that was sized up, or autoscaling that is overshooting.
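The calculation and the alert check can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the engine's implementation: the function names are hypothetical, returning `None` for an empty window is an assumed convention, and treating a change of exactly +25% as a breach (`>=`) is also an assumption. The figures in the usage lines come from the worked example on this page.

```python
from typing import Optional

ALERT_THRESHOLD = 0.25  # the card's default +25% WoW trigger


def avg_dbu_per_run(total_dbu: float, completed_runs: int) -> Optional[float]:
    """Simple mean: total job-run DBU divided by completed runs."""
    if completed_runs == 0:
        return None  # assumed convention: no completed runs, no average
    return total_dbu / completed_runs


def wow_change(current: float, prior: float) -> Optional[float]:
    """Fractional week-over-week change; 0.25 means +25%."""
    if prior == 0:
        return None  # assumed convention: no baseline, no comparison
    return (current - prior) / prior


def alert_fires(current: float, prior: float,
                threshold: float = ALERT_THRESHOLD) -> bool:
    """True when the per-run cost rose by at least the threshold."""
    change = wow_change(current, prior)
    return change is not None and change >= threshold


# Figures from the worked example: 5,880 vs 4,200 DBU over 1,400 runs each.
current = avg_dbu_per_run(5880, 1400)
prior = avg_dbu_per_run(4200, 1400)
print(f"{wow_change(current, prior):+.0%}", alert_fires(current, prior))  # +40% True
```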
Worked example
A platform team reviews the card on 21 Apr 26. The headline shows a week-over-week jump that trips the alert.
| Window | Total job-run DBU | Completed runs | Avg DBU/run |
|---|---|---|---|
| Prior 7d (07 to 13 Apr) | 4,200 | 1,400 | 3.0 |
| This 7d (14 to 20 Apr) | 5,880 | 1,400 | 4.2 |
Completed runs are flat at 1,400, so the rise from 3.0 to 4.2 DBU per run (+40%, above the +25% trigger) comes entirely from heavier runs, not from a change in volume.
- DBU by Cluster (7d) isolates the culprit cluster. It shows the prod-etl-nightly cluster jumped from a fixed 8 nodes to autoscaling up to 16 mid-week, after a new upstream feed roughly doubled the input volume on one table.
- The fix is a plan change, not a node change. The new feed landed as thousands of small files; the nightly load was reading them all on every run. Compacting them with a scheduled OPTIMIZE (tracked by Last Delta Lake Vacuum / Optimize) cut the read cost and pulled avg DBU/run back toward 3.2 the following week.
- Confirm no run-time regression came with it. Long-Running Jobs (>1h) stayed flat, so the extra DBU was scale-out cost, not jobs hanging.
Sibling cards to reference together
| Card | Why pair it with Avg DBU per Job Run | What the combination tells you |
|---|---|---|
| DBU Burned (24h) | The total-spend headline this per-run figure feeds. | Rising total with flat per-run equals more runs; rising per-run with flat total equals fewer but heavier runs. |
| DBU by Cluster (7d) | Localises which cluster drove the per-run change. | The cluster topping the DBU table is where the efficiency regression lives. |
| Idle Cluster DBU Wasted (24h) | Separates useful per-run cost from idle waste. | High per-run plus high idle waste means the cost is partly clusters sitting open between runs. |
| Avg Cluster CPU Utilisation % | The utilisation behind the cost. | Per-run cost up with utilisation down means over-provisioned nodes; up with utilisation up means genuinely heavier work. |
| Long-Running Jobs (>1h) | Run duration is a direct DBU multiplier. | A per-run cost rise tracking longer runs points to a plan or data-volume regression. |
| Job Success Rate (24h) | Failed and retried runs still burn DBU. | Falling success rate inflates per-run cost through wasted retries. |
| Last Delta Lake Vacuum / Optimize | Table maintenance is a common per-run cost lever. | Stale OPTIMIZE plus rising per-run cost equals small-file overhead on reads. |
Reconciling against the source
Where to look in Databricks:
- Settings → Usage (the account-level usage dashboard) for total job-compute DBU over a custom range.
- system.billing.usage joined to system.workflow.jobs for the per-run DBU series if system tables are enabled.
- Workflows → Jobs → Runs for the completed-run count over the same window.
Why our number may legitimately differ from the Databricks UI:
| Reason | Direction | Why |
|---|---|---|
| Mean vs duration-weighted | Variable | Vortex IQ reports a simple per-run mean. The usage dashboard shows only total DBU, so a per-run figure derived from it by hand depends on which runs you count and how you weight them. |
| Window boundary | Variable | We use a rolling trailing-7-day window vs prior 7 days; usage dashboards default to calendar periods, so the run set differs at the edges. |
| Billing lag | Vortex IQ slightly lower near the edge | Billable usage records can lag the run by a short interval, so the most recent runs may not yet carry their full DBU at read time. |
| In-progress runs | Vortex IQ excludes | Runs still executing at the window edge are not counted; the usage dashboard may show partial DBU for them. |
| Time zone | Window alignment | Native dashboards use the account time zone; Vortex IQ stores UTC and renders in your profile time zone. |
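When reconciling, the window boundary and time zone rows are easiest to check by writing the two windows out explicitly. The sketch below derives both 7-day windows in UTC from a review date. It assumes the windows are aligned to whole UTC days and are half-open (start inclusive, end exclusive), which matches the dates in the worked example but may not match how a rolling window is cut in practice. `comparison_windows` is a hypothetical helper.

```python
from datetime import date, datetime, time, timedelta, timezone


def comparison_windows(as_of: date, days: int = 7):
    """Return ((cur_start, cur_end), (prior_start, prior_end)) in UTC.

    Each window is half-open [start, end). The current window ends at
    00:00 UTC on `as_of`; the prior window ends where the current one starts.
    """
    cur_end = datetime.combine(as_of, time.min, tzinfo=timezone.utc)
    cur_start = cur_end - timedelta(days=days)
    prior_start = cur_start - timedelta(days=days)
    return (cur_start, cur_end), (prior_start, cur_start)


# Review date from the worked example: 21 Apr 26.
current, prior = comparison_windows(date(2026, 4, 21))
print("current:", current[0].date(), "to", current[1].date(), "(exclusive)")
print("prior:  ", prior[0].date(), "to", prior[1].date(), "(exclusive)")
```

For a review on 21 Apr, this yields 14 Apr up to but not including 21 Apr for the current window and 07 Apr up to but not including 14 Apr for the prior one. Filter the Databricks-side data to the same UTC bounds before comparing numbers.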
Known limitations / FAQs
My per-run cost jumped but I did not change anything. What happened?
The usual cause is upstream data growth. The same job reading more rows, more partitions, or more small files costs more DBU per run even with identical code and cluster config. Check the input volume of the heaviest job in DBU by Cluster (7d), and if small files are the issue, schedule a regular OPTIMIZE.
Why a mean rather than a duration-weighted average?
A simple per-run mean is the metric a budget owner reasons about: “what does one run of our pipeline cost on average?” Duration-weighting would let a handful of very long runs dominate and obscure the typical case. For the long-tail view, read Long-Running Jobs (>1h) alongside it.
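A small numeric comparison makes the difference concrete. The run figures below are hypothetical, and "duration-weighted" is taken here to mean a mean of per-run DBU weighted by run duration in hours, which is one reasonable reading rather than a definition from this page.

```python
def simple_mean(runs):
    """Mean DBU per run; every run counts equally."""
    return sum(dbu for dbu, _ in runs) / len(runs)


def duration_weighted_mean(runs):
    """Mean DBU per run, weighted by each run's duration in hours."""
    total_hours = sum(hours for _, hours in runs)
    return sum(dbu * hours for dbu, hours in runs) / total_hours


# Hypothetical week: nine typical runs (2 DBU, 30 min) and one long
# backfill (30 DBU, 6 h). Each tuple is (dbu, duration_hours).
runs = [(2.0, 0.5)] * 9 + [(30.0, 6.0)]

print(simple_mean(runs))             # 4.8
print(duration_weighted_mean(runs))  # 18.0
```

The single backfill pulls the weighted figure to 18.0, far from the 2 DBU that nine of the ten runs actually cost, while the simple mean stays at 4.8.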
Failed runs are dragging my average up. Is that correct?
Yes, and it is intentional. A failed or timed-out run still consumed compute before it died, so it still cost DBU. Counting it keeps the average honest about money spent. If failures are inflating the figure, the fix is to raise Job Success Rate (24h), not to exclude the cost.
Does this include interactive notebook usage?
No. This card covers job runs and the compute billed against them only. Ad-hoc work on all-purpose clusters and SQL warehouse queries are billed and tracked separately. If your engineers run heavy exploratory work interactively, it will not appear here even though it shows in the total bill.
The +25% alert fired but my total bill is flat. Should I care?
Yes, this is the early-warning case the card exists for. Flat total with rising per-run means run volume fell while each run got heavier. The total has not moved yet only because fewer runs masked it; when volume returns to normal, the bill will jump. Investigate now.
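The masking effect is simple arithmetic: total DBU is run count times average DBU per run, so a drop in one can hide a rise in the other. The sketch below uses the prior-week figures from the worked example (4,200 DBU over 1,400 runs) and a hypothetical current week in which the total is unchanged but only 1,000 runs completed.

```python
def avg_per_run(total_dbu: float, runs: int) -> float:
    """Simple mean DBU per completed run."""
    return total_dbu / runs


# Prior week, from the worked example.
prior_total, prior_runs = 4200, 1400
# Hypothetical current week: same total DBU, fewer completed runs.
current_total, current_runs = 4200, 1000

prior_avg = avg_per_run(prior_total, prior_runs)
current_avg = avg_per_run(current_total, current_runs)

total_change = (current_total - prior_total) / prior_total
per_run_change = (current_avg - prior_avg) / prior_avg
# Weekly total if run volume returns to the prior week's count
# while each run stays as heavy as it is now.
projected_total = current_avg * prior_runs

print(f"total change:   {total_change:+.0%}")
print(f"per-run change: {per_run_change:+.0%}")
print(f"projected total at {prior_runs} runs: {projected_total:.0f} DBU")
```

The total is flat while per-run cost is up 40%, and the projected total at normal volume is 5,880 DBU, which is the jump the alert is warning about.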
Can I change the +25% threshold?
Yes. The week-over-week sensitivity is configurable per profile in the Sensitivity tab. Teams with deliberately variable workloads (seasonal loads, backfills) often widen it to avoid firing on expected swings, while cost-sensitive teams tighten it.
Why is my per-run figure lower than expected right after midnight?
Billing-usage records lag completed runs slightly, so runs that finished in the last few minutes may not yet carry their full DBU. The figure settles as the billing data catches up; reconcile against system.billing.usage after the lag clears rather than at the window edge.