At a glance
The mean CPU utilisation across all running clusters in the workspace over the last hour, weighted by node count. For a platform team this is the single most actionable right-sizing signal Databricks exposes: clusters sitting under 30% are paying for cores they never use, and clusters pinned above 90% are throttling jobs and inflating run times. The whole point of this card is to keep you in the healthy band where you pay for compute you actually consume without queuing work behind saturated nodes.
| Data source | Cluster node metrics streamed to Databricks via the Ganglia / built-in cluster-metrics agent, surfaced through the Clusters API (GET /api/2.0/clusters/list for the running fleet) joined to per-node CPU samples. Where the workspace exposes it, the system.compute.node_timeline system table provides the same CPU series for historical reconciliation. |
| Metric basis | CPU busy percentage (100 - idle%) averaged per node, then averaged across all nodes in all RUNNING clusters, weighted by node count so a 20-node cluster counts more than a 2-node cluster. Driver and worker nodes are both included. |
| Aggregation window | Rolling 1 hour, refreshed on the standard cluster-metrics cadence (roughly every 60 seconds for the live gauge). |
| Healthy band | 30% to 90%. Below 30% is under-utilised (over-provisioned or idle autoscaling floor); above 90% is saturated (jobs CPU-bound, run times stretching). |
| What counts | All clusters in RUNNING state: job clusters, all-purpose (interactive) clusters, and the driver of each. |
| What does NOT count | Terminated or pending clusters, SQL warehouses (those have their own saturation card), serverless compute where node-level CPU is not exposed, and clusters in RESTARTING state during the restart gap. |
| Time window | 1h (rolling, refreshed roughly every 60 seconds) |
| Alert trigger | <30% (under-utilised, right-size down) or >90% (saturated, scale up or split workload) |
| Roles | platform engineering, data engineering, FinOps |
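
As a rough illustration of the data-source step, the sketch below lists clusters through GET /api/2.0/clusters/list and keeps only those in RUNNING state. The function names, the host and token arguments, and the node-count rule (workers plus one driver) are assumptions made for this example. They are not a description of how the card's engine is built.

```python
import json
import urllib.request


def fetch_clusters(host: str, token: str) -> dict:
    """Call GET /api/2.0/clusters/list and return the parsed JSON body."""
    req = urllib.request.Request(
        f"{host}/api/2.0/clusters/list",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def running_clusters(payload: dict) -> list[dict]:
    """Keep RUNNING clusters and attach an estimated node count."""
    result = []
    for cluster in payload.get("clusters", []):
        if cluster.get("state") != "RUNNING":
            continue  # terminated, pending and restarting clusters do not count
        # Assumption: prefer the live executor list, else fall back to the
        # configured worker count. Verify this against your own workspace.
        if "executors" in cluster:
            workers = len(cluster["executors"])
        else:
            workers = cluster.get("num_workers", 0)
        result.append({"cluster_id": cluster["cluster_id"], "nodes": workers + 1})  # +1 driver
    return result


# Offline sample shaped like a clusters/list response (values are made up).
sample = {
    "clusters": [
        {"cluster_id": "a", "state": "RUNNING", "num_workers": 7},
        {"cluster_id": "b", "state": "TERMINATED", "num_workers": 3},
    ]
}
print(running_clusters(sample))
```

Separating the fetch from the filter keeps the filtering logic testable without network access or credentials.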
Calculation
The engine pulls the running-cluster list from the Clusters API, then for each cluster reads the per-node CPU-busy samples over the trailing hour. Each node’s busy percentage is the complement of its idle time (100% - idle%, which folds together user, system, and iowait). Node values are averaged within a cluster, then the cluster averages are combined into one workspace figure weighted by node count:
workspace_avg% = Σ (cluster_avg% × node_count) / Σ node_count

The card then colour-codes workspace_avg% against the 30/90 band, so the colour tells you the action before you read the number.
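
The node-weighted average and the band check described above can be sketched in a few lines. This is a minimal illustration, assuming each cluster has already been reduced to a pair of (average busy %, node count). The function names `workspace_avg` and `band` are hypothetical.

```python
def workspace_avg(clusters: list[tuple[float, int]]) -> float:
    """Node-weighted mean of per-cluster CPU-busy percentages.

    clusters: (cluster_avg_busy_pct, node_count) pairs for RUNNING clusters.
    """
    total_nodes = sum(nodes for _, nodes in clusters)
    if total_nodes == 0:
        raise ValueError("no running nodes to average")
    return sum(pct * nodes for pct, nodes in clusters) / total_nodes


def band(pct: float, low: float = 30.0, high: float = 90.0) -> str:
    """Classify a utilisation figure against the healthy band."""
    if pct < low:
        return "under-utilised"
    if pct > high:
        return "saturated"
    return "healthy"


# A 20-node cluster at 50% and a 2-node cluster at 10%.
fleet = [(50.0, 20), (10.0, 2)]
avg = workspace_avg(fleet)
print(f"{avg:.1f}% -> {band(avg)}")
```

With these inputs the weighted figure is about 46.4%, whereas a plain mean of the two cluster averages would give 30%. That gap is why the larger cluster is allowed to dominate.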
Worked example
A retail analytics platform team runs three clusters during the working day. Snapshot taken on 14 Apr 26 at 09:40 BST, trailing-hour view.

| Cluster | Type | Nodes | Trailing-1h CPU avg | Read |
|---|---|---|---|---|
| prod-etl-nightly | Job | 24 | 22% | Under-utilised: the nightly load finished at 06:00, the cluster is held open by a 4-hour auto-termination window doing nothing |
| prod-bi-interactive | All-purpose | 8 | 94% | Saturated: analysts are running ad-hoc joins against un-optimised tables, CPU pinned, query times climbing |
| prod-ml-feature | Job | 4 | 58% | Healthy band |

prod-etl-nightly at 22% on 24 nodes is pure waste. The auto-termination window is too long for a job that finishes at 06:00. Cutting auto-termination from 4 hours to 20 minutes reclaims roughly 3.5 hours of 24-node compute every day. Pair this with Idle Cluster DBU Wasted (24h) to put a DBU number on it.

prod-bi-interactive at 94% is throttling analysts. Two fixes: enable autoscaling so it adds workers under load, or push the heavy ad-hoc queries onto a SQL warehouse where saturation is managed separately. Confirm the pain in SQL Query Latency p95 (ms) and Slow-Query Rate %.
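
For reference, the headline figure these three clusters would produce can be worked out directly from the table. The snippet is a sketch using only the node counts and trailing-hour averages listed above. The variable names are illustrative.

```python
# (nodes, trailing-1h CPU avg %) taken from the worked-example table
clusters = {
    "prod-etl-nightly": (24, 22.0),
    "prod-bi-interactive": (8, 94.0),
    "prod-ml-feature": (4, 58.0),
}

total_nodes = sum(nodes for nodes, _ in clusters.values())
headline = sum(nodes * pct for nodes, pct in clusters.values()) / total_nodes

print(f"{headline:.1f}% across {total_nodes} nodes")  # 42.0% across 36 nodes

# Clusters outside the 30/90 healthy band
outside = [name for name, (_, pct) in clusters.items() if pct < 30 or pct > 90]
print(outside)
```

The 42% headline sits inside the healthy band even though two of the three clusters fall outside it, which is why the per-cluster breakdown matters as much as the headline number.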
Sibling cards to reference together
| Card | Why pair it with Avg Cluster CPU Utilisation | What the combination tells you |
|---|---|---|
| DBU by Cluster (7d) | Turns the utilisation percentage into a cost ranking per cluster. | A low-utilisation cluster near the top of the DBU table is the highest-value right-sizing target. |
| Idle Cluster DBU Wasted (24h) | Quantifies the spend behind a sub-30% reading. | Sub-30% utilisation plus high idle DBU equals an auto-termination problem you can cost. |
| DBU Burned (24h) | The total burn this utilisation is producing. | Rising burn with flat utilisation means more clusters, not busier ones. |
| SQL Warehouse Saturation % | The warehouse-side equivalent for SQL compute. | Use both to see whether saturation lives in clusters or warehouses before scaling either. |
| SQL Query Latency p95 (ms) | The user-visible symptom of saturation. | High CPU plus high p95 confirms the cluster is the bottleneck, not the query plan. |
| Long-Running Jobs (>1h) | Saturated CPU stretches run times. | Over-90% utilisation co-occurring with long-running jobs points to CPU-bound work. |
| Active Clusters | The denominator: how many clusters this average covers. | A low average across many clusters often means several idle ones inflating the count. |
Reconciling against the source
Where to look in Databricks:

- Compute → Clusters → (select cluster) → Metrics tab for the native per-cluster CPU chart (driver and worker breakdown).
- system.compute.node_timeline system table for the historical per-node CPU series, if system tables are enabled in your account.
- Clusters API (GET /api/2.0/clusters/list) to confirm which clusters were in RUNNING state during the window.

Why our number may legitimately differ from the Databricks UI:

| Reason | Direction | Why |
|---|---|---|
| Per-cluster vs weighted fleet | Variable | The native Metrics tab shows one cluster at a time; Vortex IQ reports a node-weighted average across all running clusters, so the headline rarely matches any single chart. |
| Driver inclusion | Vortex IQ slightly lower or higher | We include the driver node in the average; if you read only worker charts in the UI, busy-worker clusters look higher there. |
| Sampling cadence | Marginal | The live gauge polls roughly every 60 seconds; the system table aggregates at a coarser interval, so a transient spike can appear in one and not the other. |
| Time zone | Chart alignment | Native charts use the workspace time zone; Vortex IQ stores UTC and renders in your Vortex IQ profile time zone. |
| Serverless exclusion | Vortex IQ covers fewer workloads | Serverless compute does not expose node-level CPU; those workloads are absent from this average. |
Known limitations / FAQs
My average sits at 45% but my Databricks bill keeps climbing. How?
A healthy average can hide a fleet of half-idle clusters. Forty-five percent across ten clusters is very different from 45% across two. Always drill into DBU by Cluster (7d): the bill is driven by node-hours, not by the average percentage, so several lightly used clusters cost more than one busy one at the same headline number.
Should I aim for 100% utilisation to maximise value?
No. The healthy ceiling is 90%. Past that, jobs queue on CPU, autoscaling lags behind demand, and run times stretch non-linearly. A cluster pinned at 100% is usually a cluster that needs more workers or a workload that needs splitting, not a win. Target the middle of the 30 to 90 band.
Why is serverless compute missing from this average?
Serverless does not expose node-level CPU metrics to the workspace; Databricks manages the underlying nodes. The card covers classic job and all-purpose clusters only. For serverless workloads, lean on DBU-based cards instead, which do capture serverless consumption.
My nightly job cluster reads 0% for most of the day. Is that a fault?
Only if the cluster is still running. A 0% reading on a RUNNING cluster means it is alive but doing nothing, which is exactly the auto-termination signal this card is built to surface. Shorten the auto-termination window. If the cluster has terminated, it correctly drops out of the average.
Does iowait count as busy?
Yes. We compute busy as 100% - idle%, which rolls user, system, and iowait together. A cluster stuck on iowait (waiting on slow storage or shuffle) reads as busy here even though the CPUs are stalled. If utilisation is high but throughput is low, suspect IO, not compute, and check shuffle and storage in the native cluster metrics.
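
To make the iowait effect concrete, here is a small sketch that derives the busy percentage from idle time and reports how much of that busy time is iowait. The function name and the sample figures are invented for illustration. Real samples may also carry other non-idle states, which the 100% - idle% rule counts as busy in the same way.

```python
def busy_breakdown(idle_pct: float, iowait_pct: float) -> tuple[float, float]:
    """Return (busy %, fraction of busy time that is iowait) for one sample."""
    busy = 100.0 - idle_pct
    iowait_share = iowait_pct / busy if busy > 0 else 0.0
    return busy, iowait_share


# Hypothetical sample: 15% idle, 60% iowait, the rest user and system time.
busy, share = busy_breakdown(idle_pct=15.0, iowait_pct=60.0)
print(f"busy={busy:.0f}%, iowait share of busy={share:.0%}")
# busy=85%, iowait share of busy=71%
```

A node like this reads as 85% busy on the card even though most of that time is spent waiting on IO rather than computing.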
The alert fired at 91% for two minutes then cleared. Should I act?
A brief spike during a join or shuffle is normal and self-clears as autoscaling responds. The card surfaces sustained breaches over the rolling hour, not single samples. Act when utilisation holds above 90% across the window, not on a transient blip.
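
The difference between a blip and a sustained breach can be sketched as follows. This assumes, for illustration only, that "sustained" means the mean of the trailing hour's roughly 60 one-minute samples exceeds the ceiling. The card's exact evaluation rule is not spelled out here, and `sustained_breach` is a hypothetical name.

```python
from statistics import mean


def sustained_breach(samples: list[float], high: float = 90.0) -> bool:
    """True when the rolling-window mean is above the saturation ceiling."""
    return mean(samples) > high


# 58 minutes at 70% plus a two-minute spike to 91%.
blip = [70.0] * 58 + [91.0] * 2
# A full hour pinned at 93%.
pinned = [93.0] * 60

print(round(mean(blip), 1), sustained_breach(blip))      # 70.7 False
print(round(mean(pinned), 1), sustained_breach(pinned))  # 93.0 True
```

Under this assumption a two-minute spike barely moves the hourly figure, while a cluster that stays pinned clearly crosses the ceiling.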
Can I change the 30/90 thresholds?
Yes. The under-utilised and saturated thresholds are configurable per profile in the Sensitivity tab. A team running deliberately bursty batch work may lower the floor; a latency-sensitive interactive team may lower the ceiling to leave headroom.