Slow-Query Rate %, Databricks - Vortex IQ Help Centre

Card class: Hero • Category: Performance

At a glance

Slow-Query Rate % is the share of SQL statements on your Databricks SQL warehouses that exceed a “slow” duration threshold, measured over a rolling 15-minute window. Where the latency-percentile cards tell you how slow the tail is, this card tells you how broad the slowness is. A p99 spike caused by one monster query barely moves the slow-query rate; a warehouse-wide slowdown pushes it up sharply. It is the single best “is this affecting lots of people or just one query?” signal in the Performance category.


What it tracks	The percentage of completed SQL statements classified as slow (duration above the slow threshold) over the window, for the warehouses in scope.
Data source	Databricks SQL query history (`system.query.history` / Query History API): count of statements over the slow threshold divided by total completed statements.
Time window	`15m` (rolling 15-minute window)
Alert trigger	`> 5%`. When more than 5% of queries are slow over the window, the on-call data engineer is notified.
Roles	owner, engineering
Card class	Hero and Sensitivity card: it drives the Performance health signal, and both the slow threshold and the 5% alert level are configurable in the Sensitivity tab.

Calculation

Over the rolling 15-minute window, Vortex IQ reads completed statements from the warehouse query history, counts how many had a total duration above the configured “slow” threshold, and divides by the total number of completed statements:

Slow-Query Rate % = (queries with total_duration > slow_threshold)
                    / (total completed queries) * 100

“Total duration” is the same full wall-clock measure used by the latency-percentile cards: queue wait plus compilation plus execution plus result fetch. The slow threshold is a duration (for example 5 seconds) set in the Sensitivity tab; it defines what “slow” means for your workload and is independent of the 5% alert level, which defines how many slow queries you tolerate.

This is a rate, not a latency, so it is robust to a single very slow query. One 60-second outlier in a window of 2,000 fast queries is a slow-query rate of 0.05%, well below the 5% alert level, however extreme that one query is. In a quiet window of fewer than 100 queries, the same outlier can set p99 by itself. That separation is the point: the percentile cards catch severity, this card catches prevalence.
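The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not Vortex IQ's implementation: the function name `slow_query_rate`, the 800 ms "fast" duration, and the choice to report an empty window as 0% are all assumptions made for the example.

```python
def slow_query_rate(durations_ms, slow_threshold_ms=5000):
    """Percentage of completed queries whose total duration exceeds the threshold."""
    if not durations_ms:
        # Assumption: an empty window is reported as 0% rather than raising.
        return 0.0
    slow = sum(1 for d in durations_ms if d > slow_threshold_ms)
    return 100.0 * slow / len(durations_ms)


# 1,999 fast queries (800 ms each, illustrative) plus one 60-second outlier.
window = [800] * 1999 + [60_000]
rate = slow_query_rate(window)

print(f"{rate:.2f}%")  # 0.05%
print(rate > 5.0)      # False: well below the 5% alert level
```

The comparison is strictly greater-than, so a query that takes exactly the threshold duration is not counted as slow, matching the formula above.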

Worked example

An online grocer runs a shared Serverless Small SQL warehouse serving operational dashboards across merchandising, supply chain, and finance. The slow threshold is set to 5 seconds. Snapshot taken on 28 May 26 at 16:20 BST.

Reading	Value
Total queries in window	1,640
Queries over 5s	138
Slow-Query Rate	8.4% (alert: above 5%)
p50 latency	1,900ms
p95 latency	7,100ms
Warehouse saturation	84%

The card is red at 8.4%, well over the 5% threshold, and the picture across the panel tells a coherent story. Unlike the isolated-p99 case, here the rate, p50, p95, and saturation all moved together.

The slowness is broad, not a single offender. 138 of 1,640 queries are slow. That is not one bad query; a meaningful chunk of the whole workload is degraded. p50 has risen to 1.9s (the typical query is now slow-ish too), confirming the problem is system-wide.
Saturation at 84% points to the cause. The warehouse is near capacity. With many teams hitting the same small warehouse at 16:20 (end-of-day reporting crunch), queries queue and a growing fraction tip over the 5-second line.
The lever is capacity allocation. Because the cause is load on a shared warehouse, the fix is structural: enable multi-cluster auto-scaling so the warehouse adds a cluster during the end-of-day crunch, or split the heaviest team (finance’s large aggregations) onto its own warehouse so it stops crowding out lightweight merchandising dashboards.

Prevalence vs severity, side by side:
  - Slow-Query Rate 8.4%  -> BROAD slowness (many users affected)
  - p95 7.1s             -> the slow ones are genuinely slow
  - p50 1.9s             -> even typical queries are degraded
  - Saturation 84%       -> capacity is the bottleneck
  -> Diagnosis: shared warehouse overloaded at peak.
     Fix: multi-cluster auto-scale OR split heavy team to own warehouse.
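
The reasoning in this worked example can be written as a small triage sketch. It is illustrative only: the function name `triage` and the 80% saturation cut-off are hypothetical (the article does not define what counts as "high" saturation), and only the 5% alert level comes from the card itself.

```python
ALERT_RATE_PCT = 5.0        # alert level used by this card
HIGH_SATURATION_PCT = 80.0  # hypothetical cut-off chosen for illustration


def triage(slow_queries, total_queries, saturation_pct):
    """Return (slow-query rate %, verdict) for one window of readings."""
    rate = 100.0 * slow_queries / total_queries if total_queries else 0.0
    if rate <= ALERT_RATE_PCT:
        verdict = "ok"
    elif saturation_pct >= HIGH_SATURATION_PCT:
        verdict = "capacity"      # broad slowness on a busy warehouse
    else:
        verdict = "data-layout"   # broad slowness on an idle warehouse
    return round(rate, 1), verdict


# Readings from the grocer's snapshot: 138 slow of 1,640, saturation 84%.
print(triage(138, 1640, 84))  # (8.4, 'capacity')
```

With the same rate but saturation at, say, 40%, the sketch returns `'data-layout'`, which corresponds to the query/data fix rather than a scaling one.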

Three takeaways:

Slow-Query Rate measures breadth; percentiles measure depth. A high rate means many users are affected. Always read it alongside SQL Query Latency p95 (ms) and SQL Query Latency p99 (ms) to know both how widespread and how severe the slowness is.
High rate plus high saturation equals a capacity problem. High rate with low saturation, by contrast, points to degraded table layout or missing data pruning, which is a query/data fix, not a scaling one.
There are two thresholds; set both. The slow-duration threshold defines “slow” for your workload, and the 5% alert level defines your tolerance. A warehouse of heavy aggregations may use a higher slow threshold; a latency-sensitive BI warehouse may use a tighter one.

Sibling cards

Card	Why pair it with Slow-Query Rate	What the combination tells you
SQL Query Latency p95 (ms)	The severity of the tail.	High rate plus high p95 equals broad and deep slowness; high rate alone means many queries just over the line.
SQL Query Latency p99 (ms)	The extreme tail.	Low rate plus high p99 equals one or two pathological queries, not a broad problem.
SQL Query Latency p50 (ms)	The median baseline.	A rising p50 alongside the rate confirms even typical queries are now slow.
SQL Warehouse Saturation %	The capacity cause.	High rate plus high saturation equals overload; high rate plus low saturation equals table-layout/query problems.
Avg Cluster CPU Utilisation %	The compute-pressure peer.	Confirms whether the warehouse is CPU-bound during the slow window.
Top 10 Slowest SQL Queries	The named offenders.	Identifies the statements making up the slow fraction so you can rewrite or reschedule them.
SQL Query Error Rate %	The failure peer.	A rising error rate alongside slow-query rate means queries are starting to time out, not just run slowly.
Slow SQL Queries During Checkout Window	The revenue cross-channel view.	Tells you whether the slow fraction overlaps live checkout traffic.

Reconciling against the source

Where to look in Databricks:

Query History in the Databricks SQL workspace, filtered to the same warehouse and 15-minute range: count the statements with duration above your slow threshold against the total to reproduce the rate.
system.query.history (Unity Catalog system tables) is the exact source; a single query reproduces the card.
Warehouse monitoring on the warehouse page shows live queue depth and cluster count, which explain a load-driven rate spike.

To match the card precisely:

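SELECT
  -- share of completed statements whose total wall-clock duration exceeds the slow threshold
  100.0 * COUNT_IF(total_duration_ms > 5000) / COUNT(*) AS slow_query_pct
FROM system.query.history
WHERE compute.warehouse_id = '<your_warehouse_id>'  -- the warehouse ID sits inside the compute struct
  AND execution_status = 'FINISHED'                 -- completed statements only; excludes failed and cancelled
  AND start_time >= current_timestamp() - INTERVAL 15 MINUTES;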

(Replace 5000 with whatever slow threshold you have configured in the Sensitivity tab.)

Why our number may legitimately differ from the Databricks UI:

Reason	Direction	Why
Slow-threshold definition	Variable	The rate depends entirely on your configured slow threshold; if you compare against a different cut-off in the UI, the percentages will not match.
Duration definition	Vortex IQ may read more queries as slow	We use total duration including queue wait; an execution-time-only comparison classifies fewer queries as slow.
System-table latency	Brief lag	`system.query.history` can lag completion by a few seconds, so the most recent statements may be missing from a live reading.
Denominator scope	Variable	We divide by completed statements; if you include cancelled/failed statements or metadata-only commands, the denominator (and the rate) shift.
Time zone / window edges	Marginal	Vortex IQ aligns the 15-minute window to your reporting time zone.
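
Two rows of the table above, duration definition and denominator scope, are easy to see on a toy sample. The sketch below is a simplification under stated assumptions: the `Statement` class and its fields are hypothetical, total duration is reduced to queue wait plus execution (compilation and result fetch are omitted), and the status strings are illustrative labels for completed versus failed statements.

```python
from dataclasses import dataclass


@dataclass
class Statement:
    status: str        # illustrative labels: "FINISHED" or "FAILED"
    queue_ms: int      # time spent waiting for compute
    execution_ms: int  # time spent executing

    @property
    def total_ms(self):
        # Simplified total: compilation and result fetch are left out.
        return self.queue_ms + self.execution_ms


def rate(statements, duration_of, threshold_ms=5000):
    """Slow-query rate % using whichever duration definition is passed in."""
    if not statements:
        return 0.0
    slow = sum(1 for s in statements if duration_of(s) > threshold_ms)
    return 100.0 * slow / len(statements)


sample = [
    Statement("FINISHED", 3000, 3000),  # slow only when queue wait is counted
    Statement("FINISHED", 0, 6000),     # slow under either definition
    Statement("FINISHED", 100, 900),
    Statement("FINISHED", 0, 1200),
    Statement("FAILED", 0, 400),
]
finished = [s for s in sample if s.status == "FINISHED"]

total_rate = rate(finished, lambda s: s.total_ms)          # 2 of 4
exec_only_rate = rate(finished, lambda s: s.execution_ms)  # 1 of 4
with_failed_rate = rate(sample, lambda s: s.total_ms)      # 2 of 5

print(total_rate, exec_only_rate, with_failed_rate)  # 50.0 25.0 40.0
```

The same five statements give three different rates, which is why both the duration measure and the denominator need to match before comparing a hand-computed figure with the card.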

Cross-connector reconciliation:

Card	Expected relationship	What causes divergence
`shopify.total_revenue` / `bigcommerce.total_revenue`	A broad slow-query rate spike during peak browsing can correspond to degraded storefront features if the lakehouse feeds them synchronously.	Revenue steady during a rate spike means the slowness is internal-only (BI/reporting), not customer-facing.
`google_analytics`	Independent front-end timing measurement.	Lakehouse rate high but GA4 timings normal equals back-office-only impact.

Known limitations / FAQs

How is “slow” defined? Is 5 seconds the threshold?
“Slow” is a duration threshold you set in the Sensitivity tab; a common default is 5 seconds. It is separate from the 5% alert level. The 5% governs how many slow queries you tolerate; the slow-duration threshold governs what counts as slow in the first place. Tune both to your workload: a heavy-aggregation warehouse may use a higher slow threshold than a latency-sensitive BI one.

My p99 spiked but the slow-query rate barely moved. Why?
Because they measure different things. p99 is severity (how slow the worst 1% is), and one extreme query can dominate it. Slow-query rate is prevalence (what fraction is slow), and one outlier in thousands of queries is a negligible fraction. A high p99 with a low slow rate is the signature of a small number of pathological queries; investigate via Top 10 Slowest SQL Queries rather than scaling.

The rate is high but warehouse saturation is low. What does that mean?
That rules out load as the cause. When many queries are slow but the warehouse is not busy, the usual culprits are data-side: tables with too many small files, missing partition pruning, stale statistics, or absent Z-ORDER on common filter columns. The fix is OPTIMIZE / Z-ORDER and better table layout, not a bigger warehouse.

Does a single user running lots of bad queries skew the rate?
It can. If one analyst fires fifty unfiltered full-table-scan queries in the window, they alone can push the rate over 5% even on a healthy warehouse. Use Query History grouped by user to spot this; the fix is to coach the user or move ad-hoc exploration onto a separate warehouse so it does not affect the shared rate.

Why a 15-minute window rather than real-time?
A rate needs enough queries to be statistically meaningful. Over a few seconds, a quiet warehouse might run only three queries; one slow query would read as a 33% rate and trigger constant false alarms. The 15-minute window gives a stable denominator while still being responsive enough to catch a developing slowdown.

Can the rate be misleadingly low during an outage?
Yes. If queries are failing or being cancelled before completion, they may drop out of the completed-statement denominator and instead appear as errors. A suspiciously low slow rate alongside a rising SQL Query Error Rate % is a sign that queries are failing rather than finishing slowly. Always read the two together during an incident.

Should ETL and BI traffic share one Slow-Query Rate card?
Ideally no. Heavy ETL transformations are legitimately slow and will inflate the rate, masking genuine BI degradation. Stack the card per warehouse, or scope the connector so a high-throughput interactive warehouse is measured separately from a batch-ETL one, and set an appropriate slow threshold for each.
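
For the single-user question above, a per-user breakdown of the slow queries makes the skew visible. The sketch below is hypothetical throughout: the function `slow_share_by_user`, the user names, and the durations are invented for illustration, and the input is assumed to be `(user, total_duration_ms)` pairs exported from Query History.

```python
from collections import defaultdict


def slow_share_by_user(queries, slow_threshold_ms=5000):
    """Map each user to (slow query count, % of all slow queries in the window)."""
    slow_counts = defaultdict(int)
    for user, duration_ms in queries:
        if duration_ms > slow_threshold_ms:
            slow_counts[user] += 1
    total_slow = sum(slow_counts.values())
    # The comprehension is empty when nothing is slow, so no division by zero.
    return {user: (n, 100.0 * n / total_slow) for user, n in slow_counts.items()}


# Illustrative window: a healthy dashboard workload plus one analyst's fifty scans.
queries = (
    [("dashboards", 900)] * 940
    + [("dashboards", 7_000)] * 10
    + [("analyst_a", 30_000)] * 50
)

overall_rate = 100.0 * sum(1 for _, d in queries if d > 5000) / len(queries)
shares = slow_share_by_user(queries)

print(f"{overall_rate:.1f}%")  # 6.0%
for user, (n, share) in sorted(shares.items(), key=lambda kv: -kv[1][0]):
    print(user, n, f"{share:.1f}%")
# analyst_a 50 83.3%
# dashboards 10 16.7%
```

In this invented window the rate is over the 5% alert level, yet one user accounts for most of the slow queries; without that user the remaining workload would sit at roughly 1%.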

Tracked live in Vortex IQ Nerve Centre

Slow-Query Rate % is one of hundreds of KPI pulses Vortex IQ tracks across Databricks and 70+ other ecommerce connectors. Nerve Centre runs the detection layer; Vortex Mind investigates the cause when something moves; Ask Viq lets you interrogate any number in plain English. Start for free or book a demo to see this metric running on your own data.
