Card class: Hero
Category: Errors

At a glance

The percentage of SQL queries that failed rather than completed successfully, measured over a rolling five-minute window. This is the lakehouse’s “are queries working?” gauge. A small baseline of errors is normal (a user typos a column name, an ad-hoc query times out), but a sustained rate above 1% means something systemic is wrong: a table was dropped or renamed, a schema changed under a dashboard, the metastore is unreachable, or a warehouse is failing health checks. For a platform team this is a front-line outage detector, because a rising error rate is often the first thing that moves when an upstream change breaks downstream consumers.
What it tracks: Failed queries as a percentage of total queries over the trailing five minutes, as dbx_query_error_rate, rendered as a gauge.
Data source: Databricks query history (GET /api/2.0/sql/history/queries) and system.query.history, where the system schema is enabled. The card divides queries with a failed/error status by all terminal queries in the window.
Why it matters: A rising error rate is the earliest sign that a schema change, a dropped object, or a metastore problem has broken downstream consumers. It is often the first card to move during a bad deploy.
Time window: 5m, a short rolling window so a spike surfaces within minutes rather than being diluted by a long history.
Alert trigger: > 1%. A sustained error rate above 1% flags amber/red and pages the platform on-call.
Sentiment: Lower is healthier. Zero to 1% is green; 1 to 5% is amber; above 5% is red and usually indicates a broken object or warehouse.
Roles: owner, engineering, operations (DBA / platform / SRE)
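The sentiment bands above can be expressed as a small lookup. This is a minimal sketch, not the card's actual implementation: the function name is hypothetical, and treating exactly 1% as green and exactly 5% as amber is an assumption drawn from the "> 1%" alert trigger.

```python
def sentiment_band(error_rate_pct: float) -> str:
    """Map an error rate percentage to a colour band.

    Assumption: boundaries are inclusive on the lower band, so exactly
    1% is still green and exactly 5% is still amber.
    """
    if error_rate_pct < 0:
        raise ValueError("error rate cannot be negative")
    if error_rate_pct <= 1.0:
        return "green"
    if error_rate_pct <= 5.0:
        return "amber"
    return "red"


print(sentiment_band(0.6))   # green
print(sentiment_band(13.4))  # red
```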

Calculation

Over the trailing five-minute window, Vortex IQ reads the query-history feed for the monitored warehouses and classifies each terminal query as success or failure based on its final status:
failed_queries = count(queries with status in {FAILED, ERROR, CANCELED-on-error})
total_queries  = count(all terminal queries in window)
error_rate_%   = (failed_queries / total_queries) * 100
Only terminal queries count: a query still running has no outcome yet and is excluded from both numerator and denominator, so an in-flight long query does not distort the rate.
User-initiated cancellations (someone hits stop on a slow query) are treated as cancellations, not errors, because they reflect intent rather than failure; only cancellations forced by an error condition are counted as failures.
The five-minute window is deliberately short. A schema break tends to fail every query that touches the affected object, so the rate climbs steeply and a short window makes that visible fast. The trade-off is that on a low-traffic warehouse, a handful of failures can produce a high percentage from a small base, so the card surfaces the absolute failed count alongside the percentage. Two failures out of five queries is 40% but is not the same emergency as 200 failures out of 500.
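The calculation above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Vortex IQ's code: the record shape, the status labels, and the cancelled_by_error flag are hypothetical, and keeping user cancellations in the denominator is a reading of "all terminal queries in window".

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Assumed status labels, for illustration only.
FAILURE_STATUSES = {"FAILED", "ERROR"}
IN_FLIGHT_STATUSES = {"QUEUED", "RUNNING"}


@dataclass
class QueryRecord:
    status: str
    cancelled_by_error: bool = False  # hypothetical flag on CANCELED queries


@dataclass
class ErrorRate:
    failed: int               # absolute failed count, shown next to the rate
    total: int                # terminal queries in the window
    percent: Optional[float]  # None when no query has finished yet


def is_failure(query: QueryRecord) -> bool:
    """Failures are failed/error statuses plus error-forced cancellations."""
    if query.status in FAILURE_STATUSES:
        return True
    return query.status == "CANCELED" and query.cancelled_by_error


def error_rate(queries: Iterable[QueryRecord]) -> ErrorRate:
    # In-flight queries are dropped from both numerator and denominator.
    terminal = [q for q in queries if q.status not in IN_FLIGHT_STATUSES]
    failed = sum(1 for q in terminal if is_failure(q))
    total = len(terminal)
    percent = None if total == 0 else failed * 100 / total
    return ErrorRate(failed=failed, total=total, percent=percent)


window = (
    [QueryRecord("FINISHED")] * 3
    + [QueryRecord("FAILED")] * 2
    + [QueryRecord("RUNNING")]
)
result = error_rate(window)
print(result.failed, result.total, result.percent)  # 2 5 40.0
```

Returning the failed count together with the percentage mirrors the small-base caveat: 2 of 5 and 200 of 500 both give 40%, but only the counts tell them apart.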

Worked example

A platform team runs a SQL warehouse serving about 60 analyst and dashboard queries per minute against a gold-layer schema. A data-engineering squad ships a model change at 13:00 UTC that renames gold.customer_360.lifetime_value to ltv_gbp. The snapshot was taken on 23 Apr 26 around the deploy.
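| Time (UTC) | Total queries (5m) | Failed | Error rate % | Note |
|---|---|---|---|---|
| 12:55 | 312 | 2 | 0.6 | Baseline (typos, timeouts) |
| 13:01 | 305 | 41 | 13.4 | Alert: spike after deploy |
| 13:08 | 298 | 39 | 13.1 | Still failing |
| 13:20 | 309 | 3 | 1.0 | Recovered after rollback |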
At 13:01 the gauge jumps from 0.6% to 13.4% and pages the on-call. The error sample in the drill-down is unambiguous.
Dominant error in the window:
  [ANALYSIS_ERROR] Column 'lifetime_value' cannot be resolved on
  gold.customer_360. Did you mean 'ltv_gbp'?
  - 38 of 41 failures share this message
  - All originate from 6 dashboards and 2 scheduled exports
  - Started within 60s of the 13:00 model deploy
The cause is clear: a backward-incompatible column rename broke every consumer still referencing the old name. The decisions:
  1. Roll back the rename, then migrate forward. The fastest path to green is to revert the model change so consumers work again, then reintroduce the rename as an additive change (add ltv_gbp as a copy, deprecate lifetime_value, update consumers, drop the old column later). The error rate returns to baseline by 13:20.
  2. Identify every broken consumer from the failures. The drill-down already names the six dashboards and two exports that failed, which is the exact migration checklist. There is no need to guess which downstream objects referenced the column.
  3. Add a contract check to the deploy. A pre-deploy test that runs each known consumer query against the proposed schema would have caught this before it shipped. The error rate spike is the symptom; the missing schema-contract test is the root cause.
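A contract check of the kind described in step 3 could look roughly like the sketch below. It is a simplified stand-in rather than the test the text describes: instead of running each consumer query against the proposed schema, it compares a declared set of columns per consumer against the proposed column set. The function name, the consumer names, and the idea that consumers declare their columns are all assumptions for illustration.

```python
from typing import Dict, Set


def contract_violations(
    proposed_columns: Set[str],
    consumers: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Return, per consumer, the columns it needs that the proposed schema lacks."""
    violations: Dict[str, Set[str]] = {}
    for consumer, required in consumers.items():
        missing = required - proposed_columns
        if missing:
            violations[consumer] = missing
    return violations


# Hypothetical inputs modelled on the rename in the worked example.
proposed = {"customer_id", "ltv_gbp"}
consumers = {
    "revenue_dashboard": {"customer_id", "lifetime_value"},
    "churn_export": {"customer_id"},
}

broken = contract_violations(proposed, consumers)
print(broken)  # {'revenue_dashboard': {'lifetime_value'}}
```

A deploy pipeline could fail the build whenever the returned mapping is non-empty, which turns the rename into a pre-deploy failure instead of a production spike.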
Two takeaways:
  1. Read the percentage with the absolute count. 13.4% on 305 queries is a real, broad outage. On a warehouse doing five queries in five minutes, a single unlucky ad-hoc query already produces 20%, a higher percentage that is not an emergency. The card shows both for exactly this reason.
  2. Error rate is the fastest deploy-regression detector you have. It moved within 60 seconds of the breaking change, faster than anyone filed a ticket. Pair it with your deploy timeline so a spike immediately points at the change that caused it.
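Pairing a spike with the deploy timeline, as the second takeaway suggests, can be sketched as a nearest-preceding-deploy lookup. This is illustrative only: the function name, the deploy labels, and the five-minute maximum lag are assumptions, not part of the card.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def likely_culprit(
    spike_start: datetime,
    deploys: List[Tuple[datetime, str]],
    max_lag: timedelta = timedelta(minutes=5),  # assumed search horizon
) -> Optional[str]:
    """Return the most recent deploy that preceded the spike within max_lag."""
    candidates = [
        (deployed_at, name)
        for deployed_at, name in deploys
        if timedelta(0) <= spike_start - deployed_at <= max_lag
    ]
    if not candidates:
        return None
    # The latest deploy before the spike is the first suspect.
    return max(candidates, key=lambda deploy: deploy[0])[1]


# Hypothetical deploy timeline around the worked example.
deploys = [
    (datetime(2026, 4, 23, 12, 10), "etl-config update"),
    (datetime(2026, 4, 23, 13, 0), "customer_360 model change"),
]
spike_start = datetime(2026, 4, 23, 13, 0, 45)

print(likely_culprit(spike_start, deploys))  # customer_360 model change
```

A None result is useful too: a spike with no deploy inside the horizon points away from a schema change and towards the platform underneath.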

Sibling cards

| Card | Why pair it with SQL Query Error Rate | What the combination tells you |
|---|---|---|
| SQL Query Error Rate Spike (>1% in 5m) | The alert-class companion that fires on the spike. | The gauge shows the level; the spike card is the paging event. |
| SQL Queries per Hour (live) | Provides the denominator context for the rate. | A throughput drop with a steady error count inflates the percentage misleadingly. |
| SQL Query Latency p95 (ms) | Timeouts show up as both slow and failed. | Errors plus rising p95 equals queries failing on timeout, not on schema. |
| SQL Warehouse Saturation % | A saturated warehouse can reject or time out queries. | Errors with high saturation equals capacity-driven failure; errors with low saturation equals a broken object. |
| Failed Jobs (24h) | Jobs that run SQL fail for the same schema reasons. | Both rising after a deploy confirms a backward-incompatible change. |
| Databricks Health Score | The composite that weights error rate heavily. | A sustained error spike pulls the composite into red on its own. |
| Slow-Query Rate % | Distinguishes “slow” from “broken”. | High slow-query rate but low error rate equals performance, not correctness. |

Reconciling against the source

Where to look in Databricks:
  - Open SQL → Query History and filter Status = Failed over a five-minute window; the failed count over the total is this card.
  - Run the following query (where the system schema is enabled) for the success/failure split:
      SELECT execution_status, count(*)
      FROM system.query.history
      WHERE start_time >= now() - interval 5 minute
      GROUP BY execution_status
  - Each failed row in Query History exposes the error message and class, the same sample the card’s drill-down surfaces.
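To compare the status split from system.query.history with the card, the grouped counts can be turned into a percentage by hand. The sketch below assumes the rows arrive as (execution_status, count) pairs and that failures carry the label FAILED; both are assumptions to check against your workspace. The raw split cannot tell a user cancellation from an error-forced one, so a small gap against the card is expected.

```python
from typing import Iterable, Optional, Tuple


def rate_from_status_counts(rows: Iterable[Tuple[str, int]]) -> Optional[float]:
    """Compute failed / total * 100 from grouped (status, count) rows."""
    counts = dict(rows)
    total = sum(counts.values())
    if total == 0:
        return None  # no queries in the window, so no rate
    failed = counts.get("FAILED", 0)  # assumed failure label
    return failed * 100 / total


# Counts chosen to match the 13:01 row of the worked example.
rows = [("FINISHED", 264), ("FAILED", 41)]
print(round(rate_from_status_counts(rows), 1))  # 13.4
```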
Why our number may legitimately differ from the Databricks UI:
| Reason | Direction | Why |
|---|---|---|
| Cancellation handling | Vortex IQ rate lower | User-initiated cancellations are excluded from our failures; the UI may group them under non-successful statuses. |
| Window definition | Variable | Vortex IQ uses a rolling five-minute window; the UI’s status filter often uses a fixed range you select, so the denominator differs. |
| Terminal-only counting | Slight | We count only finished queries; an in-flight query that later fails appears at the next poll, briefly lagging the live UI. |
| System-query filtering | Variable | Metadata/housekeeping queries can be excluded from the denominator, which changes the percentage versus the raw history view. |
| Multi-warehouse aggregation | Variable | The headline blends all monitored warehouses; a single-warehouse UI view will differ. |
Cross-connector reconciliation:
| Card | Expected relationship | What causes divergence |
|---|---|---|
| Slow SQL Queries During Checkout Window | Errors on a storefront-facing warehouse can break embedded analytics during peak. | Errors off-peak on internal warehouses are not customer-facing. |
| Databricks SQL Spike vs Ecom Order Rate | A traffic spike can push a warehouse to time out, raising errors. | Errors with flat traffic point at a schema or object break, not load. |

Known limitations / FAQs

My error rate is 40% but I am not worried. Should I be?
Check the absolute count first. On a low-traffic warehouse, two failures out of five queries is 40% and may just be one analyst’s mistyped query running twice. The card shows the raw failed count alongside the percentage precisely so you can tell a small-base statistical artefact from a broad outage. A high percentage on a high query volume is the real emergency.
Are user cancellations counted as errors?
No. When a user deliberately stops a slow query, that is intent, not failure, and it is excluded from the error numerator. Only cancellations forced by an error condition (for example, a query killed because its warehouse went unhealthy) count as failures. This keeps the rate focused on genuine problems rather than normal analyst behaviour.
Why a five-minute window instead of an hour?
Schema breaks and object drops fail queries fast and broadly, so a short window makes the spike visible within minutes. A one-hour window would dilute a sharp spike against an hour of healthy queries and delay the alert. The trade-off, more sensitivity to small-base noise on quiet warehouses, is handled by surfacing the absolute count.
What query failures are most common at baseline?
The normal sub-1% baseline is dominated by user errors (mistyped column or table names, permission denials on objects a user cannot access) and the occasional ad-hoc query timing out. These are individual, scattered, and self-correcting. The signal you care about is a correlated spike where many queries fail with the same error, which points at a shared cause like a schema change.
The error rate spiked but every query has a different error. What does that mean?
Scattered, dissimilar errors usually mean an infrastructure problem rather than a single broken object: the metastore briefly unreachable, the warehouse failing health checks, or a network blip to cloud storage. A single shared error message points at a dropped or renamed object; many different messages point at the platform underneath. Pair with SQL Warehouse Saturation % and the warehouse event log.
Does a timeout count as an error here?
Yes, a query that fails because it exceeded its time limit terminates with a failed status and is counted. That is why a rising error rate alongside rising SQL Query Latency p95 (ms) usually means queries are failing on timeout under load, a capacity story, rather than failing on a broken object, a correctness story. The two patterns need different fixes.
How do I connect a spike to the deploy that caused it?
Line the spike’s start time up against your deployment timeline; schema-break spikes typically begin within a minute of the offending deploy. The drill-down’s shared error message names the broken object, and Query History attributes each failure to its consumer, giving you both the cause and the list of things to fix. For a guided trace, Vortex Mind can correlate the spike with recent change events automatically.
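The shared-versus-scattered distinction from the FAQ can be approximated by measuring how concentrated the failures are on one message. This is a rough sketch, not how the drill-down is implemented: the function name and the 50% threshold are assumptions, and real error messages would usually need normalising (stripping query IDs, timestamps) before counting.

```python
from collections import Counter
from typing import Dict, List, Optional, Union


def dominant_error(
    messages: List[str],
    threshold: float = 0.5,  # assumed cut-off between "shared" and "scattered"
) -> Optional[Dict[str, Union[str, int, float]]]:
    """Summarise the most common error message and how dominant it is."""
    if not messages:
        return None
    message, count = Counter(messages).most_common(1)[0]
    share = count / len(messages)
    return {
        "message": message,
        "count": count,
        "share": share,
        # One dominant message suggests a broken object; many different
        # messages suggest a problem in the platform underneath.
        "pattern": "shared" if share >= threshold else "scattered",
    }


# 38 of 41 failures share one message, as in the worked example.
failures = ["Column 'lifetime_value' cannot be resolved"] * 38 + [
    "permission denied",
    "query timed out",
    "table not found",
]
summary = dominant_error(failures)
print(summary["count"], summary["pattern"])  # 38 shared
```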

Tracked live in Vortex IQ Nerve Centre

SQL Query Error Rate % is one of hundreds of KPI pulses Vortex IQ tracks across Databricks and 70+ other ecommerce connectors. Nerve Centre runs the detection layer; Vortex Mind investigates the cause when something moves; Ask Viq lets you interrogate any number in plain English. Start for free or book a demo to see this metric running on your own data.