At a glance
The percentage of SQL queries that failed rather than completed successfully, measured over a rolling five-minute window. This is the lakehouse’s “are queries working?” gauge. A small baseline of errors is normal (a user typos a column name, an ad-hoc query times out), but a sustained rate above 1% means something systemic is wrong: a table was dropped or renamed, a schema changed under a dashboard, the metastore is unreachable, or a warehouse is failing health checks. For a platform team this is a front-line outage detector, because a rising error rate is often the first thing that moves when an upstream change breaks downstream consumers.
| What it tracks | Failed queries as a percentage of total queries over the trailing five minutes, as dbx_query_error_rate, rendered as a gauge. |
| Data source | Databricks query history (GET /api/2.0/sql/history/queries) and system.query.history, where the system schema is enabled. The card divides queries with a failed/error status by all terminal queries in the window. |
| Why it matters | A rising error rate is the earliest sign that a schema change, a dropped object, or a metastore problem has broken downstream consumers. It is often the first card to move during a bad deploy. |
| Time window | 5m: a short rolling window so a spike surfaces within minutes rather than being diluted by a long history. |
| Alert trigger | > 1%. Sustained error rate above 1% flags amber/red and pages the platform on-call. |
| Sentiment | Lower is healthier. Zero to 1% is green; 1 to 5% is amber; above 5% is red and usually indicates a broken object or warehouse. |
| Roles | owner, engineering, operations (DBA / platform / SRE) |
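The sentiment bands above can be sketched as a small classifier. This is an illustration only, not the card's actual implementation: the function name `sentiment` is hypothetical, and the handling of the exact boundaries (1% counted as green, 5% counted as amber) is an assumption inferred from the "> 1%" alert trigger and the "above 5%" wording.

```python
def sentiment(error_rate_pct: float) -> str:
    """Map an error rate (in percent) to a colour band.

    Hypothetical helper; boundary handling is an assumption.
    """
    if error_rate_pct < 0:
        raise ValueError("error rate cannot be negative")
    if error_rate_pct <= 1.0:  # assumption: exactly 1% is still green
        return "green"
    if error_rate_pct <= 5.0:  # assumption: exactly 5% is still amber
        return "amber"
    return "red"
```

For example, `sentiment(0.6)` falls in the green band and `sentiment(13.4)` in the red band.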
Calculation
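Over the trailing five-minute window, Vortex IQ reads the query-history feed for the monitored warehouses and classifies each terminal query as success or failure based on its final status. The error rate is the failed count divided by all terminal queries in the window, expressed as a percentage.
Worked example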
A platform team runs a SQL warehouse serving about 60 analyst and dashboard queries per minute against a gold-layer schema. A data-engineering squad ships a model change at 13:00 UTC that renames gold.customer_360.lifetime_value to ltv_gbp. Snapshot taken on 23 Apr 26 around the deploy.
| Time (UTC) | Total queries (5m) | Failed | Error rate % | Note |
|---|---|---|---|---|
| 12:55 | 312 | 2 | 0.6 | Baseline (typos, timeouts) |
| 13:01 | 305 | 41 | 13.4 | Alert: spike after deploy |
| 13:08 | 298 | 39 | 13.1 | Still failing |
| 13:20 | 309 | 3 | 1.0 | Recovered after rollback |
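The percentages in the table can be reproduced from the failed and total counts. The following is a minimal sketch of that arithmetic; the helper name `error_rate_pct`, the one-decimal rounding, and the choice to report 0% for an empty window are assumptions made for illustration.

```python
def error_rate_pct(failed: int, total: int) -> float:
    """Failed queries as a percentage of total queries, to one decimal."""
    if total == 0:
        return 0.0  # assumption: an empty window is reported as 0%
    return round(100.0 * failed / total, 1)


# (time UTC, total queries in 5m, failed) taken from the table above
snapshots = [
    ("12:55", 312, 2),
    ("13:01", 305, 41),
    ("13:08", 298, 39),
    ("13:20", 309, 3),
]

rates = {time: error_rate_pct(failed, total) for time, total, failed in snapshots}
for time, rate in rates.items():
    print(f"{time}  {rate}%")
```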
- Roll back the rename, then migrate forward. The fastest path to green is to revert the model change so consumers work again, then reintroduce the rename as an additive change (add ltv_gbp as a copy, deprecate lifetime_value, update consumers, drop the old column later). The error rate returns to baseline by 13:20.
- Identify every broken consumer from the failures. The drill-down already names the six dashboards and two exports that failed, which is the exact migration checklist. There is no need to guess which downstream objects referenced the column.
- Add a contract check to the deploy. A pre-deploy test that runs each known consumer query against the proposed schema would have caught this before it shipped. The error rate spike is the symptom; the missing schema-contract test is the root cause.
- Read the percentage with the absolute count. 13.4% on 305 queries is a real, broad outage. A similar or higher percentage on a warehouse doing five queries in five minutes (one failure is already 20%) would be a single unlucky ad-hoc query, not an emergency. The card shows both for exactly this reason.
- Error rate is the fastest deploy-regression detector you have. It moved within 60 seconds of the breaking change, faster than anyone filed a ticket. Pair it with your deploy timeline so a spike immediately points at the change that caused it.
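The "read the percentage with the absolute count" point can be expressed as a volume guard in front of the 1% threshold. This is a sketch of one possible approach, not a description of how the card pages: the `should_page` function and the `MIN_QUERIES` floor of 50 are hypothetical, and the guard ignores the "sustained" condition for brevity.

```python
MIN_QUERIES = 50  # hypothetical floor; tune to the warehouse's normal volume


def should_page(failed: int, total: int,
                threshold_pct: float = 1.0,
                min_queries: int = MIN_QUERIES) -> bool:
    """Return True only when the rate exceeds the threshold on real volume."""
    if total < min_queries:
        # Too few queries: one failure would swing the percentage wildly.
        return False
    return 100.0 * failed / total > threshold_pct
```

With these assumed values, 41 failures out of 305 queries would page, while 1 failure out of 5 queries (20%) would not.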
Sibling cards
| Card | Why pair it with SQL Query Error Rate | What the combination tells you |
|---|---|---|
| SQL Query Error Rate Spike (>1% in 5m) | The alert-class companion that fires on the spike. | The gauge shows the level; the spike card is the paging event. |
| SQL Queries per Hour (live) | Provides the denominator context for the rate. | A throughput drop with a steady error count inflates the percentage misleadingly. |
| SQL Query Latency p95 (ms) | Timeouts show up as both slow and failed. | Errors plus rising p95 equals queries failing on timeout, not on schema. |
| SQL Warehouse Saturation % | A saturated warehouse can reject or time out queries. | Errors with high saturation equals capacity-driven failure; errors with low saturation equals a broken object. |
| Failed Jobs (24h) | Jobs that run SQL fail for the same schema reasons. | Both rising after a deploy confirms a backward-incompatible change. |
| Databricks Health Score | The composite that weights error rate heavily. | A sustained error spike pulls the composite into red on its own. |
| Slow-Query Rate % | Distinguishes “slow” from “broken”. | High slow-query rate but low error rate equals performance, not correctness. |
Reconciling against the source
Where to look in Databricks:
- Open SQL → Query History and filter Status = Failed over a five-minute window; the failed count over the total is this card.
- Run SELECT execution_status, count(*) FROM system.query.history WHERE start_time >= now() - interval 5 minute GROUP BY execution_status (where the system schema is enabled) for the success/failure split.
- Each failed row in Query History exposes the error message and class, the same sample the card’s drill-down surfaces.
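Once the per-status counts are in hand, the card's percentage can be approximated by hand. The sketch below assumes the status labels FINISHED, FAILED, and CANCELED, and assumes that cancellations stay in the denominator (they are terminal) but are never counted as failures; the function name `error_rate_from_counts` is hypothetical.

```python
from typing import Mapping

# Assumed terminal status labels; anything else (e.g. still running) is ignored.
TERMINAL_STATUSES = {"FINISHED", "FAILED", "CANCELED"}


def error_rate_from_counts(counts: Mapping[str, int]) -> float:
    """Failed / all terminal queries, as a percentage (unrounded)."""
    total = sum(n for status, n in counts.items() if status in TERMINAL_STATUSES)
    if total == 0:
        return 0.0  # assumption: an empty window is reported as 0%
    failed = counts.get("FAILED", 0)  # cancellations are not failures
    return 100.0 * failed / total


# Illustrative counts only, not real query-history output.
example = {"FINISHED": 300, "FAILED": 3, "CANCELED": 6, "RUNNING": 2}
print(round(error_rate_from_counts(example), 2))
```

If the result differs from the card, the reasons in the table that follows (window definition, system-query filtering, multi-warehouse aggregation) are the first things to check.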
Why our number may legitimately differ from the Databricks UI:
| Reason | Direction | Why |
|---|---|---|
| Cancellation handling | Vortex IQ rate lower | User-initiated cancellations are excluded from our failures; the UI may group them under non-successful statuses. |
| Window definition | Variable | Vortex IQ uses a rolling five-minute window; the UI’s status filter often uses a fixed range you select, so the denominator differs. |
| Terminal-only counting | Slight | We count only finished queries; an in-flight query that later fails appears at the next poll, briefly lagging the live UI. |
| System-query filtering | Variable | Metadata/housekeeping queries can be excluded from the denominator, which changes the percentage versus the raw history view. |
| Multi-warehouse aggregation | Variable | The headline blends all monitored warehouses; a single-warehouse UI view will differ. |

| Card | Expected relationship | What causes divergence |
|---|---|---|
| Slow SQL Queries During Checkout Window | Errors on a storefront-facing warehouse can break embedded analytics during peak. | Errors off-peak on internal warehouses are not customer-facing. |
| Databricks SQL Spike vs Ecom Order Rate | A traffic spike can push a warehouse to time out, raising errors. | Errors with flat traffic point at a schema or object break, not load. |