At a glance
An alert that fires when the share of SQL queries returning an error on the connected SQL warehouses exceeds 1% sustained over a rolling 5-minute window. SQL warehouses serve dashboards, BI tools, and ad-hoc analytics, so a query error-rate spike means real people are watching dashboards fail to load or come back empty. A 1% error rate is low in absolute terms but high for a healthy warehouse: well-behaved estates sit far below it, so crossing 1% and holding there is a clear “something just broke” signal.
| Field | Detail |
|---|---|
| Data source | Databricks system.query.history system table (the authoritative record of every SQL statement, its execution_status, and its error class), aggregated to error-share per 5-minute bucket. The live SQL Warehouses API provides warehouse state for context. |
| Metric basis | A ratio: errored_queries / total_queries within the rolling window, expressed as a percentage. The alert is a spike detector on that ratio, not a raw error count. |
| Aggregation window | 5m rolling, evaluated continuously. |
| Alert trigger | >1% sustained 5m. A momentary blip above 1% that does not persist for the window is suppressed; the rate must hold above the threshold to escalate. |
| What counts as an error | Queries whose execution_status is FAILED (syntax errors, permission denials, analysis errors, runtime errors such as division by zero, and out-of-memory / spill failures). |
| What does NOT count | (1) Successful queries; (2) CANCELED queries (a user or a BI tool aborted them deliberately, often on timeout); (3) queries still RUNNING or QUEUED; (4) metadata-only operations the warehouse does not log as statements. |
| Why “sustained” | A single failed query in a quiet minute can momentarily push the ratio above 1%. Requiring the rate to hold across the window filters that statistical noise out, so the alert fires on genuine sustained degradation, not one unlucky query. |
| Time zone | Workspace time zone for chart axes; UTC for cross-connector windowing. |
| Time window | 5m rolling. |
| Roles | owner, platform engineering, analytics / BI on-call |
Calculation
The engine computes the error share over a rolling 5-minute window of completed SQL statements:
error rate (%) = FAILED / (FINISHED + FAILED) × 100
CANCELED statements are excluded from both the numerator and the denominator. A CANCELED query is almost always a deliberate abort (a user closed a dashboard, a BI tool hit its own client-side timeout) and is not a warehouse failure; counting it would inflate the error rate and fire false alarms during normal interactive use.
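The ratio can be sketched in a few lines of Python. This is a minimal illustration, not the engine's implementation: the function name, the `(end_time, execution_status)` tuple shape, and the choice to key the window on completion time are all assumptions made for the example.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)


def error_rate_pct(statements, now, window=WINDOW):
    """Percentage of completed statements in (now - window, now] that FAILED.

    `statements` is an iterable of (end_time, execution_status) pairs
    (a hypothetical shape chosen for this sketch). CANCELED, RUNNING and
    QUEUED rows are ignored on both sides of the ratio.
    Returns None when nothing completed in the window.
    """
    failed = finished = 0
    for end_time, status in statements:
        if not (now - window < end_time <= now):
            continue  # outside the rolling window
        if status == "FAILED":
            failed += 1
        elif status == "FINISHED":
            finished += 1
    completed = failed + finished
    if completed == 0:
        return None
    return 100.0 * failed / completed


now = datetime(2026, 4, 14, 14, 5)
sample = [
    (now - timedelta(minutes=1), "FINISHED"),
    (now - timedelta(minutes=2), "FAILED"),
    (now - timedelta(minutes=3), "CANCELED"),  # excluded from both sides
    (now - timedelta(minutes=4), "FINISHED"),
    (now - timedelta(minutes=9), "FAILED"),    # older than the window
]
rate = error_rate_pct(sample, now)  # 1 failed out of 3 completed
```

Returning `None` for an empty window, rather than 0, keeps "no traffic" distinguishable from "no errors".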
The “sustained” qualifier is what makes the threshold usable at 1%. In a low-traffic 5-minute window with only, say, 40 queries, a single failure is already 2.5%. Without the sustain requirement the alert would fire constantly during quiet periods on a single transient error. By requiring the rate to hold above 1% across the rolling window, the engine distinguishes a genuine fault (errors keep arriving) from a one-off (one query failed, the rate decays back below threshold on the next evaluation).
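One way to picture the sustain requirement is as a check over consecutive evaluations of the ratio. The sketch below assumes that "sustained" means the last `required` evaluations all exceeded the threshold; the source does not state the evaluation cadence or count, so `required=2` and the function name are hypothetical.

```python
def is_sustained(rates_pct, threshold_pct=1.0, required=2):
    """True when the most recent `required` evaluations all exceed the threshold.

    `rates_pct` is a chronological list of error-rate percentages;
    None entries (no completed queries) never count as a breach.
    `required` is an assumed parameter, not a documented setting.
    """
    if len(rates_pct) < required:
        return False
    recent = rates_pct[-required:]
    return all(r is not None and r > threshold_pct for r in recent)


# Quiet period: one failure in 40 queries is 2.5%, then the rate decays.
quiet_blip = [0.0, 100 * 1 / 40, 0.0]
# Genuine fault: errors keep arriving across evaluations.
ongoing_fault = [0.3, 2.4, 3.3]

blip_fires = is_sustained(quiet_blip)
fault_fires = is_sustained(ongoing_fault)
```

Under these assumptions the single-failure blip never escalates, while the ongoing fault does.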
This card measures correctness (are queries erroring), which is distinct from speed (are queries slow). A warehouse can have a 0% error rate while every query is painfully slow; that latency view lives on SQL Query Latency p95 (ms) and Slow-Query Rate %. The standing gauge for this same ratio is SQL Query Error Rate %; this card is the alerting wrapper that escalates when the gauge spikes.
Worked example
A platform team runs a SQL warehouse serving the company’s BI estate: executive revenue dashboards, a merchandising team’s stock-availability views, and ad-hoc analyst queries. Snapshot taken on 14 Apr 26 at 14:05 BST, mid-afternoon peak.

| 5m bucket | Total queries | Failed | Error rate | Note |
|---|---|---|---|---|
| 13:50 | 920 | 3 | 0.33% | Normal baseline |
| 13:55 | 880 | 4 | 0.45% | Normal |
| 14:00 | 910 | 22 | 2.42% | Spike begins |
| 14:05 | 935 | 31 | 3.32% | Sustained, ALERT |
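The percentages in the table can be rechecked with a short script. The figures are taken directly from the rows above; the variable names are illustrative only.

```python
# (bucket label, total queries, failed queries) from the worked example
buckets = [
    ("13:50", 920, 3),
    ("13:55", 880, 4),
    ("14:00", 910, 22),
    ("14:05", 935, 31),
]
THRESHOLD_PCT = 1.0

# Error rate per bucket, rounded to two decimals as in the table
rates = {label: round(100 * failed / total, 2) for label, total, failed in buckets}

# Buckets whose rate exceeds the 1% threshold
over = [label for label, rate in rates.items() if rate > THRESHOLD_PCT]
```

Two consecutive buckets (14:00 and 14:05) sit above 1%, which is what the table marks as the sustained spike.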
- Classify the errors. The drill-down groups failures by error class. 28 of the 31 share the same message: TABLE_OR_VIEW_NOT_FOUND: gold.revenue_daily. This is not a load problem or an out-of-memory issue; a table the dashboards depend on has gone missing.
- Correlate with a recent change. A deployment at 13:58 renamed gold.revenue_daily to gold.revenue_daily_v2 and missed updating two dashboard datasets. Every refresh of those dashboards now errors. The timing lines up exactly with the spike onset.
- Quantify who is affected. Because the failing queries all originate from the executive revenue dashboards, the blast radius is the leadership team during their afternoon review: high visibility, low query volume but high importance.
- Fix and confirm decay. A backward-compatible view (gold.revenue_daily pointing at _v2) is created as an immediate mitigation. By 14:15 the error rate decays back to 0.4% and the card clears.
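The classification step amounts to grouping failure messages by error class and looking at the largest group. A rough sketch follows, assuming the class is the text before the first colon (as in the message quoted above). The three non-matching failures are not described in the example, so their messages here are placeholders invented for illustration.

```python
from collections import Counter


def error_class(message):
    """Treat the text before the first colon as the error class (an assumption)."""
    return message.split(":", 1)[0].strip()


failed_messages = (
    ["TABLE_OR_VIEW_NOT_FOUND: gold.revenue_daily"] * 28
    # Placeholder messages: the source does not say what the other 3 were.
    + ["OTHER_CLASS_A: placeholder"] * 2
    + ["OTHER_CLASS_B: placeholder"]
)

by_class = Counter(error_class(m) for m in failed_messages)
top_class, top_count = by_class.most_common(1)[0]
top_share_pct = round(100 * top_count / len(failed_messages), 1)
```

A single class accounting for roughly nine in ten failures is the pattern that points at a change rather than load.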
Sibling cards
| Card | Why pair it with SQL Query Error Rate Spike | What the combination tells you |
|---|---|---|
| SQL Query Error Rate % | The standing gauge this alert wraps. | The gauge shows the live ratio; this card escalates when it spikes and sustains. |
| SQL Query Latency p95 (ms) | Separates correctness from speed. | Errors plus high p95 equals an overloaded warehouse; errors with normal p95 equals a schema or grants fault. |
| Slow-Query Rate % | The slowness counterpart of this card. | If both spike together, suspect saturation; if only errors spike, suspect a change. |
| SQL Warehouse Saturation % | Shows whether the warehouse is overloaded. | High saturation alongside the spike points at OOM / spill errors from undersizing. |
| Top 10 Slowest SQL Queries | Names the heaviest queries when load is the cause. | A heavy query starving the warehouse can cause both slowness and errors. |
| SQL Queries per Hour (live) | The volume context behind the ratio. | A spike during a query-volume surge is load-driven; a spike at flat volume is change-driven. |
| Slow SQL Queries During Checkout Window | The cross-channel view if warehouse SQL backs storefront data. | Errors co-occurring with a checkout dip is a revenue-impacting fault. |
Reconciling against the source
Where to look in Databricks:
- SQL → Query History in the workspace UI, filtered to the last 5 to 15 minutes and to the Failed status, to see the failing statements and their error messages.
- system.query.history system table in Unity Catalog: the authoritative source. Query execution_status grouped by 5-minute bucket to reproduce the ratio exactly, and group by error class to classify the spike.
- SQL warehouse monitoring tab for the specific warehouse, to confirm whether the spike correlates with saturation, scaling, or a restart.
Why our number may legitimately differ from Query History:
| Reason | Direction | Why |
|---|---|---|
| Cancelled-query handling | Vortex IQ rate lower | We exclude CANCELED from both numerator and denominator; a UI filter that lumps cancellations in with errors will read higher. |
| Window definition | Variable | This card uses a rolling 5-minute window; Query History defaults to a longer span. Match the range to reconcile. |
| system.query.history latency | Brief lag | The system table can take a short time to record the most recent statements; the live count may trail the UI by a minute. |
| Warehouse scope | Variable | The card aggregates across the connected warehouses; a single-warehouse UI view will differ if you have several. |
| Denominator basis | Variable | We divide by completed queries (FINISHED + FAILED); if you compute the ratio against all submitted statements (including queued) you will get a different number. |
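The effect of the numerator and denominator choices can be seen with a small comparison. The FINISHED and FAILED counts below match the 14:05 bucket of the worked example; the CANCELED and QUEUED counts are hypothetical values added only to show the direction of the difference.

```python
def rate_pct(numerator, denominator):
    """Ratio as a percentage, rounded to two decimals."""
    return round(100 * numerator / denominator, 2)


# FINISHED + FAILED = 935 as in the 14:05 bucket; CANCELED and QUEUED are made up.
counts = {"FINISHED": 904, "FAILED": 31, "CANCELED": 40, "QUEUED": 25}

completed = counts["FINISHED"] + counts["FAILED"]

# This card's basis: FAILED over completed, cancellations excluded entirely.
card_basis = rate_pct(counts["FAILED"], completed)

# A view that lumps cancellations in with errors reads higher.
cancels_as_errors = rate_pct(
    counts["FAILED"] + counts["CANCELED"], completed + counts["CANCELED"]
)

# Dividing by every submitted statement (including queued) reads lower.
all_submitted = rate_pct(counts["FAILED"], sum(counts.values()))
```

With these illustrative counts the three bases give 3.32%, 7.28% and 3.1% for the same five minutes, so agreeing on the basis comes before comparing numbers.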
| Card | Expected relationship | What causes divergence |
|---|---|---|
| Databricks SQL Spike vs Ecom Order Rate | If warehouse SQL backs storefront features, an error spike may coincide with degraded storefront behaviour. | An error spike with no order-rate change means the failing queries were internal BI, not customer-facing. |
| Slow SQL Queries During Checkout Window | An error spike during the checkout window is the highest-priority case. | Errors outside any checkout-critical path are lower urgency despite the same headline number. |
Known limitations / FAQs
Our error rate briefly hit 5% but the alert never fired. Why?
The trigger requires the rate to be sustained above 1% across the rolling 5-minute window, not just to touch it. In a low-traffic minute a single failed query can momentarily spike the ratio and then decay on the next evaluation. That is exactly the statistical noise the sustain requirement is designed to suppress. If the rate genuinely held above 1% and still did not fire, check the warehouse scope: the failures may have landed on a warehouse not included in the connector.
Why is 1% the threshold? That seems very low.
For a healthy SQL warehouse it is high. Well-behaved BI estates run their error rate far below 1%, with the rare failure being a malformed ad-hoc query. Sitting at or above 1% for five minutes means failures are arriving systematically, which is the definition of a fault. If your estate runs a lot of exploratory ad-hoc SQL where occasional user errors are normal, raise the threshold in the Sensitivity tab to match your baseline.
Are cancelled or timed-out queries counted as errors?
Cancelled queries (CANCELED) are excluded, because a cancellation is a deliberate abort by a user or a BI tool, not a warehouse failure. Client-side timeouts that the BI tool aborts also surface as cancellations and are excluded. Only statements that reached a FAILED terminal state count. This keeps the card focused on genuine query failures rather than normal interactive behaviour.
The spike is all one error class. What does that tell me?
A spike dominated by a single error class almost always points at a recent change rather than load. TABLE_OR_VIEW_NOT_FOUND means a renamed or dropped object; PERMISSION_DENIED means a grant was revoked or a service principal lost access; PARSE/ANALYSIS errors mean a deployed query is malformed. The first move is to look at what shipped in the last ten minutes, not to scale the warehouse.
The errors are mixed and many mention spill or out-of-memory. Is that the same problem?
No, that is a load problem, not a change problem. OOM and disk-spill failures mean the warehouse is undersized for the query mix it is being asked to run. Read SQL Warehouse Saturation % and Slow-Query Rate % alongside this card; the fix is usually to scale the warehouse up, enable multi-cluster scaling, or tune the heaviest queries from Top 10 Slowest SQL Queries.
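The two FAQ answers above describe a simple triage rule: failures that mostly mention spill or out-of-memory suggest load, while a spike dominated by one error class suggests a recent change. A hedged sketch of that rule follows. The thresholds (`dominance=0.8`, `load_share=0.5`), the keyword list, and the function name are assumptions for illustration, not product settings.

```python
from collections import Counter

# Hints drawn from the FAQ text; anything not listed falls back to a generic hint.
CHANGE_HINTS = {
    "TABLE_OR_VIEW_NOT_FOUND": "renamed or dropped object",
    "PERMISSION_DENIED": "revoked grant or lost service-principal access",
}
LOAD_KEYWORDS = ("out of memory", "out-of-memory", "spill")  # assumed keywords


def triage(failures, dominance=0.8, load_share=0.5):
    """Classify a spike from a list of (error_class, message) pairs.

    Returns (verdict, hint) where verdict is "none", "load", "change" or "mixed".
    """
    if not failures:
        return ("none", "no failures in the window")
    total = len(failures)
    load_hits = sum(
        1 for _, message in failures
        if any(keyword in message.lower() for keyword in LOAD_KEYWORDS)
    )
    if load_hits / total >= load_share:
        return ("load", "check saturation and the heaviest queries")
    top_class, top_count = Counter(cls for cls, _ in failures).most_common(1)[0]
    if top_count / total >= dominance:
        return ("change", CHANGE_HINTS.get(top_class, "review recent deployments"))
    return ("mixed", "no dominant class; inspect failures individually")


# Placeholder data: class names and messages other than the first are invented.
rename_spike = (
    [("TABLE_OR_VIEW_NOT_FOUND", "TABLE_OR_VIEW_NOT_FOUND: gold.revenue_daily")] * 28
    + [("OTHER_CLASS", "placeholder message")] * 3
)
memory_spike = [
    ("CLASS_A", "executor ran out of memory"),
    ("CLASS_B", "disk spill limit exceeded"),
    ("CLASS_C", "placeholder message"),
]

rename_verdict = triage(rename_spike)
memory_verdict = triage(memory_spike)
```

Checking for load signals first means a spike made up entirely of one memory-related class is still read as a sizing problem rather than a deployment problem.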
Does this card cover errors on interactive clusters, or only SQL warehouses?
This card reads SQL statement history, which covers queries run against SQL warehouses (the dashboard and BI path). Spark jobs failing on all-purpose or job clusters are a different failure surface and are tracked via Failed Jobs (24h) and Failed Job Burst (>5 failures in 1h). During a metastore-wide outage you may see both surfaces light up together.