Query Error Rate Spike (>1% in 1h), Snowflake

Card class: Hero • Category: Nerve Centre

At a glance

An alert that fires when more than 1% of Snowflake statements fail over a sustained one-hour window. A healthy production account sits well under 1%: the background noise of cancelled queries and the odd syntax error from an ad-hoc worksheet. When the failure rate crosses 1% and stays there for an hour, something systemic has broken: a permissions change, a schema migration that dropped a column a job still references, a warehouse that keeps hitting statement timeouts, or an upstream load that is feeding malformed data. This card turns that into a single page rather than a flood of individual failures.


What it tracks	The proportion of statements ending in a failure state over a rolling one-hour window, evaluated against the 1% threshold.
Data source	`detail`: Alerts for Query Error Rate Spike (>1% in 1h). Computed from `EXECUTION_STATUS` and `ERROR_CODE` in `SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY`.
Time window	`1h` (rolling one-hour window).
Alert trigger	`>1% sustained 1h`. Error rate above 1% held for the full hour, not a single-minute blip.
Roles	owner, platform, SRE, data engineering

Calculation

The engine reads completed statements from QUERY_HISTORY over the trailing hour and computes:

error_rate = failed_statements / total_statements   (over the rolling 1h window)

FIRE when error_rate > 0.01 sustained for the full hour

A statement counts as failed when its EXECUTION_STATUS is FAIL. User-initiated cancellations (EXECUTION_STATUS = 'INCIDENT' or explicit cancels) are handled separately so a developer aborting their own slow query does not inflate the rate. The ERROR_CODE and ERROR_MESSAGE columns are carried through so the alert can group failures by cause: a wall of identical error codes (for example 002003 object does not exist, or 000604 statement reached its timeout) tells the on-call exactly which class of failure is driving the spike. The “sustained” qualifier matters. A schema deploy can momentarily push the error rate above 1% for a minute as in-flight queries hit a table mid-rename. The card requires the breach to hold across the hour window before it pages, which filters those self-healing blips while still catching a real, persistent failure mode. Because QUERY_HISTORY in ACCOUNT_USAGE carries up to ~45 minutes of latency, the live-view variant of this card also reads INFORMATION_SCHEMA.QUERY_HISTORY (low latency, last 7 days) so the spike is caught in near real time rather than 45 minutes late.

Worked example

A data engineering team runs a Snowflake account behind an ecommerce analytics stack. At 14:00 on 18 Apr 26 a deploy renamed ORDERS.SHIPPING_ADDRESS to ORDERS.SHIP_ADDRESS but missed three downstream views and a reverse-ETL job. Snapshot of the 14:00 to 15:00 window:

Metric	Prior baseline (hourly)	14:00 to 15:00 window
Total statements	~9,000	9,140
Failed statements	~20 (0.2%)	412 (4.5%)
Dominant error code	mixed	`002003` object does not exist

The error rate held at 4.5% for the full hour, so the alert fired at 14:08 once the breach was sustained. The dashboard groups the failures by code:

Error breakdown, 14:00 to 15:00:
  002003 object/column does not exist   388 failures  (94%)
  000604 statement timeout               14 failures
  003001 insufficient privileges          6 failures
  other                                    4 failures

The 94% concentration on a single “does not exist” code points straight at the schema change. The on-call engineer does the following, in order:

Read the dominant code, not the count. 388 of 412 failures are the same object-not-found error. This is one root cause, not 412 problems. The deploy is the obvious suspect given the timing.
Confirm the blast radius. The failing statements reference SHIPPING_ADDRESS across three views and one job. Storefront dashboards relying on those views are showing stale or empty data, so the impact is visible to business users even though the warehouse itself is healthy.
Decide: roll forward or roll back. Fastest safe fix is a backward-compatible view that aliases SHIP_ADDRESS AS SHIPPING_ADDRESS, restoring the contract in minutes while the downstream objects are updated properly.
Watch the rate decay. Once the alias ships, the rolling rate should fall back under 1% within the hour as the failing queries start succeeding. The card auto-resolves when the window drops below threshold.

Impact framing:
  ~400 failed statements/hour, mostly dashboard refreshes feeding ops and merchandising teams.
  No direct credit cost (failed queries still consume compute up to the failure point), but
  several hours of stale operational reporting until the alias lands.

The lesson: in Snowflake the error-rate spike is usually a contract break (schema, permissions, or data shape), and the dominant ERROR_CODE is the fastest route to root cause. Read the code distribution before you read anything else.

Sibling cards

Card	Why pair it with Query Error Rate Spike	What the combination tells you
Query Error Rate %	The continuous gauge this alert is built on.	The gauge shows the trend; the alert pages when it crosses 1% sustained.
Top 10 Slowest Queries	Failures and timeouts often share a root cause.	If `000604` timeout dominates, the slow-query list shows which statements are timing out.
Failed Logins (24h)	The authentication-side error peer.	A privilege-error spike (`003001`) alongside failed logins suggests a broken role grant after an access change.
Query Latency p95 (ms)	Latency and timeouts move together.	Rising p95 plus timeout-driven failures equals an undersized or saturated warehouse.
Warehouse Queueing Sustained	Queue overflow can surface as failures.	Sustained queueing plus timeout errors means the warehouse cannot keep up.
Snowflake Health Score	The composite that weights error rate.	A firing error-spike alert pulls the headline score below threshold.
Slow Analytics Queries During Checkout Window	The cross-channel revenue-risk peer.	Error and slowness during a checkout window is the highest-priority combination.

Reconciling against the source

Where to look in Snowflake:

SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY is the authoritative statement ledger. Filter EXECUTION_STATUS = 'FAIL' over the hour to reproduce the numerator. INFORMATION_SCHEMA.QUERY_HISTORY (table function) for the low-latency live view used to catch the spike in near real time. Snowsight, Monitoring to Query History for the managed-service console view, where you can filter by status and group by error message.

A representative reconciliation query:

SELECT COUNT_IF(EXECUTION_STATUS = 'FAIL') AS failed,
       COUNT(*) AS total,
       ROUND(100 * failed / NULLIF(total, 0), 2) AS error_pct
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE START_TIME >= DATEADD('hour', -1, CURRENT_TIMESTAMP());

Why our number may legitimately differ from Snowflake’s console:

Reason	Direction	Why
ACCOUNT_USAGE latency	Lag	`ACCOUNT_USAGE.QUERY_HISTORY` can lag up to ~45 minutes; the live card reads `INFORMATION_SCHEMA` to compensate, so a fresh worksheet query against ACCOUNT_USAGE may show fewer rows than the live card.
Cancellation handling	Vortex IQ rate lower	User-cancelled statements are excluded from the failure numerator; a Snowsight filter on “not success” will include cancellations and read higher.
Time zone	Window edges shift	Snowsight groups in account time zone; Vortex IQ aligns to your reporting time zone.
Statement scope	Variable	The card counts statements, including DDL and utility commands; a view filtered to SELECTs only will read a different denominator.

Cross-connector reconciliation:

Card	Expected relationship	What causes divergence
`shopify.total_revenue` / `bigcommerce.total_revenue`	A spike that breaks reporting views does not directly move revenue, but breaks the dashboards leadership uses to watch it.	Revenue flat while the error spike runs confirms the impact is reporting visibility, not storefront.
Slow Analytics Queries During Checkout Window	Co-occurring slowness and errors during peak commerce windows.	If both fire together, prioritise; analytics load is contending with revenue-critical processing.

Known limitations / FAQs

My developers cancel slow queries all the time. Will that trip the 1% threshold? No. User-initiated cancellations are excluded from the failure numerator. The card counts statements that ended in EXECUTION_STATUS = 'FAIL', which is a genuine error, not someone aborting their own query. If your cancellation volume is unusually high you can see it in Query History, but it will not inflate this alert. Why 1% and why “sustained for an hour”? A well-run production account sits comfortably below 1% failures: the residual noise of typos in ad-hoc worksheets and the occasional transient. Crossing 1% means a class of statements is failing systematically. The one-hour sustain requirement filters the brief self-healing blips (a schema deploy, a momentary warehouse hiccup) so you are paged for persistent breakage, not transients. Both the threshold and the window are tunable per profile in the Sensitivity tab. The alert fired but each failure is a different error code. What does that mean? A spread of unrelated error codes usually points at infrastructure rather than a single contract break: a warehouse hitting repeated statement timeouts under load, a network or cloud-services blip, or a credential expiry affecting multiple roles. Read the code distribution: a single dominant code is a clean root cause (schema, permission, data shape); a flat spread points at capacity or platform. Pair with Query Latency p95 and Warehouse Saturation %. Do failed queries still cost credits? Yes, partially. A statement that fails part-way through has already consumed compute on its warehouse up to the point of failure, so a high-volume error spike does carry some credit cost even though no results are returned. It is rarely the headline cost, but a tight retry loop that fails repeatedly can add up; cross-check Credits Burned (24h). Does this catch errors in tasks and Snowpipe, or only interactive queries? It counts statements recorded in QUERY_HISTORY, which includes statements run by tasks and stored procedures. Snowpipe and serverless ingestion have their own histories (COPY_HISTORY, PIPE_USAGE_HISTORY) and are not folded into this rate; for ingestion failures use the ingest-focused cross-channel cards such as Snowflake Event Ingest vs Ecom Orders. The console shows a different failure count than the card. Why? Most often latency: ACCOUNT_USAGE.QUERY_HISTORY can lag up to 45 minutes, so a worksheet query against it during a live spike will undercount. The card reads the low-latency INFORMATION_SCHEMA view to stay current. Secondary causes are cancellation handling (the card excludes user cancels) and time-zone alignment on the window edges.

Tracked live in Vortex IQ Nerve Centre

Query Error Rate Spike (>1% in 1h) is one of hundreds of KPI pulses Vortex IQ tracks across Snowflake and 70+ other ecommerce connectors. Nerve Centre runs the detection layer; Vortex Mind investigates the cause when something moves; Ask Viq lets you interrogate any number in plain English. Start for free or book a demo to see this metric running on your own data.

​At a glance

​Calculation

​Worked example

​Sibling cards

​Reconciling against the source

​Known limitations / FAQs

​Tracked live in Vortex IQ Nerve Centre