At a glance
Statement Error Rate % is the share of SQL statements that returned an error instead of completing successfully, measured over a rolling 5-minute window. It is the cluster’s quality signal sitting next to the throughput (quantity) signal: throughput tells you how much work the database is doing, error rate tells you how much of it is failing. A rising error rate is one of the earliest, broadest symptoms of trouble, whether the cause is contention forcing transaction aborts, a connection pool refusing new sessions, a node loss, or a bad application deploy pushing malformed SQL. Because errors translate directly into failed user actions (a checkout that would not save, a cart that would not update), this is a card that pages.
| What it tracks | Statement Error Rate % for the selected period: SQL statements that returned an error as a percentage of all statements executed, over the rolling window. |
| Data source | The errored-statement counters (sql.failure.count and the per-error-class counters) measured against total executed statements (sql.query.count). The DB Console SQL dashboard surfaces the same signal in its SQL Statement Errors chart. |
| Time window | 5m. A rolling 5-minute window smooths single transient failures while still catching a developing problem within minutes. |
| Alert trigger | > 1%. Above roughly 1 in 100 statements failing, the failures are no longer background noise; they represent user-visible breakage at a rate worth paging on. |
| Roles | DBA, platform, SRE |
Calculation
The card divides errored statements by total executed statements over the rolling 5-minute window and expresses the result as a percentage. The numerator is the SQL failure counter (sql.failure.count), which increments whenever a statement returns an error to the client; the denominator is the total executed-statement counter (sql.query.count). Computing it as a ratio rather than an absolute count is deliberate: 50 errors per second is 5% of traffic at 1,000 statements/sec but only 0.1% at 50,000 statements/sec, so the rate normalises across traffic levels and stays comparable as the baseline moves.
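The arithmetic can be sketched as follows. This is a minimal illustration, assuming the two counters are cumulative and are sampled at the start and end of the window; the function name and the handling of zero traffic or counter resets are choices made for this sketch, not documented card behaviour.

```python
def error_rate_pct(failures_start, failures_end, total_start, total_end):
    """Percentage of statements that errored between two counter samples.

    Arguments are cumulative counter readings (for example sql.failure.count
    and sql.query.count) taken at the start and end of the window.
    """
    errored = failures_end - failures_start
    executed = total_end - total_start
    if executed <= 0 or errored < 0:
        # No traffic, or a counter went backwards (e.g. a restart).
        # Returning 0.0 here is an assumption of this sketch.
        return 0.0
    return 100.0 * errored / executed


# Figures from the 19:40 row of the worked example: 41,200 errors in 3.0M statements.
print(f"{error_rate_pct(0, 41_200, 0, 3_000_000):.2f}%")
```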
Not all errors are equal, and reading the error class matters as much as the rate. The big categories on CockroachDB are: retry errors (most commonly 40001 serialization failures from transaction contention, often retried transparently and sometimes surfaced to the client), connection errors (refused or dropped sessions, typically from pool saturation or a node loss), constraint and syntax errors (23505 unique-violation, 42xxx syntax, usually application or schema problems, not infrastructure), and unavailability errors (a range without quorum). The 5-minute window is short enough to surface a developing incident quickly but long enough that one fluky failure does not move the headline. Sustained breach above 1% is the actionable state because it means a meaningful fraction of user actions are failing.
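Reading the class can be approximated by bucketing SQLSTATE codes. The sketch below is illustrative only: the function names and class labels are hypothetical, constraint and syntax errors are split into separate buckets for readability, connection errors are assumed to arrive under the standard SQLSTATE class 08, and unavailability errors fall into "other" because this page does not give a code for them.

```python
from collections import Counter


def classify_sqlstate(code: str) -> str:
    """Map a five-character SQLSTATE code to a coarse error class."""
    if code == "40001":
        return "retry"        # serialization failure (contention)
    if code.startswith("08"):
        return "connection"   # SQLSTATE class 08: connection exception (assumed mapping)
    if code.startswith("23"):
        return "constraint"   # class 23: integrity constraint violation, e.g. 23505
    if code.startswith("42"):
        return "syntax"       # class 42: syntax error or access rule violation
    return "other"            # includes unavailability; no code is given on this page


def dominant_class(codes):
    """Return the most frequent error class in a batch of codes, or None if empty."""
    counts = Counter(classify_sqlstate(c) for c in codes)
    return counts.most_common(1)[0][0] if counts else None
```

A batch dominated by 40001 codes returns "retry", which points at contention rather than at the pool or a deploy.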
Worked example
A platform team runs a 6-node CockroachDB cluster behind the order, cart, and inventory services. Steady-state error rate sits around 0.05% (almost entirely harmless transient retries). Snapshot taken on 18 Jun 26 at 19:40 BST, mid evening peak.
| Time (5m window) | Statements | Errored | Error rate % | State |
|---|---|---|---|---|
| 19:25 | 2.7M | 1,400 | 0.05% | healthy |
| 19:35 | 2.9M | 23,400 | 0.81% | climbing |
| 19:40 | 3.0M | 41,200 | 1.37% | alert |
The dominant error class is 40001 serialization failures, the contention signature, not connection errors or syntax errors.
That points the investigation cleanly. A contention spike means many transactions are fighting over the same rows. The DBA correlates with Transaction Retries (24h) (retries up sharply in the last 15 minutes) and Top Contended Statements, which names the offender: an inventory-decrement statement hammering a single hot SKU row during a flash sale on one product.
- The rate tells you something is wrong; the error class tells you what. A 1.37% rate made of 40001 retries (contention) is a completely different incident from 1.37% made of connection refusals (pool saturation) or 42xxx (a bad deploy). Always read the class before acting.
- A ratio, not a count, is the right alarm. During peak traffic a large absolute error count can still be a tiny percentage; during a quiet window a handful of errors can be a high percentage. The 1% rate is what reliably maps to user impact.
- Errors are user-visible by definition. Every counted error is an action that failed for someone. Pair the rate with the business cards (orders, checkout) to size who was affected, then fix at the layer the error class points to.
Sibling cards
| Card | Why pair it with Statement Error Rate % | What the combination tells you |
|---|---|---|
| Statement Error Rate Spike (>1% in 5m) | The alert-list view of breaches. | When this gauge crosses 1%, the alert card records the spike, onset, and dominant error class. |
| Transaction Retries (24h) | The contention driver behind retry errors. | Error rate rising in step with retries indicates a contention-driven spike, not an infrastructure fault. |
| Top Contended Statements | Names the statement causing serialization failures. | Identifies the exact hot row or query behind a 40001 spike. |
| Connection Pool Saturation % | A different error source: refused connections. | Error rate rising together with saturation indicates connection errors, not contention; fix the pool. |
| Unavailable Ranges | The most serious error source: quorum loss. | Errors with unavailable ranges above 0 mean data is offline; this is a top-severity incident. |
| Statements per Second (live) | The throughput denominator. | Error rate read against throughput separates “a few errors at high volume” from “broad failure”. |
| CockroachDB Health Score | The composite that weights error rate. | A sustained error-rate breach drags the health score down quickly. |
Reconciling against the source
To confirm the figure natively, open the DB Console SQL dashboard and read the SQL Statement Errors chart, then break errors down by class. Query crdb_internal.node_metrics for the sql.failure.count and per-error-class counters against sql.query.count. The DB Console SQL Activity → Statements page shows failure counts per statement fingerprint, and crdb_internal.cluster_contention_events confirms whether the errors are contention-driven. On CockroachDB Cloud the SQL errors chart appears on the cluster Metrics page.
| Reason our number may differ | Direction | Why |
|---|---|---|
| Retry accounting. Transparently retried (and eventually succeeding) transactions may or may not be counted as errors. | Variable | The card reflects statements that returned an error to the client; a view counting only final outcomes reads lower. |
| Error-class scope. Some native charts show one class (for example only connection errors). | Vortex IQ usually higher | The card sums all error classes into the headline; a single-class chart reads lower. |
| Window length. The card uses a rolling 5-minute window. | Marginal | A chart smoothed over a longer resolution flattens a short spike the card catches. |
| Time zone. Chart axes render in the cluster locale. | Cosmetic | Axis labels shift; the rate does not. |
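When reconciling from crdb_internal.node_metrics, a single reading is not directly comparable to the card: the sketch below assumes the counters are cumulative, so a windowed rate needs two samples taken five minutes apart. It also assumes each sample is a list of (name, value) pairs, as might be returned by a query such as `SELECT name, value FROM crdb_internal.node_metrics WHERE name IN ('sql.failure.count', 'sql.query.count')`; the column names, the function name, and the sample values are assumptions of this sketch, and a per-node table would need summing across nodes to match a cluster-wide figure.

```python
def window_error_rate(sample_start, sample_end):
    """Error rate % between two samples of (metric name, value) pairs.

    Returns None when the window is unusable (no traffic or a counter reset).
    """
    start = dict(sample_start)
    end = dict(sample_end)
    errored = end["sql.failure.count"] - start["sql.failure.count"]
    executed = end["sql.query.count"] - start["sql.query.count"]
    if executed <= 0 or errored < 0:
        return None
    return 100.0 * errored / executed


# Illustrative readings only, not taken from a real cluster.
before = [("sql.failure.count", 10_000.0), ("sql.query.count", 50_000_000.0)]
after = [("sql.failure.count", 51_200.0), ("sql.query.count", 53_000_000.0)]
print(window_error_rate(before, after))
```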
Known limitations / FAQs
My error rate spiked but the cluster looks healthy. What is going on?
Read the error class. By far the most common cause of a spike on a healthy cluster is 40001 serialization failures from transaction contention, many transactions fighting over the same hot rows. The infrastructure is fine; the fix is in the data-access pattern (batch writes, reservation patterns, sharded counters). Check Top Contended Statements.
What is a 40001 error and is it actually a problem?
40001 is a serialization failure: CockroachDB enforces serializable isolation by default, so when transactions conflict it aborts one and expects a retry. CockroachDB retries many of these transparently, in which case they never reach the user. They become a problem when contention is high enough that retry budgets are exhausted and the failure surfaces to the client. A low background rate is normal; a spike is contention to fix at the source.
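For failures that do reach the client, applications commonly wrap the transaction in a bounded retry loop. The sketch below is a generic illustration, not a specific driver's API: `SerializationFailure` is a hypothetical stand-in for whatever exception your driver raises with SQLSTATE 40001, and the attempt limit and backoff values are arbitrary.

```python
import random
import time


class SerializationFailure(Exception):
    """Hypothetical stand-in for a driver error carrying SQLSTATE 40001."""
    sqlstate = "40001"


def run_with_retries(txn, max_attempts=5, base_delay=0.05, sleep=time.sleep):
    """Call txn() and retry on serialization failure, up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except SerializationFailure:
            if attempt == max_attempts:
                # Retry budget exhausted: the failure surfaces to the caller.
                raise
            # Exponential backoff with jitter so contending clients spread out.
            delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            sleep(delay)
```

The `sleep` parameter exists only so the delay can be replaced in tests; in normal use the default `time.sleep` applies.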
Why a percentage instead of an error count?
Because the same count means very different things at different traffic levels. Three hundred errors in a minute is about 0.01% of traffic at 50,000 statements/sec but already 1% at 500 statements/sec. The ratio normalises against traffic so the 1% threshold maps consistently to real user impact.
The rate is high but made of syntax errors, not retries. Whose problem is that?
Almost certainly an application or schema problem, not the database. A burst of 42xxx (syntax) or 23505 (unique violation) errors usually means a bad deploy is sending malformed or conflicting SQL. Roll back the deploy or fix the query; the cluster is behaving correctly by rejecting bad statements.
Does a node loss show up here?
It can, as connection errors (sessions on the lost node are dropped) and briefly as unavailability errors if any range lost quorum during the failover. If the error spike coincides with a Cluster Node Count drop, the node loss is the cause and the rate should recover as leases transfer and clients reconnect.
Why a 5-minute window rather than real-time?
To balance speed against noise. A single fluky failure should not page anyone, but a developing incident should be caught within minutes. The rolling 5-minute window smooths one-off errors while still surfacing a genuine climb quickly.
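The smoothing effect can be illustrated with a small rolling-window accumulator. This is a sketch under stated assumptions: the class name, the per-sample tuple layout, and the way old samples are evicted are hypothetical and do not describe how the card is implemented.

```python
from collections import deque


class RollingErrorRate:
    """Error rate % over the most recent window_seconds of samples."""

    def __init__(self, window_seconds=300, threshold_pct=1.0):
        self.window_seconds = window_seconds
        self.threshold_pct = threshold_pct
        self._samples = deque()  # (timestamp, executed, errored)

    def add(self, timestamp, executed, errored):
        """Record one sample and drop samples that have left the window."""
        self._samples.append((timestamp, executed, errored))
        cutoff = timestamp - self.window_seconds
        while self._samples and self._samples[0][0] <= cutoff:
            self._samples.popleft()

    def rate_pct(self):
        executed = sum(s[1] for s in self._samples)
        errored = sum(s[2] for s in self._samples)
        return 100.0 * errored / executed if executed else 0.0

    def breached(self):
        return self.rate_pct() > self.threshold_pct
```

Fed one sample per minute, a handful of errors per sample leaves the rate far below 1%, while a single minute with several hundred errors in 10,000 statements is enough to push the five-sample window over the threshold.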
Can I tune the 1% threshold for my cluster?
Yes, sensitivity thresholds are configurable per profile in the Sensitivity tab. If your workload runs a higher harmless-retry baseline (some high-contention workloads do), raise the threshold to match your normal so the alert reflects genuine deviation rather than your steady state.