Query Error Rate %, MySQL - Vortex IQ Help Centre

Card class: Hero • Category: Errors

At a glance

Query Error Rate % is the share of statements the server attempted that ended in an error rather than a clean result, evaluated over a 5-minute window. It is the single most direct “is something broken?” signal for a MySQL instance. A healthy production database sits at or very near zero; even 1% means one statement in a hundred is failing, which at storefront volumes is hundreds of failed operations a minute (a failed checkout, a dropped cart write, a 500 served to a shopper). Because the failure is binary and customer-visible, this is a Hero sensitivity card with a low, deliberate alert threshold.


What it tracks	The percentage of attempted statements that returned an error over the selected period.
Data source	Error counters from `SHOW GLOBAL STATUS`, principally the aborted/error families, expressed as a ratio against `Questions` (total attempted statements) over the same window.
Time window	`5m` (5-minute evaluation window; the rate is computed from counter deltas across the window, not a single instantaneous read).
Alert trigger	`> 1%`. Sustained error rate above 1% pages the on-call; for many OLTP workloads even a brief breach above 0.1% is worth a look.
Aggregation	Windowed ratio. Numerator is the error-count delta over the window; denominator is the `Questions` delta over the same window.
Units	Percentage (0 to 100). The card also exposes the raw error count so you can see absolute volume, not just the ratio.
Roles	owner, engineering, operations

Calculation

The card computes a windowed ratio of failed statements to attempted statements:

Query Error Rate % = (error_count delta over 5m / Questions delta over 5m) * 100

Both halves are drawn from cumulative SHOW GLOBAL STATUS counters and turned into a rate by taking the delta across the 5-minute window. The denominator is Questions, the same attempt-counting basis the Queries per Second (live) card uses, which keeps the two cards consistent: every attempted statement that the QPS card counts is eligible to appear in this card’s denominator.

The numerator aggregates the server’s error and abort counters. MySQL does not expose a single “total query errors” status variable, so the rate is built from the relevant error-family counters that the server does expose, including the connection-abort counters (Aborted_connects, Aborted_clients) and the access-denied and error-handler counters surfaced through performance_schema where available. On MySQL 8.0 the richest source is performance_schema.events_errors_summary_global_by_error, which records a SUM_ERROR_RAISED per error code; the card can roll that up into a total when Performance Schema error instrumentation is enabled.

The 5-minute window matters for two reasons. First, it smooths out single-statement blips (one bad ad-hoc query from an analyst should not page the on-call). Second, it makes the rate meaningful at low volume: on a quiet instance a single error in a 5-second window would read as a huge percentage, whereas across 5 minutes it is correctly diluted by total volume. Sustained breach over the window is the alert condition, not a momentary spike.
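The windowed ratio can be sketched in Python as follows. This is a minimal illustration, not the card's actual implementation: `CounterSnapshot` and `error_rate_pct` are hypothetical names, the snapshot values are made up, and returning 0.0 when no statements were attempted is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSnapshot:
    """Cumulative counters read at one instant (hypothetical container)."""
    errors: int     # summed error counters, however the caller aggregates them
    questions: int  # value of the Questions status variable


def error_rate_pct(start: CounterSnapshot, end: CounterSnapshot) -> float:
    """Return the error rate over the window as a percentage."""
    error_delta = end.errors - start.errors
    questions_delta = end.questions - start.questions
    if questions_delta <= 0:
        # Assumption: with no attempted statements there is no meaningful ratio.
        return 0.0
    return error_delta / questions_delta * 100


# Illustrative snapshots: 998,000 statements and 18,400 errors in the window.
start = CounterSnapshot(errors=42, questions=5_000_000)
end = CounterSnapshot(errors=18_442, questions=5_998_000)
print(f"{error_rate_pct(start, end):.2f}%")
```

Working from deltas rather than absolute totals is what lets cumulative counters be read as a rate for one specific window.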

Worked example

A platform team runs a MySQL 8.0 primary behind the checkout and order services for a retailer. Baseline error rate is effectively 0.00% (a handful of errors a day from ad-hoc analyst queries). Snapshot taken on 16 Apr 26 from 13:00 BST, shortly after a schema migration was deployed.

Window (5m)	Questions delta	Error delta	Error Rate %	State
12:50 to 12:55	1,020,000	12	0.001%	Healthy
12:55 to 13:00	1,015,000	30	0.003%	Healthy
13:00 to 13:05	998,000	18,400	1.84%	Alert
13:05 to 13:10	1,002,000	19,100	1.91%	Alert sustained
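The states in the table can be reproduced with a short Python sketch. The `classify` function and its "two consecutive breaches means sustained" rule are assumptions made for illustration; the card's own state logic is not specified here.

```python
THRESHOLD_PCT = 1.0  # alert trigger from the card: > 1%

# (window label, Questions delta, error delta) taken from the table above
windows = [
    ("12:50 to 12:55", 1_020_000, 12),
    ("12:55 to 13:00", 1_015_000, 30),
    ("13:00 to 13:05", 998_000, 18_400),
    ("13:05 to 13:10", 1_002_000, 19_100),
]


def classify(rows, threshold_pct=THRESHOLD_PCT):
    """Return (label, rate_pct, state) for each window, in order."""
    results = []
    previous_breached = False
    for label, questions_delta, error_delta in rows:
        rate = error_delta / questions_delta * 100 if questions_delta else 0.0
        breached = rate > threshold_pct
        if breached and previous_breached:
            state = "Alert sustained"  # assumption: second breach in a row
        elif breached:
            state = "Alert"
        else:
            state = "Healthy"
        results.append((label, rate, state))
        previous_breached = breached
    return results


results = classify(windows)
for label, rate, state in results:
    print(f"{label}  {rate:.3f}%  {state}")
```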

At 13:05 the rate crosses 1% and the card fires. The DBA pulls the error breakdown from Performance Schema:

SELECT ERROR_NUMBER, ERROR_NAME, SUM_ERROR_RAISED
FROM performance_schema.events_errors_summary_global_by_error
WHERE SUM_ERROR_RAISED > 0
ORDER BY SUM_ERROR_RAISED DESC
LIMIT 5;

ERROR_NUMBER  ERROR_NAME           SUM_ERROR_RAISED
1054          ER_BAD_FIELD_ERROR   18,900   <- "Unknown column"
1146          ER_NO_SUCH_TABLE     120
1213          ER_LOCK_DEADLOCK     80
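Because the summary table is cumulative, isolating one incident window means diffing two snapshots of it. A minimal Python sketch, assuming each snapshot has already been loaded into a dict of error name to SUM_ERROR_RAISED; the `error_deltas` helper is hypothetical, and the snapshot values are invented so that the deltas match the output above.

```python
def error_deltas(before: dict[str, int], after: dict[str, int]) -> list[tuple[str, int]]:
    """Return (error_name, delta) pairs for errors raised between two snapshots,
    largest contributor first."""
    deltas = {name: after[name] - before.get(name, 0) for name in after}
    return sorted(
        ((name, delta) for name, delta in deltas.items() if delta > 0),
        key=lambda item: item[1],
        reverse=True,
    )


# Hypothetical cumulative totals at the start and end of the incident window.
before = {"ER_BAD_FIELD_ERROR": 3, "ER_NO_SUCH_TABLE": 40, "ER_LOCK_DEADLOCK": 500}
after = {"ER_BAD_FIELD_ERROR": 18_903, "ER_NO_SUCH_TABLE": 160, "ER_LOCK_DEADLOCK": 580}

for name, delta in error_deltas(before, after):
    print(f"{name:<22}{delta:>8,}")
```

Note how ER_LOCK_DEADLOCK has the second-largest cumulative total in this invented data but the smallest delta: absolute totals can point at the wrong culprit.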

The dominant error is 1054 ER_BAD_FIELD_ERROR (“Unknown column”). The 13:00 migration renamed a column the application’s order-write path still references by its old name. Every checkout that reaches that write fails. The corrective path:

Confirm customer impact. Cross-check Slow Queries During Checkout Window (5m) and the storefront’s own 5xx rate. Failed order writes mean lost sales, so this is a revenue incident, not just a database one.
Roll back the breaking change, not the data. A column rename can usually be made backward-compatible: restore the old column name (a generated/aliased column under the old name covers reads, but MySQL rejects explicit values written to a generated column, so it will not rescue a write path on its own), or roll the application to the version that uses the new name. Rolling back the schema is safer than rolling back order data.
Hold the alert open until the rate returns to baseline. A migration fix can take minutes to deploy; the card should stay red until error rate is back near 0.00% across a full window.

Cost framing while the error is live:
  - Checkout write failure rate: ~1.9% of all statements
  - Of those, the order-insert path is the customer-facing slice
  - Estimated failed checkouts: ~40/min during the window
  - At an average order value of 58 GBP: ~2,320 GBP/min exposed
  - 10-minute incident before rollback: ~23,000 GBP at risk
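The exposure arithmetic above is simple enough to check in a few lines of Python. The inputs are the estimates from the list; the variable names are illustrative only.

```python
failed_checkouts_per_min = 40   # estimated failed checkouts per minute
average_order_value_gbp = 58    # average order value in GBP
incident_minutes = 10           # time from first failure to rollback

exposed_per_min_gbp = failed_checkouts_per_min * average_order_value_gbp
exposed_total_gbp = exposed_per_min_gbp * incident_minutes

print(f"Exposed per minute: {exposed_per_min_gbp:,} GBP")
print(f"Exposed over incident: {exposed_total_gbp:,} GBP")
```

The incident total comes to 23,200 GBP, which the list rounds to ~23,000 GBP.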

Three takeaways:

Even 1% is a lot. At a million statements per 5 minutes, 1% is ~10,000 failures. Error rate is one of the few database metrics where the threshold sits far below “feels broken”.
The error code is the diagnosis. The percentage tells you something is wrong; events_errors_summary_global_by_error tells you what. Always pull the breakdown before guessing.
A jump right after a deploy is a deploy bug until proven otherwise. Schema renames, removed columns, and changed grants are the usual culprits. Correlate the spike’s start time with your deploy log.

Sibling cards

Card	Why pair it with Query Error Rate %	What the combination tells you
Query Error Rate Spike (>1% in 5m)	The alert-list card that fires off this exact metric.	The gauge shows the level; the alert card shows when it breached and for how long.
Queries per Second (live)	The denominator behind the ratio.	A flat error rate with rising QPS means absolute failures are climbing; check the raw count.
Connection Errors (24h)	Connection-level failures vs statement-level failures.	If errors are mostly connection aborts, the cause is networking or auth, not bad SQL.
Aborted Connects (24h)	A specific error family feeding the rate.	A spike here driving the error rate points at credentials, network, or `max_connect_errors`.
InnoDB Deadlocks (last 5m)	Deadlocks surface as error 1213.	A deadlock storm shows up as both a deadlock count and a contribution to the error rate.
Slow Queries During Checkout Window (5m)	The revenue-path view during an error event.	Errors plus slow checkout queries together size the customer impact.
MySQL Health Score	The composite that weights error rate heavily.	A sustained error-rate breach is one of the fastest ways to drop the health score.
Query Latency p95 (ms)	Distinguishes “failing fast” from “failing slow”.	Errors with high latency mean timeouts; errors with low latency mean immediate rejections (bad SQL, denied grants).

Reconciling against the source

Where to look on the instance:

SELECT * FROM performance_schema.events_errors_summary_global_by_error WHERE SUM_ERROR_RAISED > 0 ORDER BY SUM_ERROR_RAISED DESC; for the authoritative per-error-code breakdown (MySQL 8.0).
SHOW GLOBAL STATUS LIKE 'Aborted%'; for connection and client abort counters.
SHOW GLOBAL STATUS LIKE 'Questions'; for the denominator.
The server error log (log_error location) for the actual error text and the statements that triggered it.

To reproduce the card’s rate over a window, capture the error and Questions counters at the start and end of the period and divide the deltas. Performance Schema error summaries are cumulative since the last TRUNCATE of the table or server restart, so use deltas, not absolute totals. On a managed service:

Service	Where to confirm
Amazon RDS / Aurora	There is no single “error rate” CloudWatch metric; use the `Aborted_clients` and `Aborted_connects` enhanced-monitoring counters, and enable the error log export to CloudWatch Logs to see the actual error codes. Performance Insights does not surface error rate directly.
Google Cloud SQL	Inspect the MySQL error log via Cloud Logging; the `database/mysql/innodb/...` metrics cover deadlocks but not a blanket error rate.
Azure Database for MySQL	The `aborted_connections` metric in Azure Monitor; error codes via the server logs.

Why our number may legitimately differ from a native reading:

Reason	Direction	Why
Performance Schema error instrumentation disabled	Card lower	If `performance_schema` error instruments are off, the card falls back to the narrower abort counters and undercounts statement-level errors.
Counter reset	Card temporarily off	A server restart or a `TRUNCATE` of the error summary table resets the cumulative base; the first window after that is computed from a low base.
What counts as an “error”	Either way	Warnings are not errors. A statement that completes with a warning (truncated value, implicit conversion) does not count here, though some native dashboards lump warnings and errors together.
Window alignment	Marginal	The card uses a rolling 5-minute window; a console aggregating per calendar minute will draw period boundaries differently.
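When reproducing the rate by hand, the counter-reset case deserves an explicit guard. The Python sketch below treats a counter that went backwards as a reset and declines to report a rate for that window. This is one possible handling chosen for illustration, not a description of how the card behaves; `safe_delta` and `windowed_rate_pct` are hypothetical names.

```python
from typing import Optional


def safe_delta(start: int, end: int) -> Optional[int]:
    """Return end - start, or None if the cumulative counter appears to have reset."""
    if end < start:
        return None  # assumption: a decrease means a restart or TRUNCATE
    return end - start


def windowed_rate_pct(
    errors_start: int,
    errors_end: int,
    questions_start: int,
    questions_end: int,
) -> Optional[float]:
    """Return the window's error rate in percent, or None if it cannot be trusted."""
    error_delta = safe_delta(errors_start, errors_end)
    questions_delta = safe_delta(questions_start, questions_end)
    if error_delta is None or questions_delta is None or questions_delta == 0:
        return None
    return error_delta / questions_delta * 100


# Normal window: 50 errors across 10,000 statements.
print(windowed_rate_pct(0, 50, 0, 10_000))
# Window spanning a restart: both counters fell, so no rate is reported.
print(windowed_rate_pct(100, 5, 1_000_000, 2_000))
```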

Known limitations / FAQs

My error rate is 0.00% but customers report failed checkouts. How?
The failure may not be reaching the database as an error. If the application times out before MySQL responds, or a connection-pool exhaustion event rejects the client before a statement is even sent, the customer sees a failure but the database records no statement error. Check Connection Pool Saturation % and Connection Errors (24h); a failure that never became a query will not show here.

Why is the threshold as low as 1%?
Because at production volume 1% is enormous. A storefront primary handling a million statements per 5-minute window has ten thousand failures at 1%. Most of those map to customer-facing operations, so the threshold is set where the business impact is already material. For critical OLTP paths, consider tightening the sensitivity below 1% in the Sensitivity tab.

What error codes are the most common contributors?
In practice: 1213 ER_LOCK_DEADLOCK (contention), 1205 ER_LOCK_WAIT_TIMEOUT (lock waits), 1054 ER_BAD_FIELD_ERROR and 1146 ER_NO_SUCH_TABLE (schema drift after a deploy), 1062 ER_DUP_ENTRY (unique-key violations), and 1040 ER_CON_COUNT_ERROR (too many connections). The breakdown query in the reconcile section gives you the exact mix for your incident.

Do deadlocks count as errors here?
Yes. A deadlock returns error 1213 to the loser of the deadlock, so it increments the error count and contributes to this rate. That is why a deadlock storm shows up on both this card and InnoDB Deadlocks (last 5m). The deadlock card isolates that specific cause; this card shows its weight against total volume.

The rate spiked then returned to zero on its own. Should I still investigate?
Usually yes, briefly. A self-resolving spike often means a transient cause (a deploy that auto-rolled back, a lock contention burst that cleared, a single bad batch job that finished). Pull the error breakdown for the spike window to confirm the cause was transient and not the leading edge of a recurring problem. A spike that recurs on a schedule (every hour, every nightly batch) is a structural issue, not a blip.

Does a warning count as an error?
No. MySQL distinguishes errors (the statement failed) from warnings (the statement completed but something was off, such as a truncated value or an implicit type conversion). This card counts only errors. If you want to track warnings, that is a separate signal; a high warning rate often precedes data-quality problems but is not an availability issue.

My instance has Performance Schema disabled. Does the card still work?
Partially. With Performance Schema error instrumentation off, the card cannot read the per-error-code summary and falls back to the abort counters from SHOW GLOBAL STATUS, which capture connection and client errors but miss many statement-level errors. The number will be lower and less precise. Enabling performance_schema (and the error instruments) gives the card its full fidelity; on managed services it is usually on by default.

Tracked live in Vortex IQ Nerve Centre

Query Error Rate % is one of hundreds of KPI pulses Vortex IQ tracks across MySQL and 70+ other ecommerce connectors. Nerve Centre runs the detection layer; Vortex Mind investigates the cause when something moves; Ask Viq lets you interrogate any number in plain English. Start for free or book a demo to see this metric running on your own data.
