At a glance
This alert fires when more than 1% of statements return an error over a rolling 5-minute window. A healthy MySQL instance sits well under 0.1%: the odd duplicate-key collision or deadlock retry. When the error rate jumps past 1% and holds, something structural has broken: a missing table after a bad migration, a deadlock storm, a column that no longer exists, or the server refusing connections. This is the hero card that catches a regression in the seconds and minutes after a deploy, long before the slow drip of customer complaints arrives.
| Field | Detail |
|---|---|
| Status source | Errored statements as a fraction of total statements. Derived from performance_schema.events_statements_summary_global_by_event_name (SUM_ERRORS over COUNT_STAR), or the Com_* and error-counter deltas where Performance Schema statement instrumentation is unavailable. |
| Metric basis | error_rate = errored_statements / total_statements over the trailing 5 minutes. Counts any statement that returned a non-zero error code to the client. |
| Aggregation window | Rolling 5 minutes, evaluated continuously. The alert requires the rate to stay above 1% for the full sustained window so a single bad query does not page anyone. |
| Alert threshold | > 1% sustained for 5 minutes. At or below this the card is informational; once the rate has stayed above it for the full window, the card raises a hero alert into the Nerve Centre feed. |
| What counts as an error | Any statement returning an error to the client: syntax errors, missing table/column, access denied, deadlock victims, lock-wait timeouts, constraint violations, and connection-stage refusals reaching the statement layer. |
| What does NOT count | (1) Warnings (a truncated value, an implicit cast) that do not fail the statement; (2) successful statements that returned zero rows; (3) slow-but-successful queries (that is the slow-query card’s job); (4) client-side timeouts where the server completed the statement. |
| Common causes | A migration that dropped or renamed an object the app still references; a deadlock storm from a hot-row contention pattern; expired or revoked grants; a full disk turning writes into errors. |
| Time zone | Ratio is time-zone independent; chart axes render in the merchant display time zone set in the Vortex IQ profile. |
| Time window | 5m (rolling, sustained evaluation). |
| Alert trigger | > 1% error rate sustained for 5 minutes. |
| Roles | dba, platform, sre, owner |
Calculation
The engine computes, over the trailing 5-minute window:
error_rate = errored_statements / total_statements
Both counts are taken as deltas of performance_schema.events_statements_summary_global_by_event_name (summing SUM_ERRORS and COUNT_STAR across the statement/sql/% event families) between the window’s start and end samples. Where that instrumentation is off, the engine falls back to global error counters (Com_* totals for the denominator and the sum of relevant error counters for the numerator), which is slightly coarser but tracks the same signal.
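The delta-based calculation can be sketched in a few lines of Python. This is a minimal illustration, not the engine's actual code: the function names and the per-family split of the sample values are assumptions, chosen so that the window totals match the 14:32 row of the worked example.

```python
from typing import Iterable, Optional, Tuple

# One row per event family: (EVENT_NAME, COUNT_STAR, SUM_ERRORS).
Row = Tuple[str, int, int]

def totals(rows: Iterable[Row]) -> Tuple[int, int]:
    """Sum COUNT_STAR and SUM_ERRORS across the statement/sql/% families."""
    total = errors = 0
    for event_name, count_star, sum_errors in rows:
        if event_name.startswith("statement/sql/"):
            total += count_star
            errors += sum_errors
    return total, errors

def window_error_rate(start_rows: Iterable[Row],
                      end_rows: Iterable[Row]) -> Optional[float]:
    """Error rate as a fraction between two cumulative samples.

    Returns None when the window is idle or the counters went backwards.
    """
    start_total, start_errors = totals(start_rows)
    end_total, end_errors = totals(end_rows)
    delta_total = end_total - start_total
    delta_errors = end_errors - start_errors
    if delta_total <= 0 or delta_errors < 0:
        return None
    return delta_errors / delta_total

# Illustrative cumulative samples five minutes apart (values are made up).
start = [("statement/sql/select", 1_000_000, 500),
         ("statement/sql/insert", 200_000, 100),
         ("statement/com/Ping", 50_000, 0)]      # ignored: not statement/sql/%
end = [("statement/sql/select", 1_150_000, 4_500),
       ("statement/sql/insert", 228_400, 220),
       ("statement/com/Ping", 51_000, 0)]

rate = window_error_rate(start, end)
print(f"{rate:.2%}")  # 4,120 errors over 178,400 statements -> 2.31%
```

Working from deltas of cumulative counters rather than the raw totals is what makes the figure a 5-minute rate instead of a since-startup average.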
The alert is stateful and sustained: the rate must exceed 1% across the rolling window for the full duration before it fires, which filters out a one-off bad statement. It clears when the rate falls back under 1%. The card also surfaces the dominant error code over the window (for example, “1062 duplicate entry” or “1213 deadlock”) so the first glance tells you the kind of failure, not just the volume.
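One way to picture the sustained behaviour is as a small state machine. The sketch below is an assumption about how such logic could look, not the card's implementation: the class name, the `observe` method, and the interpretation of "sustained" as "every evaluation since the first breach has stayed above the threshold for at least the sustain period" are all hypothetical.

```python
from typing import Dict, Optional

class SustainedErrorAlert:
    """Fires only after the rate has stayed above the threshold long enough."""

    def __init__(self, threshold: float = 0.01, sustain_seconds: float = 300):
        self.threshold = threshold
        self.sustain_seconds = sustain_seconds
        self.breach_started_at: Optional[float] = None
        self.firing = False

    def observe(self, timestamp: float, rate: float) -> bool:
        """Feed one evaluation of the rolling rate; return the alert state."""
        if rate > self.threshold:
            if self.breach_started_at is None:
                self.breach_started_at = timestamp  # first breaching sample
            if timestamp - self.breach_started_at >= self.sustain_seconds:
                self.firing = True
        else:
            # Any sample at or below the threshold resets the breach and clears.
            self.breach_started_at = None
            self.firing = False
        return self.firing

def dominant_code(errors_by_code: Dict[int, int]) -> Optional[int]:
    """Return the error number with the highest count in the window."""
    if not errors_by_code:
        return None
    return max(errors_by_code, key=errors_by_code.get)

alert = SustainedErrorAlert()
states = [alert.observe(t, 0.023) for t in range(0, 301, 60)]
print(states)                                  # fires only on the last sample
print(dominant_code({1213: 40, 1054: 4080}))   # 1054
```

A single bad sample followed by a healthy one never fires, because the breach timer resets as soon as the rate drops back to the threshold or below.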
Worked example
A platform team ships a schema migration to a MySQL 8.0 primary at 14:30 on 02 Jun 26. The deploy renames a column from customer_ref to customer_id but one application service is still on the old build referencing the old name. Snapshot taken minutes later.
| Sample time | Statements (5m) | Errored | Error rate | Dominant code |
|---|---|---|---|---|
| 14:25 | 184,200 | 92 | 0.05% | 1213 (deadlock) |
| 14:30 | 181,900 | 110 | 0.06% | 1213 (deadlock) |
| 14:32 | 178,400 | 4,120 | 2.31% | 1054 (unknown column) |
| 14:34 | 176,800 | 5,540 | 3.13% | 1054 (unknown column) |
| 14:35 | 175,100 | 5,980 | 3.42% | 1054 (unknown column) |
- The baseline tells you what is normal. Pre-deploy the rate sat at 0.05%, almost entirely deadlock retries (1213), which are self-healing and expected. The jump is a different error class (1054), not more of the same, which immediately rules out a load problem and points at a schema mismatch.
- The timing pins the cause. The error class changed at 14:32, two minutes after the 14:30 migration. A new error code appearing right after a deploy is the textbook regression signature.
- The fix is a rollback decision, not a query tune. Either roll the lagging service forward to the new build (which references customer_id), or roll the migration back if the column rename cannot wait. Because 1054 fails the statement outright, every affected request is erroring, so this is a revenue-impacting incident while it persists.
- The error class matters more than the count. A rise in deadlocks (1213) is a contention problem you tune; a rise in 1054/1146 (missing column/table) is a schema regression you roll back. The dominant-code label is there so you act on the right one.
- Tie the spike to the timeline. A new error code appearing within minutes of a deploy is a regression until proven otherwise. Always check the change log against the moment the rate crossed 1%.
- Some errors are healthy in small doses. Deadlock victims (1213) and lock-wait timeouts (1205) are normal at low rates because the app retries them. The alert’s job is to catch the abnormal spike, which is why the threshold sits at 1% rather than zero.
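The triage rules in the points above can be written down as a small lookup. This is a hedged sketch: the `triage` function, the grouping of codes into sets, and the 10-minute deploy window are illustrative assumptions rather than anything the card exposes.

```python
from datetime import datetime, timedelta
from typing import Optional

CONTENTION_CODES = {1213, 1205}  # deadlock victim, lock-wait timeout
SCHEMA_CODES = {1054, 1146}      # unknown column, table doesn't exist

def triage(dominant_code: int,
           breach_time: datetime,
           last_deploy_time: Optional[datetime],
           deploy_window: timedelta = timedelta(minutes=10)) -> str:
    """Map the dominant error code and deploy timing to a first action."""
    if dominant_code in SCHEMA_CODES:
        action = "schema regression: roll the service forward or the migration back"
    elif dominant_code in CONTENTION_CODES:
        action = "contention: tune access order or indexing"
    else:
        action = "unclassified: inspect the error log"

    # A breach shortly after a deploy is treated as a regression suspect.
    if last_deploy_time is not None:
        gap = breach_time - last_deploy_time
        if timedelta(0) <= gap <= deploy_window:
            action += " (within deploy window: check the change log)"
    return action

# Worked example: 1054 becomes dominant at 14:32, two minutes after the deploy.
print(triage(1054,
             datetime(2026, 6, 2, 14, 32),
             datetime(2026, 6, 2, 14, 30)))
```

Keeping the code classes in plain sets makes it easy to extend the mapping to whatever error numbers dominate in your own baseline.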
Sibling cards
| Card | Why pair it with this alert | What the combination tells you |
|---|---|---|
| Query Error Rate % | The continuous gauge this alert is built on. | The gauge shows the trend and baseline; this card is the threshold breach with the dominant code. |
| InnoDB Deadlocks (last 5m) | Isolates the 1213 component of the error rate. | If the spike is deadlock-driven, this card spikes in lockstep; if not, the cause is elsewhere. |
| Connection Errors (24h) | Connection-stage failures that can reach the statement layer. | Distinguishes “queries failing” from “clients cannot connect at all”. |
| Slow-Query Rate % | The other regression signal after a deploy. | A deploy can cause errors or slowness; check both to characterise the regression. |
| Queries per Second (live) | The denominator context. | A 1% rate on huge volume is many more failures than 1% on light traffic; QPS sizes the real impact. |
| Database Disk Usage % | A full disk turns writes into errors. | If errors are write failures, disk usage near 100% is the likely root cause. |
| MySQL Health Score | The composite this alert dominates. | An active error spike drops the health score sharply because it signals broken functionality. |
| Slow Queries During Checkout Window (5m) | The cross-channel revenue framing. | Confirms whether the error spike overlaps the storefront’s revenue-critical path. |
Reconciling against the source
Where to look in MySQL itself:
- SELECT EVENT_NAME, COUNT_STAR, SUM_ERRORS FROM performance_schema.events_statements_summary_global_by_event_name WHERE COUNT_STAR > 0 ORDER BY SUM_ERRORS DESC; for errored statements by type.
- SELECT * FROM performance_schema.events_errors_summary_global_by_error ORDER BY SUM_ERROR_RAISED DESC; (MySQL 8.0) for a breakdown by exact error number.
- SHOW GLOBAL STATUS LIKE 'Com_%'; for statement counts, and the InnoDB-specific counters (SHOW ENGINE INNODB STATUS\G) for the latest deadlock.
- The error log (log_error) for the verbose text of access-denied, missing-object, and disk-full errors as they happen.
Why our number may legitimately differ from a raw query:
| Reason | Direction | Why |
|---|---|---|
| Window alignment | Variable | The card uses a rolling 5-minute window; a manual query over a different range will not match exactly. |
| Instrumentation scope | Card may be lower | If performance_schema statement instruments are partially disabled, some statement families are not counted in either numerator or denominator. |
| Warnings excluded | Card lower | Statements that raised only a warning (not an error) are not counted as errors here, but a naive SHOW WARNINGS-based view might include them. |
| Counter reset | Card rebaselines | A server restart or a performance_schema truncate zeroes the summaries; the card rebaselines rather than reporting a negative delta. |
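The counter-reset row is worth illustrating, since it explains why the card can briefly show no value after a restart. The sketch below assumes a hypothetical `CounterBaseline` helper that tracks cumulative (total, errors) samples; it is not the card's actual code.

```python
from typing import Optional, Tuple

class CounterBaseline:
    """Turns cumulative counters into per-interval deltas, surviving resets."""

    def __init__(self) -> None:
        self._last: Optional[Tuple[int, int]] = None

    def delta(self, total: int, errors: int) -> Optional[Tuple[int, int]]:
        """Return (statements, errors) since the previous sample.

        Returns None for the first sample and whenever the counters have gone
        backwards (restart or truncate); the new sample becomes the baseline.
        """
        previous = self._last
        self._last = (total, errors)
        if previous is None:
            return None
        d_total = total - previous[0]
        d_errors = errors - previous[1]
        if d_total < 0 or d_errors < 0:
            return None  # reset detected: rebaseline instead of going negative
        return d_total, d_errors

baseline = CounterBaseline()
print(baseline.delta(1_000, 5))   # None (first sample, nothing to compare)
print(baseline.delta(1_500, 8))   # (500, 3)
print(baseline.delta(200, 1))     # None (counters reset, rebaselined)
print(baseline.delta(700, 2))     # (500, 1)
```

A manual query run straight after a restart sees the small post-reset totals, so it will not match a card that has deliberately skipped that interval.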
Known limitations / FAQs
Why 1% and not zero? Surely any error is bad.
A small constant rate of errors is normal and healthy in a busy OLTP system: deadlock victims that the application retries (1213), occasional duplicate-key collisions on idempotent upserts (1062), and the odd lock-wait timeout (1205). These self-heal. Alerting at zero would page on noise constantly. The 1% sustained threshold catches the abnormal spike that signals a real regression while ignoring the healthy background rate.
The card says the dominant code is 1213 (deadlock). Is that a real problem?
At a low rate, no: deadlocks are an expected part of concurrent transactions and the application should retry the victim. But if 1213 is the dominant code and the rate has crossed 1%, you have a contention hotspot: many transactions fighting over the same rows in conflicting orders. Cross-reference InnoDB Deadlocks (last 5m) and use SHOW ENGINE INNODB STATUS to see the exact statements involved, then fix the access order or add an index to shrink the locked range.
My error rate spiked but customers did not notice. How?
The errored statements may be on a non-customer-facing path: a background reconciliation job, a reporting query, or a retry that succeeded on the second attempt. The card counts statement-level errors regardless of whether the application masked them. Look at the dominant code and which queries it maps to; if they are all from a batch worker, the customer impact is low even though the rate is high.
A deploy went out and the rate spiked with code 1146 (table doesn’t exist). What happened?
A migration almost certainly dropped or renamed a table that some application code still references, or the migration ran on the replica but not the primary (or vice versa). 1146 and 1054 (unknown column) are the signature of a schema/code mismatch. Check that every service is on the build that matches the new schema, and confirm the migration applied everywhere it should have.
Could a full disk show up here?
Yes. When the data volume fills, InnoDB cannot extend tablespaces and write statements start failing (often error 1114 “table is full” or generic write errors). The error rate spikes and is dominated by write codes. Always cross-reference Database Disk Usage %; if it is near 100%, free space or extend the volume before anything else.
Does a client-side timeout count as an error here?
No. If the client gives up and closes the connection while the server is still executing, the server-side statement may complete successfully and is not counted as an error by this card. That kind of failure shows up as an aborted client and as slow-query pressure, not as a query error. Use the slow-query and latency cards to catch it.
Can I tune the threshold or sustain window?
Yes, both are configurable per profile in the Sensitivity tab. A system with an unusually chatty retry pattern may want a slightly higher threshold; a low-volume system where every error matters may want a shorter sustain window so it fires faster. Tune to your own baseline error rate.