At a glance
The alert feed for moments when more than 1% of queries fail over a rolling five-minute window. ClickHouse logs every query insystem.query_log, and a sudden rise inExceptionWhileProcessingrows means something has broken: a bad deploy shipping malformed SQL, memory limits killing heavy queries, a table gone read-only, or the parts cap halting ingest. A healthy instance sits well under 1%, so crossing that line for a sustained window is a real signal, not noise. This card lists each spike with its start time, peak error rate, and the dominant exception code so the DBA can go straight to the cause.
| Data source | system.query_log, comparing type = 'ExceptionWhileProcessing' rows against total finished queries over a rolling 5-minute window. Each sustained breach is recorded as one alert row with its top exception code. |
| What it tracks | Error-rate spike events, not a continuous gauge. The live percentage lives on Query Error Rate %; this card is the incident history of when that rate crossed 1%. |
| Why it matters | At analytical query volumes, 1% failure is a lot of broken queries: at 500 QPS that is 5 failed queries every second, every one a stalled dashboard, a dropped ingest batch, or a failed report. The exception code on each row points straight at the cause (code 241 memory, code 252 too many parts, code 60 unknown table, and so on). |
| Time window | 5m (rolling five-minute window; the rate must stay above 1% across the window to fire). |
| Alert trigger | >1% sustained 5m. The error-to-total ratio must hold above 1% for the full window, filtering out single-query blips. |
| Roles | dba, platform, sre |
Calculation
The detector computes the failure ratio over a rolling five-minute window directly fromsystem.query_log:
QueryFinish) plus those that threw (ExceptionWhileProcessing). QueryStart rows are excluded so in-flight queries do not distort the ratio. An alert row is created when error_rate_pct > 1 holds across the window.
The sustained-window requirement is what separates a real incident from background noise. Individual queries fail all the time for benign reasons (a user typos a column name, a one-off query exceeds a memory limit), and at high volume that produces a constant trickle of exceptions well under 1%. Requiring the rate to hold above 1% for a full five minutes means the failures are systemic: a deploy, a resource exhaustion, or a structural fault, not one unlucky query. Each alert row also captures the dominant exception_code so triage starts with the cause already identified.
Worked example
A platform team runs ClickHouse behind a reporting API and an ingest pipeline at roughly 480 QPS steady state. Snapshot of the alert feed on 18 Apr 26.| Started | Peak error rate | Top exception code | Likely cause |
|---|---|---|---|
| 18 Apr 26 13:22 BST | 4.8% | 241 (MEMORY_LIMIT_EXCEEDED) | A new dashboard shipped an un-aggregated full-table scan |
| 17 Apr 26 02:10 BST | 2.1% | 252 (TOO_MANY_PARTS) | Ingest hit the parts cap on events.clickstream_raw |
| 16 Apr 26 19:44 BST | 1.4% | 60 (UNKNOWN_TABLE) | A deploy referenced a table before its migration ran |
- The 13:22 event is a memory problem from a deploy. Code 241 (
MEMORY_LIMIT_EXCEEDED) at 4.8% means heavy queries are being killed bymax_memory_usage. The timing next to a release strongly suggests a new query that scans without aggregating. Cross-check MEMORY_LIMIT_EXCEEDED (24h) to confirm the count and find the query text. - The 02:10 event was an ingest break, not a query break. Code 252 (
TOO_MANY_PARTS) means inserts were rejected because a table hit the parts cap. Those rejected inserts show as errors here too. The fix lives in the ingest pipeline, traceable via Too Many Parts Errors (24h). - The 16 Apr event was a deploy-order bug. Code 60 (
UNKNOWN_TABLE) means application code referenced a table that did not exist yet, a classic migration-ran-after-deploy race. It self-resolved once the migration completed.
- The exception code is the diagnosis. Do not treat “error rate up” as a single problem. 241, 252, and 60 have completely different fixes. Always read the dominant code on the row first.
- 1% is a lot at analytical volume. At hundreds of QPS, 1% is several failed queries per second. The threshold is low on purpose because the blast radius is high.
- This card is the timeline; the per-code cards are the depth. Use this feed to spot when and what kind, then drill into MEMORY_LIMIT_EXCEEDED (24h), Too Many Parts Errors (24h), or Failed Queries (24h) for the detail.
Sibling cards
| Card | Why pair it with this alert | What the combination tells you |
|---|---|---|
| Query Error Rate % | The live gauge this alert watches. | The gauge shows current rate; this card shows when it spiked and why. |
| Failed Queries (24h) | The 24-hour absolute count of exceptions. | A spike here plus a high daily count equals a persistent fault, not a one-off. |
| MEMORY_LIMIT_EXCEEDED (24h) | The code-241 detail. | Spike with code 241 dominant equals heavy queries swamping memory. |
| Too Many Parts Errors (24h) | The code-252 detail. | Spike with code 252 dominant equals ingest break, not query break. |
| Slow-Query Rate % | Slow queries often precede memory kills. | Rising slow-query rate before an error spike equals queries degrading toward failure. |
| Queries per Second (live) | The denominator context. | A spike at low QPS is more alarming per-query than the same percentage at high QPS. |
| ClickHouse Health Score | The composite that weights error rate. | A sustained error spike drags the composite below 70. |
Reconciling against the source
Where to look in ClickHouse’s own tooling:Compute the same rate directly fromWhy our number may legitimately differ from a manual check:system.query_loginclickhouse-client:Decode any exception code withSELECT name, code FROM system.errors WHERE code = 241. On ClickHouse Cloud, the samesystem.query_logquery runs in the SQL console, and the managed monitoring view surfaces query error rate over time.
| Reason | Direction | Why |
|---|---|---|
| Window alignment | Either | The card uses a rolling 5-minute window; a manual query with a fixed now() - 5 MINUTE boundary samples a slightly different slice. |
| log_queries setting | Card lower if logging is sampled | If log_queries_probability < 1, system.query_log samples queries and both the card and a manual query undercount. Set it to 1 for accurate rates. |
| Denominator choice | Either | Including or excluding QueryStart rows changes the ratio. The card counts only finished and failed queries; a query that counts all type values will read lower. |
| Async inserts | Card may differ | Failures inside async-insert flushes are logged differently; the card accounts for them, a naive query may miss them. |
| Card | Expected relationship | What causes divergence |
|---|---|---|
| ClickHouse QPS Spike vs Ecom Order Rate | An error spike that coincides with a QPS spike but no order spike points to a dashboard storm or bot. | Errors rising while orders flow normally means the failures are query-side, not affecting the storefront pipeline. |
Known limitations / FAQs
The card fired but every query in the window looks like a user typo. Is it a false positive? Not necessarily, but read the exception codes. A burst of code 47 (UNKNOWN_IDENTIFIER) or code 62 (SYNTAX_ERROR) from a single user usually is a person fat-fingering queries in a console, and is harmless. The codes that matter are the systemic ones: 241 (memory), 252 (too many parts), 60 (unknown table), 159 (timeout). If those dominate, it is real. The card surfaces the top codes precisely so you can tell the difference at a glance.
Why 1%? My instance fails more than that all day on bad ad-hoc queries.
Then your baseline is unusually high and you should tune the threshold in the Sensitivity tab. 1% is the default because a well-behaved production instance, where applications send validated SQL, sits far below it. If interactive ad-hoc usage routinely pushes you over 1%, consider routing those users to a separate instance or profile so their typos do not mask real application-side faults.
The rate is over 1% but my QPS is tiny right now. Should I worry?
Read it carefully. At very low query volume the percentage is volatile: 1 failure out of 30 queries is over 3% but may just be one bad query. The sustained-5-minute gate helps, but at genuinely low volume always look at the absolute count too via Failed Queries (24h). The percentage is most meaningful at production query volumes.
My system.query_log is sampled, so the rate looks lower than reality. How do I fix it?
Set log_queries_probability = 1 (the default) so every query is logged. If a profile or query has lowered it for performance reasons, both this card and any manual query will undercount failures proportionally. Accurate error-rate measurement depends on a complete query log.
Does this include failed inserts and failed background merges?
Failed inserts that throw (including code 252 when the parts cap is hit) appear in system.query_log and are counted. Failed background merges are a different mechanism: they are logged in system.part_log and surfaced through the merge and parts cards, not here. This card is specifically the query-and-insert error rate.
On ClickHouse Cloud, is the query log available?
Yes. system.query_log is populated on ClickHouse Cloud the same way, and the rate query runs in the SQL console. The Cloud monitoring view also charts query error rate over time, which you can use to confirm the card’s spike windows against the managed service’s own readings.