At a glance
Failed Queries (24h) counts every query that ended in an exception over the trailing 24 hours: the rows insystem.query_logwheretype = 'ExceptionWhileProcessing'. Unlike a slow query, a failed query returned nothing useful to its caller, a dashboard tile broke, an ETL job errored, an API call 500’d. For a DBA, this is the rawest “what broke?” counter on the cluster. A low, flat number is normal background noise (a malformed ad-hoc query here and there); a spike to triple digits means something systemic is failing, and the> 100alert exists to catch exactly that.
| What it tracks | The count of failed query executions in the last 24 hours: system.query_log rows where type = 'ExceptionWhileProcessing'. |
| Data source | system.query_log, filtered on type = 'ExceptionWhileProcessing' over a rolling 24-hour window. The exception_code and exception columns provide the per-error breakdown behind the headline. |
| Time window | 24h: a rolling 24-hour count, so the number reflects the last full day of failures, not just the current minute. |
| Alert trigger | > 100. More than 100 failures in 24 hours is well above normal background noise for most clusters and indicates a systemic fault: a broken client, a bad deploy, an exhausted pool, or a capacity limit being hit repeatedly. |
| Roles | owner, engineering |
Calculation
The card runs a count againstsystem.query_log over the trailing day:
system.query_log records a row for each query lifecycle event. The type column distinguishes QueryStart, QueryFinish, ExceptionBeforeStart (the query failed to parse or plan, before execution), and ExceptionWhileProcessing (the query started executing and then threw). This card counts the latter, the failures that occurred during execution: out-of-memory kills, Too Many Parts halts, timeouts, divide-by-zero, type errors mid-stream, and similar. The per-error drill-down groups by exception_code (the numeric ClickHouse error code, for example 241 for MEMORY_LIMIT_EXCEEDED, 252 for TOO_MANY_PARTS) so you can see whether the spike is one repeated fault or many different ones.
Worked example
A data team powers an internal BI tool from ClickHouse. Failed Queries (24h) normally sits around 8 to 15 (the occasional analyst typo). On 09 May 26 the on-call sees the card at 214, well past the> 100 alert. They open the breakdown grouped by exception_code:
exception_code | Error name | Count | Share |
|---|---|---|---|
| 241 | MEMORY_LIMIT_EXCEEDED | 168 | 79% |
| 159 | TIMEOUT_EXCEEDED | 31 | 14% |
| 62 | SYNTAX_ERROR | 12 | 6% |
| 252 | TOO_MANY_PARTS | 3 | 1% |
max_memory_usage on every refresh. Each user who opens the dashboard triggers a fresh MEMORY_LIMIT_EXCEEDED, and 168 failures accumulated through the morning. The timeouts (159) are a secondary symptom: the same heavy query holds memory and slows neighbours until they time out. The syntax errors (62) are ordinary noise unrelated to the incident.
The fix is not “give the server more memory”. It is to fix the offending query: add a date filter, lower its max_memory_usage per-query override and let it spill, or pre-aggregate the cohort data into a materialised view. The DBA pairs this with MEMORY_LIMIT_EXCEEDED (24h) to confirm the 168 count and with Top 10 Slowest Queries to grab the exact SQL.
Three takeaways:
- The headline number is a smoke alarm; the
exception_codebreakdown is the fire. 214 failures could be 214 different problems or one problem 214 times. Always group by error code before you act. Here, 79% was a single fault. - One bad query can dominate the whole count. A single memory-hungry tile generated four-fifths of the failures. Reaching for infrastructure (more RAM) would have masked, not fixed, a query problem.
- Failed is categorically worse than slow. A slow query eventually returns; a failed query returns an error to a human or a downstream job. A spike here means users and pipelines are seeing broken results right now, which is why it carries a Hero, alert-bearing class.
Sibling cards
| Card | Why pair it with Failed Queries (24h) | What the combination tells you |
|---|---|---|
| Query Error Rate % | The rate view to this card’s absolute count. | High count but low rate equals high total query volume; high count and high rate equals a real reliability problem. |
| MEMORY_LIMIT_EXCEEDED (24h) | Isolates the single most common cause (error 241). | If this card’s spike is mostly 241, the two move together. |
| Too Many Parts Errors (24h) | Isolates error 252, the ingest-halting failure. | Failures dominated by 252 means inserts are broken, not reads. |
| Top 10 Slowest Queries | Grabs the SQL behind the failures. | Same fingerprint topping both lists equals one query causing both slowness and failures. |
| Query Latency p99 (ms) | The tail that often precedes timeouts. | p99 climbing then failures spiking equals queries degrading into timeouts. |
| ClickHouse Health Score | Failures are a 15% component of the composite. | Health Score down with failures up equals the failures are the dominant cause. |
| Memory Usage % | The capacity signal behind OOM-class failures. | Memory high plus failures high equals the instance is memory-bound. |
Reconciling against the source
Confirm the count directly against the server:system.query_log is queryable, and the service Monitoring tab surfaces error counts; reconcile against the SQL above for the authoritative figure.
Why our number may legitimately differ:
| Reason | Direction | Why |
|---|---|---|
ExceptionBeforeStart excluded | Our number lower | We count execution-time failures (ExceptionWhileProcessing). Parse/plan failures (ExceptionBeforeStart) are a separate type and not in this count by default. |
query_log flush latency | Our number briefly lower | system.query_log is buffered and flushed on an interval (default 7.5s); the most recent failures may not be written yet. |
| Log retention / TTL | Our number lower if TTL < 24h | If system.query_log has a TTL shorter than 24 hours, the oldest part of the window is already pruned. |
| Replica scope | Variable | On a cluster, system.query_log is per-node. A whole-cluster count needs clusterAllReplicas; the card reports the connected node unless configured otherwise. |
| Sampling instant | Marginal | The rolling 24h window slides; a hand-run query a minute later sees a slightly different set. |
Known limitations / FAQs
My count is high but the cluster feels fine. Is this a false alarm? Probably not, but check whether the failures are concentrated in one error code and one query fingerprint. A cluster can serve thousands of queries fine while one broken dashboard tile fails repeatedly. The card is surfacing real failures that specific users or jobs are seeing, even if aggregate health looks acceptable. Group byexception_code to confirm.
Does this count parse errors and bad SQL from analysts?
It counts execution-time exceptions (ExceptionWhileProcessing). A purely malformed query that fails to parse is logged as ExceptionBeforeStart and is not in this count. A query that parses but then errors mid-execution (type mismatch, divide by zero, OOM) is counted. So a handful of analyst typos that actually start executing will register; pure syntax rejections largely will not.
The number dropped to zero after I restarted the server. Did I fix it?
No. system.query_log is per-node and the rolling 24h window still holds the pre-restart failures unless retention pruned them, but if your query_log has a short TTL or the table was reset, the count can appear to drop. A restart does not fix the underlying cause; confirm by re-running the breakdown query and watching whether new failures resume.
How do I tell a transient spike from a sustained problem?
Look at the time distribution within the 24h window: GROUP BY toStartOfHour(event_time). A single bad hour that recovered is a transient (a deploy that was rolled back); a steady hourly rate is a live, ongoing fault that needs a fix now.
Why 100 as the threshold? My cluster never exceeds 20.
100 is a generic default. If your baseline is single digits, lower the threshold in the Sensitivity tab to something like 40 so a doubling of your normal rate alerts. A high-volume cluster running millions of queries a day may legitimately see more than 100 trivial failures and should raise the threshold and rely more on Query Error Rate % instead.
On a multi-node cluster, is this the whole cluster or one node?
By default it reflects the node Vortex IQ connects to, because system.query_log is local to each node. For a cluster-wide figure, the engine can be configured to query clusterAllReplicas(...). If your queries are routed across nodes, a single-node count understates total failures.
The failures are all timeouts (error 159). Is that a failure or a slowness problem?
Both. A TIMEOUT_EXCEEDED is a query that ran too long and was killed, so it counts as a failure here, but the root cause is performance, not correctness. Pair with Query Latency p99 (ms) and Top 10 Slowest Queries to find and optimise the offending queries rather than just raising the timeout.