At a glance
The count of queries killed in the last 24 hours because they tried to use more memory than their limit allowed. ClickHouse aborts a query the instant its tracked memory crossesmax_memory_usage(per-query) ormax_server_memory_usage(whole-server), raising error code 241,MEMORY_LIMIT_EXCEEDED. A non-zero reading is rarely “we are short on RAM”; it is almost always one heavy, unbounded query (aGROUP BYover high-cardinality keys, an under-filteredJOIN, or aDISTINCTon a huge column) trying to swallow the instance. The card is your early-warning that a single bad query is putting the whole server at risk.
| Data source | system.query_log over the last 24 hours, counting rows where the query terminated with exception code 241 (MEMORY_LIMIT_EXCEEDED). The card reads the failed-query exception code, not a server gauge. |
| Metric basis | Count of distinct query executions killed by a memory limit, not a count of bytes. One offending query run five times counts as five. |
| What triggers it | A query whose tracked memory crosses its effective ceiling: the per-query max_memory_usage, the per-user max_memory_usage_for_user, or the server-wide max_server_memory_usage. Whichever ceiling is hit first names the abort. |
| Aggregation window | Rolling 24 hours, recomputed on each refresh. |
| Alert threshold | > 0. Any memory-limit kill in 24 hours is worth a look, because it usually means a query pattern exists that can recur and that, if the server-wide ceiling is involved, can destabilise unrelated queries too. |
| What does NOT count | (1) Queries that succeeded but ran close to the limit (use Memory Usage % for headroom); (2) queries killed for other reasons (timeout, TOO_MANY_PARTS, user cancel); (3) memory pressure handled gracefully by spilling to disk (max_bytes_before_external_group_by / _external_sort); (4) OOM-killer kills at the OS level (those crash the process, not a single query, and show up as a restart). |
| Sensitivity note | This is a Sensitivity card: the > 0 default is intentionally strict so the first occurrence surfaces. Teams with known-heavy ad-hoc workloads may raise the threshold per profile. |
| Time window | 24h (rolling) |
| Alert trigger | > 0 memory-limit kills in the last 24 hours. |
| Roles | owner, platform, dba |
Calculation
The engine counts terminal query-log rows that carry the memory-limit exception code:type = 'ExceptionWhileProcessing' selects executions that started and then threw, which is where a memory kill lands (the query had begun consuming memory before it was aborted). exception_code = 241 is the stable numeric code for MEMORY_LIMIT_EXCEEDED; matching on the code rather than the exception text is robust across locales and version wording.
To investigate the which query behind a non-zero count, the engine can surface the offending statements by peak memory:
Worked example
A platform team runs ClickHouse as the backing store for an internal analytics product. The server has 64 GB RAM,max_server_memory_usage set to ~52 GB (0.8 ratio), and max_memory_usage left at the 10 GB per-query default. Snapshot taken on 22 May 26 at 16:40 UTC.
The card jumps from 0 to 14 over an hour. The team drills in:
| event_time (UTC) | user | duration | peak_mem | query head |
|---|---|---|---|---|
| 16:02 | bi_dashboard | 9.8s | 10.0 GB | SELECT user_id, groupArray(event_name) FROM events GROUP BY user_id |
| 16:09 | bi_dashboard | 9.7s | 10.0 GB | SELECT user_id, groupArray(event_name) FROM events GROUP BY user_id |
| … (12 more, same statement, all peaking at 10.0 GB) |
GROUP BY user_id with a groupArray over a high-cardinality key, run 14 times by a refreshing dashboard. Each run climbs straight to the 10 GB per-query ceiling and is killed at code 241.
What the team reads:
- This is one bad query, not a capacity problem. Server memory was never the constraint; the per-query
max_memory_usagewas. The 14 kills are 14 reruns of the same dashboard tile auto-refreshing every ~7 minutes. Throwing more RAM at the server would not help; the per-query ceiling would still bite. - The fix is the query, not the limit.
groupArray(event_name)over every distinctuser_idmaterialises an unbounded array per user. The correct fix is to bound it (groupArrayLast(50)(...), aLIMIT, or a pre-aggregated materialised view), or to enable spilling withmax_bytes_before_external_group_byso the aggregation goes to disk instead of dying. - Sensitivity caught it at the first occurrence. Because the threshold is
> 0, the card went red at 16:02 on the very first kill, before the dashboard refreshed 13 more times. The team could have killed the saved tile or capped the user before the pattern repeated.
max_bytes_before_external_group_by for that user so the GROUP BY spills to disk rather than being killed. The card returns to 0 once the offending statement stops hitting its ceiling.
Sibling cards platform teams should reference together
| Card | Why pair it with MEMORY_LIMIT_EXCEEDED | What the combination tells you |
|---|---|---|
| Memory Usage % | The live headroom gauge that precedes a kill. | High usage % plus rising memory-limit kills equals genuine server-wide pressure; low usage % plus kills equals a single query hitting the per-query ceiling. |
| Failed Queries (24h) | The superset: all query failures, not just memory kills. | If memory kills are most of your failed-query count, memory is your dominant failure mode. |
| Query Error Rate % | The rate view across all error types. | A spike in error rate that lines up with memory kills points the investigation straight at heavy queries. |
| Top 10 Slowest Queries | Heavy queries are usually slow too. | The query killed for memory often appears here just before it dies; same root cause. |
| Query Latency p99 (ms) | Memory pressure inflates tail latency. | p99 rising alongside memory kills means even surviving queries are paying the price. |
| ClickHouse Health Score | The composite that weights error signals. | Repeated memory kills drag the health score down even when ingest and replication are clean. |
| Slow-Query Rate % | The proportion of queries breaching 1s. | Heavy aggregations that die for memory are usually the same ones inflating the slow-query rate. |
Reconciling against the source
Where to look in ClickHouse itself:Why our number may legitimately differ from a manualsystem.query_logfor the authoritative per-execution record: filterexception_code = 241over your window and readquery,user,memory_usage, andexceptionfor the full failure text and the offending SQL.system.metrics(MemoryTracking) andsystem.asynchronous_metricsfor the live server-memory picture at the time of the kills, to distinguish per-query from server-wide pressure.SELECT * FROM system.settings WHERE name LIKE '%max_memory%'to read the effective per-query, per-user, and server ceilings that the killed queries were measured against. ClickHouse Cloud: the service’s query insights / monitoring view exposes the same memory-limit errors against the managed memory ceiling for the service tier.
query_log read:
| Reason | Direction | Why |
|---|---|---|
| Window edges | Slightly different count | The card uses a rolling 24h ending at refresh time; a manual WHERE event_time >= today() uses a calendar boundary, so counts at the edges differ. |
| Time zone | Apparent shift | event_time is in the server time zone; the card displays the window in your profile time zone. |
query_log flush latency | Vortex IQ may lag seconds | query_log is buffered and flushed periodically (default ~7.5s). A kill in the last few seconds may not yet be in the table for either reader. |
| Multi-node summing | Vortex IQ higher | The card sums kills across cluster nodes; a single-node query sees only one node. |
query_log retention/TTL | Both lose old rows | If query_log has a short TTL, very old kills are gone, but the 24h window is well inside any sane retention. |
| Card | Expected relationship | What causes divergence |
|---|---|---|
clickhouse.memory-usage | Server-wide kills should coincide with a high Memory Usage %. | Kills with low Memory Usage % mean a per-query (not server-wide) ceiling is the cause, the most common and least alarming case. |
clickhouse.failed-queries-24h | Memory kills are a subset of failed queries. | If the two move together, memory is your dominant failure cause; if failed queries rise without memory kills, look elsewhere (parts, timeouts). |
Known limitations / FAQs
The card shows kills but my server has plenty of free RAM. How? Because the limit that was hit is almost certainly the per-querymax_memory_usage (default 10 GB) or the per-user max_memory_usage_for_user, not the server-wide ceiling. A single query is allowed only its own budget, no matter how much total RAM is free. Confirm by reading memory_usage on the killed rows in system.query_log: if they all peak at the same round number (10 GB), that is the per-query ceiling, and the fix is the query or a higher per-query limit, not more server RAM.
Should I just raise max_memory_usage to make the kills stop?
Usually not as the first move. Raising the per-query ceiling lets one heavy query consume more of the shared server budget, which can starve or kill other queries (turning a localised per-query problem into a server-wide one). Prefer to bound the query, add a materialised view, or enable spilling to disk via max_bytes_before_external_group_by / max_bytes_before_external_sort. Raise the ceiling only when you have confirmed the query is legitimately large and the server has headroom to spare.
What is the difference between this and an OS OOM-killer event?
MEMORY_LIMIT_EXCEEDED (code 241) is ClickHouse politely aborting one query when its tracked memory crosses a configured ceiling; the server stays up and other queries continue. An OS OOM-kill is the Linux kernel terminating the whole clickhouse-server process when the box runs out of physical memory; that looks like a crash and a restart, not a single failed query. This card counts the former. If you see restarts instead of code-241 errors, your max_server_memory_usage is set too high relative to physical RAM.
Can memory kills affect queries other than the one that was killed?
If the server-wide ceiling (max_server_memory_usage) is the one being hit, yes: ClickHouse will start aborting whichever queries push total memory over the limit, which can include innocent bystanders. That is the dangerous case and why the threshold is > 0. If only the per-query ceiling is hit, the blast radius is limited to the offending query. Use Memory Usage % to tell the two apart.
Why does the same query count multiple times?
The card counts executions, not distinct statements. A dashboard tile or scheduled job that reruns the same offending query every few minutes generates one kill per run, which is exactly the signal you want: a recurring auto-refreshing query is far more urgent than a one-off ad-hoc mistake, because it will keep happening until someone stops it.
We deliberately run huge ad-hoc analytics that sometimes get killed. Can we stop the alerts?
Yes. This is a Sensitivity card with a per-profile threshold. If a known-heavy ad-hoc workload makes occasional kills normal for you, raise the threshold above your typical daily count, or scope a separate user with its own max_memory_usage_for_user and exclude it. Be cautious: a higher threshold also hides a genuinely new offending query, so prefer fixing or bounding the heavy queries over muting the card.
Does spilling to disk count as a kill?
No. When max_bytes_before_external_group_by or max_bytes_before_external_sort is set, ClickHouse spills the intermediate state to disk and the query succeeds, slower but alive. Those queries never reach code 241 and never appear on this card. Enabling spilling is one of the cleanest ways to convert a recurring memory kill into a (slower) successful query.