At a glance
Memory Usage % is the proportion of available RAM that the busiest CockroachDB node is consuming, expressed as a percentage of the host (or pod) memory limit. It is the headline capacity signal for a cluster under load: CockroachDB holds the Pebble block cache, in-flight SQL operator memory, and per-statement working sets in RAM, so a node trending toward its ceiling is at risk of OOM-kill, which removes the node from the cluster and forces lease transfers and rebalancing. Reading this card answers the operator question, “how much headroom does my hottest node have before something gets evicted or killed?”
| Field | Detail |
|---|---|
| What it tracks | Memory Usage % for the selected period: resident memory in use on the highest-utilisation node, divided by the node’s memory limit. The headline shows the worst node, not the cluster average, because OOM is a per-node event. |
| Data source | The sys.rss time-series metric (resident set size) measured against the node memory limit, surfaced through the _status/nodes endpoint and the DB Console Hardware dashboard. The SQL memory budget is governed by --max-sql-memory; the Pebble cache by --cache. |
| Time window | RT (real-time, refreshed on each poll). Memory pressure can build in seconds during an expensive query, so this is read live rather than averaged. |
| Alert trigger | > 85%. Sustained usage above 85% on any node is the warning band before OOM-kill risk becomes material; the default Go runtime and OS overhead leave little safety margin above this. |
| Roles | DBA, platform, SRE |
Calculation
The card reads the sys.rss metric (the node process resident set size, in bytes) for each node and divides it by that node’s configured memory limit. On a bare-metal or VM deployment the limit is the host RAM available to the cockroach process; on Kubernetes it is the pod memory limit. The percentage is computed per node, and the card surfaces the maximum across the cluster, because memory exhaustion is local: one node hitting its ceiling is OOM-killed regardless of how much spare RAM the other nodes have.
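The calculation above can be sketched in a few lines. This is a minimal illustration rather than the card's actual implementation; the `NodeSample` type and the sample figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSample:
    """One poll of a node: RSS and the memory limit it is measured against."""
    node_id: str
    rss_bytes: int     # value of sys.rss for the node
    limit_bytes: int   # host RAM on bare metal/VM, pod limit on Kubernetes


def usage_pct(sample: NodeSample) -> float:
    """Per-node Memory Usage %: RSS divided by that node's own limit."""
    if sample.limit_bytes <= 0:
        raise ValueError(f"node {sample.node_id} has no usable memory limit")
    return 100.0 * sample.rss_bytes / sample.limit_bytes


def headline(samples: list[NodeSample]) -> tuple[str, float]:
    """Return (node_id, usage %) for the busiest node, which the card headlines."""
    worst = max(samples, key=usage_pct)
    return worst.node_id, usage_pct(worst)


GIB = 1024 ** 3
samples = [  # hypothetical nodes with different limits
    NodeSample("a", 4 * GIB, 16 * GIB),
    NodeSample("b", 12 * GIB, 16 * GIB),
    NodeSample("c", 6 * GIB, 32 * GIB),
]
node, pct = headline(samples)
print(f"{node}: {pct:.0f}%")  # b: 75%
```

Each node is divided by its own limit before the maximum is taken, so a cluster with mixed node sizes is still compared on headroom rather than on raw bytes.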
Two budgets dominate CockroachDB memory and both feed this number. The Pebble block cache (set by --cache, commonly sized to 25% of system memory in production) is a long-lived allocation that fills early and stays full; it is healthy, not a leak. The SQL memory pool (set by --max-sql-memory, default 25%) grows and shrinks with query complexity: hash joins, large sorts, and wide aggregations draw on it, and a single unbounded query can push a node into the alert band on its own. Anything above those two budgets is Go runtime, goroutine stacks, and OS page cache. Because the cache budget is effectively permanent, a node at rest typically sits around the 30 to 50% band; the movement that matters is the SQL pool expanding under load.
Worked example
A platform team runs a 6-node CockroachDB cluster (each node a 16 GB VM, --cache=4GiB, --max-sql-memory=4GiB) backing the order and inventory services for an ecommerce estate. Snapshot taken on 14 Apr 26 at 20:05 BST, during an evening promotional push.
| Node | RSS in use | Memory limit | Usage % | State |
|---|---|---|---|---|
| n1 | 6.1 GB | 16 GB | 38% | healthy |
| n2 | 6.4 GB | 16 GB | 40% | healthy |
| n3 | 14.2 GB | 16 GB | 89% | alert |
| n4 | 6.2 GB | 16 GB | 39% | healthy |
| n5 | 5.9 GB | 16 GB | 37% | healthy |
| n6 | 6.3 GB | 16 GB | 39% | healthy |
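The headline reads 89% because n3 is above the 85% alert threshold, while the other five nodes sit between 37% and 40%. The follow-up is to add a LIMIT/index to the report query so it stops spilling, and schedule a review of --max-sql-memory sizing. Three takeaways: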
- Read the worst node, not the average. A 47% cluster mean looks comfortable and is misleading; OOM is per-node, so the max is the number that can page you.
- A single node trending up alone is a workload-distribution problem, not a sizing problem. Adding RAM cluster-wide papers over a hot-range or expensive-query issue that will recur.
- The cache budget is supposed to be full. Do not chase the Pebble cache allocation as a leak; chase the SQL pool growth, which is the part that moves with traffic.
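The first two takeaways can be made concrete with the figures from the table. The sketch below is illustrative only: the 85% threshold comes from this card, but the 20-point gap used to label a node as an outlier is an assumed value, not something the card is known to apply.

```python
from statistics import mean, median

ALERT_PCT = 85.0         # alert threshold used by this card
OUTLIER_GAP_PCT = 20.0   # assumed: gap above the median that marks a lone hot node

# Usage % per node, taken from the worked-example table.
usage = {"n1": 38, "n2": 40, "n3": 89, "n4": 39, "n5": 37, "n6": 39}


def classify(usage_by_node: dict[str, float]) -> str:
    """Label the cluster state from per-node usage percentages."""
    worst = max(usage_by_node.values())
    if worst <= ALERT_PCT:
        return "healthy"
    typical = median(usage_by_node.values())
    # One node far above the rest suggests skewed load, not undersized nodes.
    if worst - typical >= OUTLIER_GAP_PCT:
        return "hot node"
    return "cluster-wide pressure"


print(f"mean={mean(usage.values()):.0f}% max={max(usage.values())}%")  # mean=47% max=89%
print(classify(usage))  # hot node
```

The mean lands at 47% while the maximum is 89%, which is the gap the first takeaway warns about; the median comparison is one simple way to tell a lone hot node from a cluster that is uniformly short of memory.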
Sibling cards
| Card | Why pair it with Memory Usage % | What the combination tells you |
|---|---|---|
| Database Disk Usage % | The other half of node capacity. | Memory and disk both climbing equals genuine capacity exhaustion; only memory climbing equals a workload or query problem. |
| Connection Pool Saturation % | Each connection consumes SQL memory. | High saturation plus high memory equals connection pressure inflating the SQL pool; cap the pool before adding RAM. |
| Range Lease Balance Skew % | Explains a single hot node. | Memory skewed to one node tracking lease skew confirms a hot-range cause rather than under-provisioning. |
| Statement Latency p99 (ms) | Memory pressure shows up as tail latency. | Rising memory with rising p99 on the same node is the spill-to-pool signature. |
| Cluster Node Count | Confirms whether an OOM has already happened. | A node-count dip following a memory spike is an OOM-kill you can now confirm. |
| Statements per Second (live) | The load driver behind memory growth. | Memory rising with QPS is expected scaling; memory rising with flat QPS is a single expensive query. |
| CockroachDB Health Score | The composite that weights capacity signals. | Sustained memory alert pulls the health score down even while other signals stay green. |
Reconciling against the source
To confirm the figure natively, open the DB Console Hardware dashboard and read the Memory Usage time series per node, or query the status endpoint for sys.rss against the node memory limit. The same per-node RSS is exposed in crdb_internal.node_metrics (filter for the sys.rss metric). On CockroachDB Cloud the equivalent appears on the cluster Metrics page under the memory chart for each node.
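A reconciliation script might look like the sketch below. It parses a response shaped like the _status/nodes payload and computes per-node usage. The field names (`nodes`, `desc.nodeId`, `metrics`) and the approach of supplying the limit yourself are assumptions for illustration, so check them against the response your CockroachDB version actually returns. The sample payload is made up.

```python
import json

# Hypothetical, trimmed payload; a real response carries many more fields.
SAMPLE = """
{"nodes": [
  {"desc": {"nodeId": 1}, "metrics": {"sys.rss": 6442450944}},
  {"desc": {"nodeId": 2}, "metrics": {"sys.rss": 12884901888}}
]}
"""


def rss_by_node(payload: str) -> dict[int, float]:
    """Extract sys.rss (bytes) per node ID from a status payload."""
    doc = json.loads(payload)
    return {
        int(node["desc"]["nodeId"]): float(node["metrics"]["sys.rss"])
        for node in doc.get("nodes", [])
    }


def usage_pct_by_node(payload: str, limit_bytes: int) -> dict[int, float]:
    """Divide each node's RSS by the limit you have confirmed for it."""
    return {
        node_id: round(100.0 * rss / limit_bytes, 1)
        for node_id, rss in rss_by_node(payload).items()
    }


LIMIT = 16 * 1024 ** 3  # assumed: every node has a 16 GiB limit
print(usage_pct_by_node(SAMPLE, LIMIT))  # {1: 37.5, 2: 75.0}
```

The largest value in the result is the one to compare with the card's headline, provided both are read at roughly the same moment and against the same limit.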
| Reason our number may differ | Direction | Why |
|---|---|---|
| Average vs max. The DB Console can show a cluster-wide or per-node view. | Vortex IQ usually higher | The card reports the worst node by design; a cluster average will read lower. |
| Limit basis. Host RAM vs container/pod limit. | Variable | On Kubernetes the percentage is against the pod limit, not node RAM; confirm which denominator the native view uses. |
| Sampling moment. RT poll vs a 10s/30s rolled chart. | Marginal | A short SQL-pool spike can be caught live by the card but smoothed away on a longer-resolution chart. |
| Time zone. Chart axes render in the cluster locale; Vortex IQ aligns to your reporting time zone. | Cosmetic | Axis labels shift; values do not. |
Known limitations / FAQs
My node sits at 50% memory all day even when idle. Is that a leak?
No. The Pebble block cache (--cache, commonly sized to 25% of system memory in production) fills early and stays full because it caches hot data. That allocation alone explains a resting baseline around the 30 to 50% band. Look at how far above baseline you climb under load, not the absolute resting figure.
The card shows 89% but my monitoring of the host shows 60%. Which is right?
Both, measured differently. On a container deployment the card divides by the pod memory limit, while host-level monitoring divides by total node RAM, which can be much larger. Confirm the denominator. The number that predicts OOM-kill is the one against the cgroup/pod limit, which is what this card uses.
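The difference between the two readings is purely the denominator, as the sketch below shows. The figures are hypothetical (14.2 GB of RSS, a 16 GB pod limit, a 24 GB host), and the helper assumes a cgroup v2 host where the container limit is exposed in /sys/fs/cgroup/memory.max, a file that holds either a byte count or the literal `max` when no limit is set.

```python
from typing import Optional


def parse_cgroup_limit(raw: str) -> Optional[int]:
    """Parse cgroup v2 memory.max content; 'max' means no container limit."""
    value = raw.strip()
    return None if value == "max" else int(value)


def effective_limit(cgroup_raw: str, host_ram_bytes: int) -> int:
    """Use the container limit when one is set, otherwise fall back to host RAM."""
    limit = parse_cgroup_limit(cgroup_raw)
    return host_ram_bytes if limit is None else min(limit, host_ram_bytes)


GB = 10 ** 9
rss = 14_200_000_000                    # hypothetical RSS, 14.2 GB
pod_limit, host_ram = 16 * GB, 24 * GB  # hypothetical limits

against_pod = 100 * rss / effective_limit(str(pod_limit), host_ram)
against_host = 100 * rss / effective_limit("max", host_ram)
print(f"pod limit: {against_pod:.0f}%  host RAM: {against_host:.0f}%")
# pod limit: 89%  host RAM: 59%
```

The same 14.2 GB reads as alert-band pressure against the pod limit and as comfortable against the host, which is why the denominator has to be confirmed before the two numbers are compared.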
Why does the headline jump to the worst node instead of the average?
Because OOM-kill is a per-node event. A cluster averaging 47% with one node at 89% is one expensive query away from losing that node; an average would hide exactly the condition that pages you.
One node is high and the rest are low. Do I add RAM?
Usually not first. A single hot node points to skewed leaseholders or one expensive query, not under-provisioning. Check Range Lease Balance Skew % and the slow-statement view; rebalancing or fixing the query is cheaper and more durable than scaling every node.
Can I stop a runaway query before it OOMs the node?
Yes. Identify the session via SHOW STATEMENTS (or the DB Console SQL Activity page) and run CANCEL QUERY / CANCEL SESSION. Longer term, bound the SQL pool with --max-sql-memory so a single query cannot exhaust the node, and add the missing index or LIMIT so the operator stops spilling.
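As a rough sketch of the first step, the helper below picks the longest-running statement from rows shaped like the output of SHOW STATEMENTS and builds the matching CANCEL QUERY command. The rows, table names, and query IDs are invented, only the `query_id`, `start`, and `query` columns are used, and choosing by age alone is an assumption; in practice you would confirm that the statement is the memory-heavy one before cancelling it.

```python
from datetime import datetime, timezone

# Invented rows; a real result set has more columns.
rows = [
    {"query_id": "16f2a1b3c4d5e6f70000000000000001",
     "start": datetime(2026, 4, 14, 19, 4, 58, tzinfo=timezone.utc),
     "query": "SELECT * FROM orders WHERE id = 42"},
    {"query_id": "16f2a1b3c4d5e6f70000000000000002",
     "start": datetime(2026, 4, 14, 18, 51, 10, tzinfo=timezone.utc),
     "query": "SELECT customer_id, sum(total) FROM orders GROUP BY customer_id"},
]


def oldest_statement(statements: list[dict]) -> dict:
    """Return the statement that has been running longest (earliest start)."""
    return min(statements, key=lambda row: row["start"])


def cancel_command(statement: dict) -> str:
    """Build a CANCEL QUERY statement for the given row."""
    query_id = statement["query_id"]
    # Assumed: query IDs are lowercase hex; refuse anything else rather than quote it blindly.
    if not query_id or not all(ch in "0123456789abcdef" for ch in query_id):
        raise ValueError(f"unexpected query_id: {query_id!r}")
    return f"CANCEL QUERY '{query_id}';"


target = oldest_statement(rows)
print(target["query"])
print(cancel_command(target))
```

The generated command is printed rather than executed, so an operator can review it before running it in a SQL session.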
Does raising --max-sql-memory fix high memory usage?
It changes the trade-off rather than fixing it. A larger SQL pool lets big queries run in memory instead of erroring, but it also raises the ceiling a single query can reach, which can make OOM more likely, not less. Size it deliberately against your node RAM and the cache budget rather than maximising it.
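One way to sanity-check a proposed setting is simple budget arithmetic. The sketch below assumes a rule of thumb of the form (2 × --max-sql-memory) + --cache ≤ 80% of the memory available to the process. Treat both the factor of two and the 80% ceiling as assumptions to verify against the CockroachDB production guidance for your version; they are exposed as parameters for that reason.

```python
GIB = 1024 ** 3


def fits_budget(
    ram_bytes: int,
    cache_bytes: int,
    sql_bytes: int,
    sql_factor: float = 2.0,  # assumed headroom multiplier for the SQL pool
    ceiling: float = 0.80,    # assumed share of RAM the budgets may claim
) -> bool:
    """True when the weighted budgets stay under the chosen share of RAM."""
    return sql_factor * sql_bytes + cache_bytes <= ceiling * ram_bytes


# Flags from the worked example, on a node assumed to have 16 GiB of RAM.
print(fits_budget(16 * GIB, cache_bytes=4 * GIB, sql_bytes=4 * GIB))  # True
# Doubling the SQL pool without touching the cache.
print(fits_budget(16 * GIB, cache_bytes=4 * GIB, sql_bytes=8 * GIB))  # False
```

Under these assumptions the worked-example sizing fits with a little room to spare, while doubling the SQL pool alone does not, which illustrates why raising --max-sql-memory is a trade-off rather than a fix.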
Does the alert fire on a brief spike?
The 85% threshold is intended for sustained pressure. A momentary spike from one query that then completes will show on the live card but should not be treated as a standing incident; persistent residence above 85% is the actionable state.
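The distinction between a spike and sustained pressure can be expressed as a run-length check over recent polls. The 85% threshold comes from this card; the requirement of five consecutive samples is an assumed value for illustration, since the card's actual debounce window is not stated here.

```python
from typing import Iterable

ALERT_PCT = 85.0
MIN_CONSECUTIVE = 5  # assumed: polls in a row needed to call it sustained


def is_sustained(samples: Iterable[float],
                 threshold: float = ALERT_PCT,
                 min_run: int = MIN_CONSECUTIVE) -> bool:
    """True if any run of min_run consecutive samples exceeds the threshold."""
    run = 0
    for pct in samples:
        run = run + 1 if pct > threshold else 0
        if run >= min_run:
            return True
    return False


spike = [40, 41, 91, 88, 42, 40, 39]     # one query, then recovery
pressure = [70, 86, 87, 89, 90, 88, 87]  # stays above the threshold
print(is_sustained(spike), is_sustained(pressure))  # False True
```

A single sample back under the threshold resets the count, so a node that oscillates around 85% would need a longer window or a different rule to be flagged.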