Memory Usage %, CockroachDB - Vortex IQ Help Centre

Card class: Sensitivity • Category: Capacity

At a glance

Memory Usage % is the proportion of available RAM that the busiest CockroachDB node is consuming, expressed as a percentage of the host (or pod) memory limit. It is the headline capacity signal for a cluster under load: CockroachDB holds the Pebble block cache, in-flight SQL operator memory, and per-statement working sets in RAM, so a node trending toward its ceiling is at risk of OOM-kill, which removes the node from the cluster and forces lease transfers and rebalancing. Reading this card answers the operator question, “how much headroom does my hottest node have before something gets evicted or killed?”


What it tracks	Memory Usage % for the selected period: resident memory in use on the highest-utilisation node, divided by the node’s memory limit. The headline shows the worst node, not the cluster average, because OOM is a per-node event.
Data source	The `sys.rss` time-series metric (resident set size) measured against the node memory limit, surfaced through the `_status/nodes` endpoint and the DB Console Hardware dashboard. The SQL memory budget is governed by `--max-sql-memory`; the Pebble cache by `--cache`.
Time window	`RT` (real-time, refreshed on each poll). Memory pressure can build in seconds during an expensive query, so this is read live rather than averaged.
Alert trigger	`> 85%`. Sustained usage above 85% on any node is the warning band before OOM-kill risk becomes material; Go runtime and OS overhead leave little safety margin above this.
Roles	DBA, platform, SRE

Calculation

The card reads the sys.rss metric (the node process’s resident set size, in bytes) for each node and divides it by that node’s configured memory limit. On a bare-metal or VM deployment the limit is the host RAM available to the cockroach process; on Kubernetes it is the pod memory limit. The percentage is computed per node, and the card surfaces the maximum across the cluster, because memory exhaustion is local: one node hitting its ceiling is OOM-killed regardless of how much spare RAM the other nodes have.

Two budgets dominate CockroachDB memory and both feed this number. The Pebble block cache (set by --cache; 128 MiB by default, with 25% of system memory or more recommended for production) is a long-lived allocation that fills early and stays full; that is healthy, not a leak. The SQL memory pool (set by --max-sql-memory, default 25%) grows and shrinks with query complexity: hash joins, large sorts, and wide aggregations draw from it, and a single unbounded query can push a node into the alert band on its own. Anything above those two budgets is mostly Go runtime overhead, goroutine stacks, and other process memory outside the two pools.

Because the cache budget is effectively permanent, a node at rest typically sits around the 30 to 50% band; the movement that matters is the SQL pool expanding under load.
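
The per-node division and the max-across-cluster step can be sketched as follows. This is a minimal illustration rather than the card's actual implementation: the function names are hypothetical and the byte values are invented for the example.

```python
GIB = 1024 ** 3


def usage_pct(rss_bytes: int, limit_bytes: int) -> float:
    """Resident set size as a percentage of the node's memory limit."""
    if limit_bytes <= 0:
        raise ValueError("memory limit must be positive")
    return 100.0 * rss_bytes / limit_bytes


def headline(nodes: dict) -> tuple:
    """Return (node, usage %) for the highest-utilisation node.

    `nodes` maps a node name to (rss_bytes, limit_bytes).
    """
    per_node = {name: usage_pct(rss, limit) for name, (rss, limit) in nodes.items()}
    worst = max(per_node, key=per_node.get)
    return worst, per_node[worst]


# Invented sample: each node is measured against its own limit.
sample = {
    "n1": (6 * GIB, 16 * GIB),
    "n2": (12 * GIB, 16 * GIB),
    "n3": (4 * GIB, 8 * GIB),
}
worst_node, worst_pct = headline(sample)
print(worst_node, worst_pct)
```

Dividing each node by its own limit matters when nodes are sized differently: n3 here uses the least memory in absolute terms but is not the least utilised.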

Worked example

A platform team runs a 6-node CockroachDB cluster (each node a 16 GB VM, --cache=4GiB, --max-sql-memory=4GiB) backing the order and inventory services for an ecommerce estate. Snapshot taken on 14 Apr 26 at 20:05 BST, during an evening promotional push.

Node	RSS in use	Memory limit	Usage %	State
n1	6.1 GB	16 GB	38%	healthy
n2	6.4 GB	16 GB	40%	healthy
n3	14.2 GB	16 GB	89%	alert
n4	6.2 GB	16 GB	39%	healthy
n5	5.9 GB	16 GB	37%	healthy
n6	6.3 GB	16 GB	39%	healthy

The card headline reads 89% in the red band, because it reports the worst node (n3), not the 47% cluster mean. The skew is the diagnostic: five nodes sit near 38% while n3 alone is under pressure. That pattern points away from “the cluster needs more RAM” and toward “one node is doing disproportionate work.” The on-call DBA correlates with two sibling cards. Range Lease Balance Skew % shows n3 holding 31% of leaseholders against an even share of 17%, confirming n3 is a hot node. Statement Latency p99 (ms) shows p99 climbing on statements routed through n3. The root cause is a reporting query running a large unbounded sort over ranges whose leaseholders sit on n3, which inflated the SQL pool there.
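
The figures in this example can be reproduced from the table with a few lines of arithmetic. The sketch below uses only the RSS values and the 16 GB limit given above; the variable names are illustrative.

```python
from statistics import mean

LIMIT_GB = 16.0
rss_gb = {"n1": 6.1, "n2": 6.4, "n3": 14.2, "n4": 6.2, "n5": 5.9, "n6": 6.3}

# Per-node usage, rounded to whole percentages as in the table.
usage = {node: round(100 * rss / LIMIT_GB) for node, rss in rss_gb.items()}

worst = max(usage, key=usage.get)           # the node the headline reports
cluster_mean = round(mean(usage.values()))  # the figure the headline avoids
even_share = round(100 / len(rss_gb))       # even leaseholder share across 6 nodes

print(usage)
print(worst, usage[worst], cluster_mean, even_share)
```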

What happens if n3 crosses the line:
  - At ~95% RSS the OS OOM-killer terminates the cockroach process on n3.
  - n3 drops out of node liveness within ~9s; its leases transfer to peers.
  - Affected ranges briefly under-replicate; the balancer re-replicates.
  - SQL connections pinned to n3 are dropped; clients reconnect to survivors.
  - Net effect during the promo: a latency spike and a wave of retryable errors
    on the order-write path at exactly the wrong moment.
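
To put a number on how close n3 is to that point, the remaining headroom can be estimated as below. This is a rough sketch: the ~95% kill point is the approximate figure from this example, not a fixed kernel threshold, and the function name is hypothetical.

```python
def headroom_gb(rss_gb: float, limit_gb: float, kill_pct: float = 95.0) -> float:
    """GB of RSS growth left before the assumed kill point is reached."""
    return limit_gb * kill_pct / 100 - rss_gb


n3_headroom = headroom_gb(14.2, 16.0)  # the alerting node
n1_headroom = headroom_gb(6.1, 16.0)   # a healthy peer, for contrast

print(f"n3: {n3_headroom:.1f} GB, n1: {n1_headroom:.1f} GB")
```

On these numbers n3 has roughly 1 GB left, which a single large sort can consume, while its peers have about 9 GB each.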

The DBA does not wait for the kill. They cancel the offending session, add a LIMIT/index to the report query so it stops spilling, and schedule a review of --max-sql-memory sizing. Three takeaways:

Read the worst node, not the average. A 47% cluster mean looks comfortable and is misleading; OOM is per-node, so the max is the number that can page you.
A single node trending up alone is a workload-distribution problem, not a sizing problem. Adding RAM cluster-wide papers over a hot-range or expensive-query issue that will recur.
The cache budget is supposed to be full. Do not chase the Pebble cache allocation as a leak; chase the SQL pool growth, which is the part that moves with traffic.

Sibling cards

Card	Why pair it with Memory Usage %	What the combination tells you
Database Disk Usage %	The other half of node capacity.	Memory and disk both climbing equals genuine capacity exhaustion; only memory climbing equals a workload or query problem.
Connection Pool Saturation %	Each connection consumes SQL memory.	High saturation plus high memory equals connection pressure inflating the SQL pool; cap the pool before adding RAM.
Range Lease Balance Skew %	Explains a single hot node.	Memory skewed to one node tracking lease skew confirms a hot-range cause rather than under-provisioning.
Statement Latency p99 (ms)	Memory pressure shows up as tail latency.	Rising memory with rising p99 on the same node is the spill-to-pool signature.
Cluster Node Count	Confirms whether an OOM has already happened.	A node-count dip following a memory spike is an OOM-kill you can now confirm.
Statements per Second (live)	The load driver behind memory growth.	Memory rising with QPS is expected scaling; memory rising with flat QPS is a single expensive query.
CockroachDB Health Score	The composite that weights capacity signals.	Sustained memory alert pulls the health score down even while other signals stay green.
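
Some of these pairings can be read as a simple decision rule. The sketch below encodes only the Database Disk Usage % and Statements per Second rows; the function name, the boolean inputs, and the order in which the checks are applied are assumptions made for illustration, not product logic.

```python
def read_pairing(memory_rising: bool, disk_rising: bool, qps_rising: bool) -> str:
    """Interpret a memory trend alongside two sibling cards."""
    if not memory_rising:
        return "no memory pressure to explain"
    if disk_rising:
        # Memory and disk climbing together.
        return "genuine capacity exhaustion"
    if qps_rising:
        # Memory tracking load.
        return "expected scaling with load"
    # Memory climbing while load is flat.
    return "single expensive query"


print(read_pairing(memory_rising=True, disk_rising=False, qps_rising=False))
```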

Reconciling against the source

To confirm the figure natively, open the DB Console Hardware dashboard and read the Memory Usage time series per node, or query the status endpoint for sys.rss against the node memory limit. The same per-node RSS is exposed in crdb_internal.node_metrics (filter for the sys.rss metric). On CockroachDB Cloud the equivalent appears on the cluster Metrics page under the memory chart for each node.
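
As a hedged sketch of the status-endpoint check, the snippet below extracts per-node sys.rss from a response body and divides by a limit. The JSON shape (a `nodes` array whose entries carry `desc.nodeId` and a `metrics` map) is an assumption trimmed to the fields used here, so verify it against your cluster's actual response; the sample values and the 16 GB limit are invented.

```python
import json

# Assumed, trimmed response shape; real responses carry many more fields.
SAMPLE = """
{"nodes": [
  {"desc": {"nodeId": 1}, "metrics": {"sys.rss": 6100000000}},
  {"desc": {"nodeId": 3}, "metrics": {"sys.rss": 14200000000}}
]}
"""

LIMIT_BYTES = 16_000_000_000  # hypothetical per-node limit (16 GB)


def rss_by_node(payload: str) -> dict:
    """Map node ID to resident set size in bytes."""
    doc = json.loads(payload)
    return {
        int(node["desc"]["nodeId"]): float(node["metrics"]["sys.rss"])
        for node in doc["nodes"]
    }


usage = {node: 100 * rss / LIMIT_BYTES for node, rss in rss_by_node(SAMPLE).items()}
worst = max(usage, key=usage.get)
print(worst, round(usage[worst], 2))
```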

Reason our number may differ	Direction	Why
Average vs max. The DB Console can show a cluster-wide or per-node view.	Vortex IQ usually higher	The card reports the worst node by design; a cluster average will read lower.
Limit basis. Host RAM vs container/pod limit.	Variable	On Kubernetes the percentage is against the pod limit, not node RAM; confirm which denominator the native view uses.
Sampling moment. RT poll vs a 10s/30s rolled chart.	Marginal	A short SQL-pool spike can be caught live by the card but smoothed away on a longer-resolution chart.
Time zone. Chart axes render in the cluster locale; Vortex IQ aligns to your reporting time zone.	Cosmetic	Axis labels shift; values do not.
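
The limit-basis row is usually the largest source of divergence, and a short calculation shows why. The sketch assumes a pod limited to 16 GB running on a host with 24 GB of RAM; the 24 GB figure is hypothetical and chosen only to illustrate the effect.

```python
RSS_GB = 14.2
POD_LIMIT_GB = 16.0  # denominator against the pod limit
HOST_RAM_GB = 24.0   # hypothetical denominator for a host-level monitor

card_pct = round(100 * RSS_GB / POD_LIMIT_GB)
host_pct = round(100 * RSS_GB / HOST_RAM_GB)

print(card_pct, host_pct)
```

The same resident memory reads about 30 points lower against host RAM, even though the process is no further from its pod limit.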

For divergence investigations use Vortex Mind to trace the spike back to the session or range that drove it.

Known limitations / FAQs

My node sits at 50% memory all day even when idle. Is that a leak?
No. The Pebble block cache (--cache, typically set to 25% of system memory in production) fills early and stays full because it caches hot data. That allocation alone explains a resting baseline around the 30 to 50% band. Look at how far above baseline you climb under load, not the absolute resting figure.

The card shows 89% but my monitoring of the host shows 60%. Which is right?
Both, measured differently. On a container deployment the card divides by the pod memory limit, while host-level monitoring divides by total node RAM, which can be much larger. Confirm the denominator. The number that predicts OOM-kill is the one against the cgroup/pod limit, which is what this card uses.

Why does the headline jump to the worst node instead of the average?
Because OOM-kill is a per-node event. A cluster averaging 47% with one node at 89% is one expensive query away from losing that node; an average would hide exactly the condition that pages you.

One node is high and the rest are low. Do I add RAM?
Usually not first. A single hot node points to skewed leaseholders or one expensive query, not under-provisioning. Check Range Lease Balance Skew % and the slow-statement view; rebalancing or fixing the query is cheaper and more durable than scaling every node.

Can I stop a runaway query before it OOMs the node?
Yes. Identify the session via SHOW STATEMENTS (or the DB Console SQL Activity page) and run CANCEL QUERY / CANCEL SESSION. Longer term, bound the SQL pool with --max-sql-memory so a single query cannot exhaust the node, and add the missing index or LIMIT so the operator stops spilling.

Does raising --max-sql-memory fix high memory usage?
It changes the trade-off rather than fixing it. A larger SQL pool lets big queries run in memory instead of erroring, but it also raises the ceiling a single query can reach, which can make OOM more likely, not less. Size it deliberately against your node RAM and the cache budget rather than maximising it.

Does the alert fire on a brief spike?
The 85% threshold is intended for sustained pressure. A momentary spike from one query that then completes will show on the live card but should not be treated as a standing incident; persistent residence above 85% is the actionable state.
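
One way to separate a brief spike from sustained pressure is to require several consecutive readings above the threshold. The sketch below is illustrative only: the article does not define how long "sustained" is, so `min_consecutive` and the sample series are assumptions.

```python
def sustained_breach(samples, threshold=85.0, min_consecutive=6):
    """True if `min_consecutive` readings in a row exceed `threshold`."""
    run = 0
    for pct in samples:
        # Extend the run on a breach, reset it on any reading at or below.
        run = run + 1 if pct > threshold else 0
        if run >= min_consecutive:
            return True
    return False


spike = [40, 88, 41, 40, 39, 40, 41, 40]      # one query, then recovery
pressure = [40, 86, 87, 88, 89, 90, 91, 40]   # six readings above 85%

print(sustained_breach(spike), sustained_breach(pressure))
```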

Tracked live in Vortex IQ Nerve Centre

Memory Usage % is one of hundreds of KPI pulses Vortex IQ tracks across CockroachDB and 70+ other ecommerce connectors. Nerve Centre runs the detection layer; Vortex Mind investigates the cause when something moves; Ask Viq lets you interrogate any number in plain English. Start for free or book a demo to see this metric running on your own data.