Connection Pool at >90% Saturation, CockroachDB

Card class: Hero • Category: Nerve Centre

At a glance

This card lists every firing of the Connection Pool at >90% Saturation alert: each moment where open SQL connections crossed 90% of the cluster’s configured connection ceiling and stayed there for at least one continuous minute. This is the “we are about to start refusing connections” warning. When this card lights up, application workers are queuing for a session slot, request latency climbs, and the next deploy or traffic spike will start throwing connection errors. For a DBA or SRE team this is a capacity emergency in slow motion: you usually have minutes, not seconds, to act before clients see failures.


What it tracks	Alerts for Connection Pool at >90% Saturation: each firing is a sustained breach of the 90% saturation threshold.
Data source	Ratio of `sql.conns` (open SQL connections, summed across live nodes) to the configured ceiling: the cluster setting `server.max_connections_per_gateway` multiplied by gateway nodes, or the CockroachDB Cloud plan connection limit. Same series shown in the DB Console SQL dashboard “Open SQL Sessions” panel.
Metric basis	Saturation percentage, not raw connection count. A small cluster at 95% is more urgent than a large cluster at 60% even though the large one has more absolute connections.
Time window	`RT` (real time), evaluated continuously; the alert requires the breach to be sustained for 1 minute to avoid firing on momentary bursts.
Alert trigger	`>90% sustained 1m`: saturation above 90% held for at least one continuous minute.
What counts as a firing	A minute-long window where pool saturation stayed above 90%. A 20-second spike to 99% that recovers does not fire; a steady 91% for 60 seconds does.
What does NOT fire	(1) Transient spikes shorter than the 1-minute sustain; (2) High connection count that is still comfortably under 90% of the ceiling; (3) Per-node hotspots that average out below 90% cluster-wide (watch Connection Pool Saturation % for the per-node view).
Roles	DBA, platform, SRE

Calculation

The underlying signal is connection-pool saturation, defined as:

saturation% = (open SQL connections / configured connection ceiling) * 100

The numerator is the cluster-wide sum of the sql.conns gauge (open SQL connections per node). The denominator is the connection ceiling: on self-hosted clusters this is server.max_connections_per_gateway applied across the gateway nodes that accept client traffic; on CockroachDB Cloud it is the plan’s connection limit (visible on the cluster’s Overview and enforced by the managed proxy). The alert engine evaluates saturation on every poll and opens a firing only when the value stays above 90% for a continuous 60-second window. The 1-minute sustain is deliberate: connection counts are spiky (a batch job opening 50 sessions then releasing them is normal), and alerting on every spike would bury the genuine “the pool is full and staying full” signal. Each firing carries the peak saturation reached, the gateway node(s) most loaded, and the open-connection count at trigger time so the on-call engineer can size the response.
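The formula and the sustain rule above can be sketched in a few lines of Python. This is a minimal illustration, not the alert engine's actual implementation: the function names, the sample format, and the 10-second poll spacing used in the test data are assumptions made for the example. Only the 90% threshold and the 60-second sustain come from the card definition.

```python
from typing import Iterable, Optional, Tuple

THRESHOLD_PCT = 90.0   # alert trigger from the card definition
SUSTAIN_SECONDS = 60   # 1-minute sustain from the card definition


def saturation_pct(open_conns: int, ceiling: int) -> float:
    """saturation% = (open SQL connections / configured ceiling) * 100."""
    if ceiling <= 0:
        raise ValueError("ceiling must be positive")
    return open_conns / ceiling * 100


def first_firing(samples: Iterable[Tuple[int, int]], ceiling: int) -> Optional[int]:
    """Return the timestamp (seconds) at which a breach has been held for
    SUSTAIN_SECONDS, or None if no breach lasts that long.

    `samples` is a time-ordered iterable of (timestamp_seconds, open_conns).
    """
    breach_start = None
    for ts, conns in samples:
        if saturation_pct(conns, ceiling) > THRESHOLD_PCT:
            if breach_start is None:
                breach_start = ts          # breach begins at this poll
            if ts - breach_start >= SUSTAIN_SECONDS:
                return ts                  # held long enough: fire
        else:
            breach_start = None            # recovered: reset the timer
    return None


# Hypothetical polls every 10 seconds against a 2,500-connection ceiling.
spike = [(0, 2000), (10, 2475), (20, 2475), (30, 2000), (40, 2000)]
steady = [(t, 2275) for t in range(0, 70, 10)]  # 2,275 / 2,500 = 91%

print(first_firing(spike, 2500))   # brief spike to 99% recovers, no firing
print(first_firing(steady, 2500))  # 91% held for 60 seconds fires
```

Resetting the timer on any sample at or below the threshold is what makes the sustain "continuous": two 40-second breaches separated by a dip do not add up to a firing in this sketch.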

Worked example

A platform team runs a 5-node CockroachDB self-hosted cluster backing the order and inventory services for a high-traffic retail API. server.max_connections_per_gateway is set to 500, and all 5 nodes accept client traffic, giving a cluster ceiling of 2,500 connections. Snapshot taken on 14 Apr 26 at 20:05 BST, during an evening flash-sale ramp.

Time (BST)	Open connections	Ceiling	Saturation	State
19:55	1,420	2,500	57%	healthy
20:01	2,180	2,500	87%	climbing
20:03	2,295	2,500	92%	breach starts
20:04	2,340	2,500	94%	sustained
20:05	2,360	2,500	94%	alert fires
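The Saturation column in the table above can be reproduced from the open-connection counts and the 2,500 ceiling. The short Python sketch below does that arithmetic and also prints the remaining headroom in connections; the variable and function names are illustrative only.

```python
# Ceiling from the worked example: 5 gateway nodes x 500 per gateway.
CEILING = 5 * 500

# (time, open connections) taken from the table above.
snapshots = [
    ("19:55", 1420),
    ("20:01", 2180),
    ("20:03", 2295),
    ("20:04", 2340),
    ("20:05", 2360),
]


def saturation_rows(rows, ceiling):
    """Return (time, open_conns, rounded saturation %, headroom) per snapshot."""
    out = []
    for time_label, conns in rows:
        pct = conns / ceiling * 100
        out.append((time_label, conns, round(pct), ceiling - conns))
    return out


for time_label, conns, pct, headroom in saturation_rows(snapshots, CEILING):
    print(f"{time_label}  {conns:>5,}  {pct:>3}%  headroom {headroom:,}")
```

At 20:05 the headroom is 140 connections, which is the gap the cost framing later in this example refers to.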

Saturation crossed 90% at 20:03 and stayed above it. By 20:04 the breach had been sustained for a full minute, so the card fired at 20:05 with peak saturation 94% and 2,360 open connections, concentrated on gateway nodes 2 and 4 (which sit behind the load balancer’s primary targets). What the on-call SRE does with this:

Confirm the cause is real demand, not a leak. Pull Connections In Use trend. A smooth ramp tracking traffic means genuine load; a vertical climb with flat request volume means a client pool is leaking sessions (often a service that opens connections but never returns them to its pool).
Check whether it is hurting yet. Cross-read Statement Latency p95 (ms). If p95 has climbed in step with saturation, application workers are already waiting on session acquisition.
Relieve pressure in the right order. Short term: shed non-critical sessions (pause the analytics/BI pool, throttle the batch importer). Medium term: raise server.max_connections_per_gateway if node memory allows, or add a gateway node to widen the ceiling. Correct long-term fix: front the cluster with a connection pooler so thousands of app threads multiplex onto a bounded server-side pool.

Cost framing of leaving it unaddressed:
  - At 94% with traffic still ramping, the next +6% of demand (about 140 more connections against the 2,500 ceiling) exhausts the pool.
  - Once full, new connections are refused: app workers throw "too many clients" errors.
  - During a flash sale, refused connections map directly to failed checkouts.
  - Acting at 94% (now) is a 5-minute config change; acting after exhaustion is an incident.

Three takeaways for the team:

90% is the act line, not the panic line. The 1-minute sustain means a firing is a real, settled condition, not noise. Treat every firing as “fix within minutes”, because the headroom above 90% disappears fast under load.
Saturation, not count, is the truth. “2,360 connections” sounds large but is meaningless without the ceiling. The same 2,360 on a 10-node cluster with a 5,000 ceiling is a calm 47%. Always read the percentage.
A pooler is the structural answer. Repeated firings during normal peaks mean the cluster is being asked to manage connection concurrency it should not. A pgbouncer-style pooler in front of CockroachDB bounds server-side connections regardless of how many app threads exist.
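To make the pooler idea concrete, here is a toy Python sketch of the bounding behaviour: many application threads share a fixed number of slots, so concurrent "server-side" usage never exceeds the bound. It is only a concurrency illustration under stated assumptions. A real pooler such as pgbouncer works at the wire-protocol level, and the `BoundedPool` class and `simulate` helper are hypothetical names invented for this example.

```python
import threading
import time


class BoundedPool:
    """Toy stand-in for a pooler: at most `size` callers hold a slot at once."""

    def __init__(self, size: int) -> None:
        self._slots = threading.BoundedSemaphore(size)
        self._lock = threading.Lock()
        self.in_use = 0
        self.peak = 0

    def run(self, work) -> None:
        with self._slots:                      # blocks while all slots are taken
            with self._lock:
                self.in_use += 1
                self.peak = max(self.peak, self.in_use)
            try:
                work()
            finally:
                with self._lock:
                    self.in_use -= 1


def simulate(app_threads: int, pool_size: int) -> int:
    """Run `app_threads` workers through the pool and return peak slots in use."""
    pool = BoundedPool(pool_size)
    workers = [
        threading.Thread(target=pool.run, args=(lambda: time.sleep(0.001),))
        for _ in range(app_threads)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return pool.peak


peak = simulate(app_threads=200, pool_size=10)
print(f"200 app threads, peak slots in use: {peak}")
```

However many threads are started, the peak stays at or below the pool size; the excess threads wait for a slot instead of opening more server connections.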

Sibling cards

Card	Why pair it with Connection Pool at >90% Saturation	What the combination tells you
Connection Pool Saturation %	The continuous gauge this alert is built on.	The alert tells you it crossed 90%; the gauge shows the live value and per-node spread.
Connections In Use	The raw numerator behind saturation.	Smooth climb equals real demand; vertical climb at flat traffic equals a pool leak.
Statement Latency p95 (ms)	The first place saturation pain shows up for users.	p95 rising with saturation means workers are already waiting on session acquisition.
Statement Error Rate %	Where exhaustion finally surfaces as errors.	Error rate climbing after saturation equals connections now being refused.
Memory Usage %	Each connection consumes memory.	High saturation plus high memory means raising the ceiling is unsafe; add nodes instead.
Statements per Second (live)	The workload driving connection demand.	QPS flat while connections climb confirms a leak rather than load.
CockroachDB Health Score	The executive composite that this alert feeds.	A sustained pool breach drags the health score down even while ranges stay healthy.
CRDB Pool Saturation vs Traffic Burst	The cross-channel view tying saturation to front-end traffic.	Saturation breach during a traffic burst is expected; during quiet traffic it is a leak.

Reconciling against the source

Where to look natively:

DB Console SQL dashboard (“Open SQL Sessions” panel) for the live sql.conns series per node.
SHOW SESSIONS; or SELECT count(*) FROM crdb_internal.cluster_sessions; for the exact open-connection count at a moment.
SHOW CLUSTER SETTING server.max_connections_per_gateway; to confirm the ceiling the saturation percentage divides by.
CockroachDB Cloud Metrics tab plots the same connection series, and the cluster Overview shows the plan connection limit.
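For a quick manual cross-check, the two SQL statements above can be combined into a saturation figure. The sketch below assumes a Python DB-API style cursor (an object with `execute` and `fetchone`) already connected to a self-hosted cluster, and takes the number of gateway nodes as an input. The function name `read_saturation` is hypothetical. It also assumes that a non-positive value of the setting means no usable limit is configured, in which case no percentage is returned. Note that the session count includes the session running the check itself.

```python
from typing import Optional, Tuple


def read_saturation(cursor, gateway_nodes: int) -> Tuple[int, Optional[float]]:
    """Return (open connections, saturation %) using the native queries.

    `cursor` is any DB-API style cursor connected to the cluster.
    Saturation is None when no positive per-gateway limit is configured.
    """
    cursor.execute("SELECT count(*) FROM crdb_internal.cluster_sessions")
    open_conns = int(cursor.fetchone()[0])

    cursor.execute("SHOW CLUSTER SETTING server.max_connections_per_gateway")
    per_gateway = int(cursor.fetchone()[0])

    if per_gateway <= 0 or gateway_nodes <= 0:
        return open_conns, None  # no ceiling to divide by

    ceiling = per_gateway * gateway_nodes
    return open_conns, open_conns / ceiling * 100
```

With 2,360 open sessions, a setting of 500, and 5 gateway nodes, this returns roughly 94.4%, matching the worked example.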

Why our number may legitimately differ from the native view:

Reason	Direction	Why
Ceiling source	Either way	Vortex IQ divides by the configured `max_connections_per_gateway` ceiling (or the Cloud plan limit). If the setting was changed but not reloaded, the native panel may compute against a stale denominator.
Per-node vs cluster	Vortex IQ may read lower	This card uses cluster-wide saturation; the DB Console panel can show a single hot node at a higher local percentage.
Poll cadence	Brief lag	Connection counts move per second. A polled saturation value can trail the instantaneous DB Console graph by one poll interval.
Sustain filter	Vortex IQ fires less often	The native graph shows every momentary spike to 90%+; this card only fires on a sustained 1-minute breach.
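The per-node versus cluster difference is easiest to see with numbers. The Python sketch below uses hypothetical per-node connection counts (not taken from the worked example) on a 5-node cluster with a 500-per-gateway limit, and shows two nodes above 90% locally while the cluster-wide figure stays well under the alert line.

```python
from typing import Dict, Tuple


def cluster_and_node_saturation(
    per_node_conns: Dict[str, int], per_gateway_limit: int
) -> Tuple[float, Dict[str, float]]:
    """Return (cluster-wide saturation %, per-node saturation %)."""
    ceiling = per_gateway_limit * len(per_node_conns)
    cluster_pct = sum(per_node_conns.values()) / ceiling * 100
    node_pct = {
        node: conns / per_gateway_limit * 100
        for node, conns in per_node_conns.items()
    }
    return cluster_pct, node_pct


# Hypothetical distribution: nodes n2 and n4 take most of the traffic.
hypothetical = {"n1": 300, "n2": 490, "n3": 310, "n4": 480, "n5": 320}

cluster_pct, node_pct = cluster_and_node_saturation(hypothetical, 500)
print(f"cluster-wide: {cluster_pct:.0f}%")
for node, pct in node_pct.items():
    flag = "  <- above 90% locally" if pct > 90 else ""
    print(f"{node}: {pct:.0f}%{flag}")
```

In this case the card would not fire, but the DB Console view of n2 or n4 would look alarming, which is why the two views can legitimately disagree.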

Cross-connector reconciliation:

Card	Expected relationship	What causes divergence
CRDB Pool Saturation vs Traffic Burst	A firing should coincide with a front-end traffic burst.	A firing with no burst points to a connection leak in an application service, not real demand.
CRDB Statements Spike vs Ecom Order Rate	Saturation breaches usually accompany a statements spike.	Connections climbing without a statements spike means idle sessions are accumulating, not active queries.
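The divergence checks in the table above amount to comparing how fast connections grew against how fast the workload grew. The Python sketch below is a rough heuristic, not how Vortex IQ classifies firings: the function name, the 0.25 ratio, and the statement-rate figures are all assumptions chosen for illustration.

```python
def classify_firing(
    conns_before: int,
    conns_now: int,
    rate_before: float,
    rate_now: float,
    ratio: float = 0.25,  # hypothetical cut-off, tune for your workload
) -> str:
    """Compare relative growth in connections with growth in statement rate."""
    if conns_before <= 0 or rate_before <= 0:
        raise ValueError("baseline values must be positive")
    conn_growth = (conns_now - conns_before) / conns_before
    rate_growth = (rate_now - rate_before) / rate_before
    # Connections climbing much faster than the workload suggests sessions
    # are accumulating rather than serving queries.
    if conn_growth > 0 and rate_growth < conn_growth * ratio:
        return "possible leak or idle-session build-up"
    return "consistent with real demand"


# Connection counts from the worked example; statement rates are hypothetical.
print(classify_firing(1420, 2360, rate_before=4000, rate_now=4100))
print(classify_firing(1420, 2360, rate_before=4000, rate_now=6500))
```

The first call pairs a 66% rise in connections with an almost flat statement rate, so it is flagged as a possible leak; the second shows the workload rising in step and reads as demand.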

Known limitations / FAQs

My connection count looks high but this card has not fired. Why?
The card alerts on saturation (count divided by the configured ceiling), not on the raw count, and only after a sustained 1-minute breach above 90%. A high absolute count that is still under 90% of your ceiling, or a brief spike that recovers within a minute, will not fire. Check Connection Pool Saturation % for the live percentage.

Should I just raise server.max_connections_per_gateway whenever this fires?
Only if node memory allows. Each connection consumes server memory, so raising the ceiling on a memory-constrained cluster trades a connection wall for an out-of-memory risk. Read Memory Usage % first. The durable fix for repeated firings is a connection pooler (pgbouncer-style) in front of the cluster so thousands of app threads multiplex onto a bounded server-side pool.

On CockroachDB Cloud I cannot change max_connections_per_gateway. What is the ceiling then?
On Cloud the connection limit is set by your plan and enforced by the managed proxy, not by the cluster setting. Vortex IQ divides by that plan limit. If you are repeatedly saturating it, the levers are: add a connection pooler, reduce client pool sizes, or move to a larger plan tier.

The alert fired but our application is not throwing errors yet. Is it a false alarm?
No. 90% is the early-warning line precisely so you can act before exhaustion. At 90%+ you have little headroom; the next traffic increment or deploy can push you to 100%, at which point new connections are refused with “too many clients” errors. Treat the firing as a window to act, not as proof that damage has already happened.

Why a 1-minute sustain instead of firing immediately at 90%?
Connection counts are inherently spiky: a batch import or a BI refresh can open dozens of sessions briefly and release them. Firing on every momentary spike would bury the genuine “the pool is full and staying full” signal. The 1-minute sustain confirms the condition has settled and is not transient.

Can a single hot gateway node trigger this even if the cluster average is under 90%?
This card evaluates cluster-wide saturation, so a single hot node averaging out below 90% will not fire it. To catch per-node hotspots, watch Connection Pool Saturation %, which exposes the per-node spread, and check whether your load balancer is distributing connections evenly across gateways.

What is the relationship between this card and pool saturation on the client side?
This card measures the server-side ceiling (CockroachDB’s view of open sessions). Your application’s client pool (HikariCP, pgbouncer, etc.) has its own limit. Client-side pool exhaustion can occur even while the server is comfortable, and vice versa. When this server-side card fires, also inspect client pool metrics; the two together tell you whether to widen the server ceiling or resize client pools.

Tracked live in Vortex IQ Nerve Centre

Connection Pool at >90% Saturation is one of hundreds of KPI pulses Vortex IQ tracks across CockroachDB and 70+ other ecommerce connectors. Nerve Centre runs the detection layer; Vortex Mind investigates the cause when something moves; Ask Viq lets you interrogate any number in plain English. Start for free or book a demo to see this metric running on your own data.
