Connection Pool at >90% Saturation, PostgreSQL

Card class: Hero • Category: Nerve Centre

At a glance

This is the alert card that fires when your PostgreSQL connection pool crosses 90% of max_connections and holds there. In plain terms: the database is running out of places to put new connections. When the last few slots fill, new connection attempts get rejected with “too many clients already” (FATAL: sorry, too many clients already), and every application that talks to the database starts throwing connection errors at the same time. For an SRE or DBA this is one of the highest-signal alerts on the board, because pool exhaustion is the single most common cause of a total, instant database outage that looks like a crash but is not one.


What it tracks	Whether `(active + idle + idle-in-transaction) / max_connections` has crossed 90% and stayed there for at least one minute. The card surfaces the breach as an alert entry, not a continuous gauge.
Data source	Backend count from `pg_stat_activity` (one row per backend) compared against the `max_connections` GUC. On instances fronted by PgBouncer or pgpool-II, the proxy’s own pool counters are read in addition to the server-side count. See the live gauge sibling Connection Pool Saturation % for the underlying calculation.
Time window	`RT` (real-time, evaluated on each poll, roughly every 60 seconds).
Alert trigger	`>90% sustained 1m`. A single momentary spike does not fire; the breach must hold across the one-minute evaluation window to avoid paging on transient connection storms that drain on their own.
Roles	dba, platform, sre

Calculation

The breach condition is computed directly from the two numbers PostgreSQL itself exposes:

saturation = count(pg_stat_activity backends) / max_connections
fire when:  saturation > 0.90  AND  the condition holds for >= 60 seconds

The backend count is SELECT count(*) FROM pg_stat_activity (all states: active, idle, idle in transaction, idle in transaction (aborted), plus background workers that count against the limit). The denominator is SHOW max_connections (or SELECT setting FROM pg_settings WHERE name = 'max_connections'). A subtlety that matters: PostgreSQL reserves a handful of slots above the application-visible limit via superuser_reserved_connections (default 3) and, from PostgreSQL 16, reserved_connections. Those reserved slots exist precisely so a DBA can still log in to fix the problem when the pool is full. The card measures saturation against the full max_connections, so a reading of “100%” still leaves the superuser slots free for you to connect and triage. When a connection pooler is present (PgBouncer in transaction mode is the common case), there are two distinct pools to watch: the client-side pool the application fills, and the server-side pool PgBouncer maintains against PostgreSQL. The card reads both and fires on whichever crosses 90% first, because exhaustion at either layer produces the same application-visible failure.

Worked example

A platform team runs a primary PostgreSQL 15 instance on a managed service backing a fashion retailer’s order and inventory APIs. max_connections is set to 200. PgBouncer sits in front in transaction-pooling mode. Snapshot taken on 14 Apr 26 at 20:05 BST, the start of a flash-sale window.

Metric	Reading	Threshold	State
Backends in `pg_stat_activity`	187	n/a	rising
`max_connections`	200	n/a	fixed
Server-side saturation	93.5%	90%	BREACH
Duration above 90%	2m 10s	1m	BREACH
`idle in transaction` backends	41	n/a	suspicious

The card fires. The headline reads Connection Pool at 93.5% (BREACH, 2m). The on-call DBA reads three things immediately:

The pool is nearly full and not draining. At 187 of 200 with traffic still climbing, the next minute will hit the wall. New API requests are about to start failing with FATAL: sorry, too many clients already.
41 of those backends are idle in transaction. That is the tell. These are connections that opened a transaction (BEGIN) and then went quiet without committing or rolling back, often because an application thread is blocked on an external call while holding a database connection open. Cross-reference Idle-in-Transaction Backends.
This is a pooling problem, not a query problem. Latency (see Query Latency p95 (ms)) is still healthy at 38ms. The database can do the work; it just cannot accept the connections.

Triage path during the breach:
  1. Identify the idle-in-tx offenders:
     SELECT pid, state, now() - state_change AS idle_for, query
     FROM pg_stat_activity
     WHERE state = 'idle in transaction'
     ORDER BY state_change ASC;
  2. If a deploy just shipped, suspect a code path that forgot to commit.
  3. Short-term relief: terminate the oldest idle-in-tx backends:
     SELECT pg_terminate_backend(pid)
     FROM pg_stat_activity
     WHERE state = 'idle in transaction'
       AND now() - state_change > interval '5 minutes';
  4. Set idle_in_transaction_session_timeout so the DB self-heals next time.
  5. Confirm PgBouncer pool_size and the app's own pool size are not summed
     above what PostgreSQL can serve.

Three takeaways:

Pool exhaustion is almost never about traffic alone. A healthy app with correct connection handling will queue at the pooler, not flood the server. When this fires, the first suspect is a connection leak (idle-in-transaction), not “we got popular”.
The fix is usually configuration, not a bigger box. Raising max_connections to paper over a leak makes the eventual failure worse, because each backend costs memory (work_mem per sort, plus per-backend overhead). The durable fix is a pooler sized correctly plus idle_in_transaction_session_timeout.
This alert is a leading indicator of a total outage. Unlike a slow query, which degrades one endpoint, a full pool takes the whole database offline for every caller at once. Treat a sustained breach as a near-page-1 event.

Sibling cards

Card	Why pair it with Connection Pool at >90% Saturation	What the combination tells you
Connection Pool Saturation %	The continuous gauge this alert is built on.	The gauge shows the climb; this card marks the moment it became dangerous.
Idle-in-Transaction Backends	The most common cause of slow-building saturation.	High idle-in-tx plus rising saturation equals a connection leak, not real load.
Connections In Use	The raw backend count behind the percentage.	Confirms whether the breach is many small users or a few stuck sessions.
Connection Errors (24h)	The downstream symptom once the pool is full.	A spike here right after the breach confirms rejected connections reached clients.
Queries per Second (live)	Distinguishes a real traffic surge from a leak.	Saturation up with flat QPS equals a leak; both up together equals genuine demand.
PgBouncer Pool Saturation vs Traffic Burst	The cross-channel view tying pool pressure to storefront traffic.	Confirms whether the burst is paying customers or scrapers.
PostgreSQL Health Score	The composite that drops sharply when the pool saturates.	One sustained breach pulls the composite below its healthy band.

Reconciling against the source

Where to look in PostgreSQL’s own tooling:

Run SELECT count(*) FROM pg_stat_activity; and compare against SHOW max_connections;. The ratio is the saturation the card reports. Break it down by state with SELECT state, count(*) FROM pg_stat_activity GROUP BY state; to see how much of the pool is genuinely working versus idle or stuck. On PgBouncer, connect to the admin console (psql -p 6432 pgbouncer) and run SHOW POOLS; and SHOW CLIENTS; to see client-side queueing the server-side count cannot show. On a managed service (Amazon RDS / Aurora, Google Cloud SQL, Azure Database for PostgreSQL), the DatabaseConnections metric in the provider’s monitoring console is the same number, sampled at the provider’s interval.

Why our number may legitimately differ from PostgreSQL’s own view:

Reason	Direction	Why
Poll timing	Brief lag	The card polls roughly every 60s; a `pg_stat_activity` query you run by hand is instantaneous. A fast spike can show on one and not the other.
Reserved slots	Card may read 100% while you can still connect	Saturation is measured against full `max_connections`; `superuser_reserved_connections` keeps DBA slots free above the application ceiling.
Pooler layer	Card may fire when the server looks calm	If PgBouncer’s client pool saturates while its server pool is small, the server-side count stays low but clients are rejected. The card reads both layers.
Background workers	Card slightly higher	Parallel workers, autovacuum workers, and logical replication workers occupy `pg_stat_activity` rows and count against the limit.
Managed-service sampling	Provider console smoother	RDS / Cloud SQL sample `DatabaseConnections` on a coarser interval and may miss a sub-minute breach the card catches.

Known limitations / FAQs

The card fires at 90% but my app is fine. Why page on that? 90% sustained for a minute is a near-miss, not a current outage, and that is the point. At 100% you are already rejecting connections and customers are seeing errors. The alert is deliberately a leading indicator so you can act in the gap between “nearly full” and “rejecting clients”. If your workload routinely runs at 92% safely, raise the sensitivity threshold for this instance in the Sensitivity tab rather than ignoring the page. Should I just increase max_connections to stop this firing? Almost never as the first move. Each backend reserves memory (per-backend overhead plus work_mem for sorts and hashes), so a higher ceiling raises your worst-case memory footprint and can trade a connection outage for an out-of-memory one. The correct fix for most workloads is a connection pooler (PgBouncer in transaction mode) sized so the application never opens more server-side connections than PostgreSQL can comfortably hold, plus idle_in_transaction_session_timeout to reclaim leaked sessions. What is the difference between this and the live saturation gauge? Connection Pool Saturation % is a continuous gauge you glance at; it is green most of the time. This card is an alert entry that only appears when the gauge crosses 90% and holds. Use the gauge for trend, this card for “wake me up”. My instance is behind PgBouncer. Which pool does the alert watch? Both. PgBouncer maintains a client-facing pool and a server-facing pool against PostgreSQL. Exhaustion at either layer produces the same too many clients symptom for the application, so the card fires on whichever crosses 90% first. Check SHOW POOLS; in the PgBouncer admin console to see which layer is the bottleneck. Connections jumped to 95% for ten seconds and the card did not fire. Is it broken? No, that is the sustained 1m qualifier working as designed. Connection storms (a deploy restart, a pod rollout, a brief retry burst) routinely spike the count for a few seconds and drain on their own. Paging on those would be noise. The breach must hold across the one-minute evaluation window. Does autovacuum or replication count against the pool? Yes. Autovacuum workers, parallel query workers, and walsender processes for streaming replication all occupy rows in pg_stat_activity and count against max_connections. On a busy instance with several replicas and parallel queries enabled, the non-application share can be meaningful, so size the ceiling with headroom for them. Can I see which application is holding the connections? Yes, if your application sets application_name on its connections (most drivers do, or you can set it in the connection string). Run SELECT application_name, count(*) FROM pg_stat_activity GROUP BY application_name ORDER BY 2 DESC; to attribute the pool by service. This is the fastest way to find which deploy or service introduced a leak.

Tracked live in Vortex IQ Nerve Centre

Connection Pool at >90% Saturation is one of hundreds of KPI pulses Vortex IQ tracks across PostgreSQL and 70+ other ecommerce connectors. Nerve Centre runs the detection layer; Vortex Mind investigates the cause when something moves; Ask Viq lets you interrogate any number in plain English. Start for free or book a demo to see this metric running on your own data.

​At a glance

​Calculation

​Worked example

​Sibling cards

​Reconciling against the source

​Known limitations / FAQs

​Tracked live in Vortex IQ Nerve Centre