At a glance
This is the alert card that fires when your PostgreSQL connection pool crosses 90% ofmax_connectionsand holds there. In plain terms: the database is running out of places to put new connections. When the last few slots fill, new connection attempts get rejected with “too many clients already” (FATAL: sorry, too many clients already), and every application that talks to the database starts throwing connection errors at the same time. For an SRE or DBA this is one of the highest-signal alerts on the board, because pool exhaustion is the single most common cause of a total, instant database outage that looks like a crash but is not one.
| What it tracks | Whether (active + idle + idle-in-transaction) / max_connections has crossed 90% and stayed there for at least one minute. The card surfaces the breach as an alert entry, not a continuous gauge. |
| Data source | Backend count from pg_stat_activity (one row per backend) compared against the max_connections GUC. On instances fronted by PgBouncer or pgpool-II, the proxy’s own pool counters are read in addition to the server-side count. See the live gauge sibling Connection Pool Saturation % for the underlying calculation. |
| Time window | RT (real-time, evaluated on each poll, roughly every 60 seconds). |
| Alert trigger | >90% sustained 1m. A single momentary spike does not fire; the breach must hold across the one-minute evaluation window to avoid paging on transient connection storms that drain on their own. |
| Roles | dba, platform, sre |
Calculation
The breach condition is computed directly from the two numbers PostgreSQL itself exposes:SELECT count(*) FROM pg_stat_activity (all states: active, idle, idle in transaction, idle in transaction (aborted), plus background workers that count against the limit). The denominator is SHOW max_connections (or SELECT setting FROM pg_settings WHERE name = 'max_connections').
A subtlety that matters: PostgreSQL reserves a handful of slots above the application-visible limit via superuser_reserved_connections (default 3) and, from PostgreSQL 16, reserved_connections. Those reserved slots exist precisely so a DBA can still log in to fix the problem when the pool is full. The card measures saturation against the full max_connections, so a reading of “100%” still leaves the superuser slots free for you to connect and triage.
When a connection pooler is present (PgBouncer in transaction mode is the common case), there are two distinct pools to watch: the client-side pool the application fills, and the server-side pool PgBouncer maintains against PostgreSQL. The card reads both and fires on whichever crosses 90% first, because exhaustion at either layer produces the same application-visible failure.
Worked example
A platform team runs a primary PostgreSQL 15 instance on a managed service backing a fashion retailer’s order and inventory APIs.max_connections is set to 200. PgBouncer sits in front in transaction-pooling mode. Snapshot taken on 14 Apr 26 at 20:05 BST, the start of a flash-sale window.
| Metric | Reading | Threshold | State |
|---|---|---|---|
Backends in pg_stat_activity | 187 | n/a | rising |
max_connections | 200 | n/a | fixed |
| Server-side saturation | 93.5% | 90% | BREACH |
| Duration above 90% | 2m 10s | 1m | BREACH |
idle in transaction backends | 41 | n/a | suspicious |
- The pool is nearly full and not draining. At 187 of 200 with traffic still climbing, the next minute will hit the wall. New API requests are about to start failing with
FATAL: sorry, too many clients already. - 41 of those backends are idle in transaction. That is the tell. These are connections that opened a transaction (
BEGIN) and then went quiet without committing or rolling back, often because an application thread is blocked on an external call while holding a database connection open. Cross-reference Idle-in-Transaction Backends. - This is a pooling problem, not a query problem. Latency (see Query Latency p95 (ms)) is still healthy at 38ms. The database can do the work; it just cannot accept the connections.
- Pool exhaustion is almost never about traffic alone. A healthy app with correct connection handling will queue at the pooler, not flood the server. When this fires, the first suspect is a connection leak (idle-in-transaction), not “we got popular”.
- The fix is usually configuration, not a bigger box. Raising
max_connectionsto paper over a leak makes the eventual failure worse, because each backend costs memory (work_memper sort, plus per-backend overhead). The durable fix is a pooler sized correctly plusidle_in_transaction_session_timeout. - This alert is a leading indicator of a total outage. Unlike a slow query, which degrades one endpoint, a full pool takes the whole database offline for every caller at once. Treat a sustained breach as a near-page-1 event.
Sibling cards
| Card | Why pair it with Connection Pool at >90% Saturation | What the combination tells you |
|---|---|---|
| Connection Pool Saturation % | The continuous gauge this alert is built on. | The gauge shows the climb; this card marks the moment it became dangerous. |
| Idle-in-Transaction Backends | The most common cause of slow-building saturation. | High idle-in-tx plus rising saturation equals a connection leak, not real load. |
| Connections In Use | The raw backend count behind the percentage. | Confirms whether the breach is many small users or a few stuck sessions. |
| Connection Errors (24h) | The downstream symptom once the pool is full. | A spike here right after the breach confirms rejected connections reached clients. |
| Queries per Second (live) | Distinguishes a real traffic surge from a leak. | Saturation up with flat QPS equals a leak; both up together equals genuine demand. |
| PgBouncer Pool Saturation vs Traffic Burst | The cross-channel view tying pool pressure to storefront traffic. | Confirms whether the burst is paying customers or scrapers. |
| PostgreSQL Health Score | The composite that drops sharply when the pool saturates. | One sustained breach pulls the composite below its healthy band. |
Reconciling against the source
Where to look in PostgreSQL’s own tooling:RunWhy our number may legitimately differ from PostgreSQL’s own view:SELECT count(*) FROM pg_stat_activity;and compare againstSHOW max_connections;. The ratio is the saturation the card reports. Break it down by state withSELECT state, count(*) FROM pg_stat_activity GROUP BY state;to see how much of the pool is genuinely working versus idle or stuck. On PgBouncer, connect to the admin console (psql -p 6432 pgbouncer) and runSHOW POOLS;andSHOW CLIENTS;to see client-side queueing the server-side count cannot show. On a managed service (Amazon RDS / Aurora, Google Cloud SQL, Azure Database for PostgreSQL), theDatabaseConnectionsmetric in the provider’s monitoring console is the same number, sampled at the provider’s interval.
| Reason | Direction | Why |
|---|---|---|
| Poll timing | Brief lag | The card polls roughly every 60s; a pg_stat_activity query you run by hand is instantaneous. A fast spike can show on one and not the other. |
| Reserved slots | Card may read 100% while you can still connect | Saturation is measured against full max_connections; superuser_reserved_connections keeps DBA slots free above the application ceiling. |
| Pooler layer | Card may fire when the server looks calm | If PgBouncer’s client pool saturates while its server pool is small, the server-side count stays low but clients are rejected. The card reads both layers. |
| Background workers | Card slightly higher | Parallel workers, autovacuum workers, and logical replication workers occupy pg_stat_activity rows and count against the limit. |
| Managed-service sampling | Provider console smoother | RDS / Cloud SQL sample DatabaseConnections on a coarser interval and may miss a sub-minute breach the card catches. |
Known limitations / FAQs
The card fires at 90% but my app is fine. Why page on that? 90% sustained for a minute is a near-miss, not a current outage, and that is the point. At 100% you are already rejecting connections and customers are seeing errors. The alert is deliberately a leading indicator so you can act in the gap between “nearly full” and “rejecting clients”. If your workload routinely runs at 92% safely, raise the sensitivity threshold for this instance in the Sensitivity tab rather than ignoring the page. Should I just increasemax_connections to stop this firing?
Almost never as the first move. Each backend reserves memory (per-backend overhead plus work_mem for sorts and hashes), so a higher ceiling raises your worst-case memory footprint and can trade a connection outage for an out-of-memory one. The correct fix for most workloads is a connection pooler (PgBouncer in transaction mode) sized so the application never opens more server-side connections than PostgreSQL can comfortably hold, plus idle_in_transaction_session_timeout to reclaim leaked sessions.
What is the difference between this and the live saturation gauge?
Connection Pool Saturation % is a continuous gauge you glance at; it is green most of the time. This card is an alert entry that only appears when the gauge crosses 90% and holds. Use the gauge for trend, this card for “wake me up”.
My instance is behind PgBouncer. Which pool does the alert watch?
Both. PgBouncer maintains a client-facing pool and a server-facing pool against PostgreSQL. Exhaustion at either layer produces the same too many clients symptom for the application, so the card fires on whichever crosses 90% first. Check SHOW POOLS; in the PgBouncer admin console to see which layer is the bottleneck.
Connections jumped to 95% for ten seconds and the card did not fire. Is it broken?
No, that is the sustained 1m qualifier working as designed. Connection storms (a deploy restart, a pod rollout, a brief retry burst) routinely spike the count for a few seconds and drain on their own. Paging on those would be noise. The breach must hold across the one-minute evaluation window.
Does autovacuum or replication count against the pool?
Yes. Autovacuum workers, parallel query workers, and walsender processes for streaming replication all occupy rows in pg_stat_activity and count against max_connections. On a busy instance with several replicas and parallel queries enabled, the non-application share can be meaningful, so size the ceiling with headroom for them.
Can I see which application is holding the connections?
Yes, if your application sets application_name on its connections (most drivers do, or you can set it in the connection string). Run SELECT application_name, count(*) FROM pg_stat_activity GROUP BY application_name ORDER BY 2 DESC; to attribute the pool by service. This is the fastest way to find which deploy or service introduced a leak.