Connection Pool at >90% Saturation, ClickHouse

Card class: Hero • Category: Nerve Centre

At a glance

The alert feed for moments when the ClickHouse connection pool runs hot: concurrent connections sit above 90% of the configured max_connections ceiling for a sustained minute. When the pool saturates, new client connections are refused (you see Too many simultaneous queries or connection timeouts), so dashboards stall and ingest jobs back up. This card lists each saturation event with its start time, peak utilisation, and duration, so the on-call DBA can see whether the cause was a query storm, a connection leak, or genuine capacity demand.


Data source	Concurrent connection counts from `system.metrics` (the `TCPConnection`, `HTTPConnection`, `MySQLConnection`, and `PostgreSQLConnection` gauges) and `system.asynchronous_metrics`, compared against the server’s `max_connections` setting. Each breach of the 90% ratio sustained for one minute is recorded as one alert row.
What it tracks	Saturation events, not a continuous gauge. The live percentage lives on Connection Pool Saturation %; this card is the incident history of when that gauge crossed the danger line.
Why it matters	At >90% the pool has almost no headroom. The next connection burst (a deploy, a BI tool refreshing every panel at once, a retry loop) tips it to 100% and ClickHouse begins refusing connections. Refused connections are invisible to ClickHouse’s own query metrics because the query never starts, so this card is often the only place the event is visible.
Time window	`RT` (real-time; the detector evaluates the live ratio every cycle).
Alert trigger	`>90% sustained 1m`. Utilisation must hold above 90% of `max_connections` for a full minute to fire, which filters out harmless sub-minute spikes.
Roles	dba, platform, sre

Calculation

The detector computes pool utilisation as live concurrent connections over the configured ceiling:

saturation_pct = (TCPConnection + HTTPConnection + MySQLConnection + PostgreSQLConnection)
                 / max_connections * 100

The connection gauges come from system.metrics:

SELECT metric, value
FROM system.metrics
WHERE metric IN ('TCPConnection','HTTPConnection','MySQLConnection','PostgreSQLConnection')

and the ceiling from the server configuration:

SELECT value FROM system.server_settings WHERE name = 'max_connections'
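As a minimal sketch of the arithmetic, assuming the results of the two queries above have already been loaded into a plain dict and an integer: the function name `saturation_pct` and the sample gauge split are illustrative and are not the detector's actual code.

```python
# The four gauges the formula sums, as named in system.metrics.
CONNECTION_GAUGES = (
    "TCPConnection",
    "HTTPConnection",
    "MySQLConnection",
    "PostgreSQLConnection",
)


def saturation_pct(metrics: dict[str, int], max_connections: int) -> float:
    """Return pool utilisation as a percentage of max_connections."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    # A gauge missing from the dict is treated as zero (assumption).
    live = sum(metrics.get(name, 0) for name in CONNECTION_GAUGES)
    return live / max_connections * 100


# Hypothetical snapshot: 974 live connections against a ceiling of 1000.
sample = {
    "TCPConnection": 610,
    "HTTPConnection": 352,
    "MySQLConnection": 12,
    "PostgreSQLConnection": 0,
}
print(round(saturation_pct(sample, 1000), 1))  # 97.4
```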

An alert row is created when saturation_pct > 90 holds continuously for 60 seconds. The “sustained 1 minute” gate is deliberate: connection counts are spiky by nature (every client connect/disconnect moves the gauge), and a momentary touch of 91% is not an incident. A full minute above the line means the pool is genuinely starved of headroom, not just briefly busy. Each row records the start time, the peak utilisation reached, and the duration until utilisation fell back below 90%.
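The sustain gate can be sketched as a small state machine over time-ordered samples. This is an illustration under stated assumptions, not the detector's implementation: the names `SaturationEvent` and `detect_events` are hypothetical, samples are `(timestamp_seconds, saturation_pct)` pairs, and the sketch records only completed breaches, whereas a live detector would also fire as soon as the 60-second mark passes.

```python
from dataclasses import dataclass

THRESHOLD_PCT = 90.0   # alert line from the card definition
SUSTAIN_SECONDS = 60   # breach must hold this long to count


@dataclass
class SaturationEvent:
    start: float      # timestamp of the first sample above the threshold
    peak_pct: float   # highest utilisation seen during the breach
    duration: float   # seconds until utilisation fell back to 90% or below


def detect_events(samples):
    """Return completed saturation events from time-ordered samples."""
    events = []
    start = peak = None
    for ts, pct in samples:
        if pct > THRESHOLD_PCT:
            if start is None:
                start, peak = ts, pct      # breach begins
            else:
                peak = max(peak, pct)      # breach continues
        elif start is not None:
            # Breach ended: keep it only if it held for the full window.
            if ts - start >= SUSTAIN_SECONDS:
                events.append(SaturationEvent(start, peak, ts - start))
            start = peak = None
    return events


# Hypothetical samples: a 10-second touch of 91%, then a 90-second breach.
demo = [(0, 85.0), (10, 91.0), (20, 89.0), (30, 92.0),
        (60, 97.0), (100, 95.0), (120, 88.0)]
for event in detect_events(demo):
    print(event)
```

Only the second breach produces an event; the 10-second touch of 91% is discarded by the gate.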

Worked example

A platform team runs a ClickHouse instance with max_connections = 1000 serving both an ingest pipeline and a fleet of BI dashboards. Snapshot of the alert feed on 22 Apr 26.

Started	Peak utilisation	Duration	Trigger
22 Apr 26 08:31 BST	97% (974 / 1000)	6m 12s	Weekday 08:30 dashboard refresh storm
21 Apr 26 14:02 BST	93% (931 / 1000)	2m 40s	BI tool retry loop after a slow query
19 Apr 26 23:48 BST	99% (994 / 1000)	11m 30s	Connection leak in a reporting service

The Nerve Centre headline reads 1 active saturation alert, peak 97%, outlined red. The DBA reads the feed:

The 08:31 event is a recurring pattern. It lands every weekday at the same minute. This is the classic “everyone opens their dashboard at the start of the day” storm: dozens of BI panels each opening a connection simultaneously. It self-resolves in minutes but each occurrence risks refused connections at the peak.
The 19 Apr event is the dangerous one. It ran for 11.5 minutes and peaked at 99%. A duration that long with no obvious traffic trigger points to a connection leak: a service opening connections and not closing them. Left unchecked, the next one hits 100% and starts refusing connections.
None of these show up in query latency. A refused connection never becomes a query, so Query Latency p95 (ms) can look perfectly healthy while clients are being turned away. This card is the only place the impact is visible.

Triaging the 19 Apr leak:
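  - Confirm which clients are running queries:
      SELECT client_name, count() FROM system.processes GROUP BY client_name ORDER BY count() DESC
  - system.processes lists running queries only; a connection gauge far above its row count means connections are being held idle
  - A single client holding hundreds of idle connections = leak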
  - Short-term: raise max_connections to add headroom (buys time, not a fix)
  - Real fix: enable connection pooling / reuse in the offending client,
              or set an idle-connection timeout so leaked connections drop
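The first triage step can be combined with the connection gauges to get a rough idea of how many connections are idle. The sketch below is an approximation under loud assumptions: `summarise_clients` is a hypothetical helper, its input is the `(client_name, count)` rows from the query above plus the summed gauges from system.metrics, and it treats each running query as occupying one connection, which is only roughly true.

```python
def summarise_clients(process_rows, live_connections):
    """Rank clients by running queries and estimate idle connections.

    process_rows: (client_name, running_queries) tuples from system.processes.
    live_connections: summed connection gauges from system.metrics.
    """
    ranked = sorted(process_rows, key=lambda row: row[1], reverse=True)
    busy = sum(count for _, count in ranked)
    # The two readings are taken at slightly different instants,
    # so clamp the difference at zero.
    idle_estimate = max(live_connections - busy, 0)
    return {"top_clients": ranked[:3], "busy": busy, "idle_estimate": idle_estimate}


# Hypothetical readings taken during a 994-connection peak.
rows = [("bi-tool", 40), ("reporting-svc", 12), ("ingest", 25)]
summary = summarise_clients(rows, 994)
print(summary["busy"], summary["idle_estimate"])  # 77 917
```

A large idle estimate alongside few running queries is consistent with connections being held open without doing work; it does not by itself identify which client is holding them.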

For the recurring 08:31 storm the fix is on the client side: stagger dashboard auto-refresh schedules or put a connection pool in front of the BI tools so panels share connections instead of each opening their own. Raising max_connections adds headroom but does not stop a genuine leak from eventually exhausting any ceiling.

Three takeaways:

Saturation is invisible in query metrics. Refused connections never become queries, so latency and error-rate cards can look green while clients are locked out. Trust this card as the canonical view of pool starvation.
Duration distinguishes traffic from leaks. A short spike at a predictable time is a refresh storm; a long, slowly climbing event with no traffic trigger is a leak. Read the duration column, not just the peak.
Headroom is the real metric. A pool living at 88% is one burst away from refusing connections. Pair this alert with Connection Pool Saturation % and keep steady-state utilisation well below 70%.
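The duration heuristic from the second takeaway could be expressed as a first-pass label on each alert row. This is a sketch only: `classify_event` is a hypothetical helper, and the 600-second cut-off for a "long" event is an assumed value chosen for illustration, not a threshold the card defines.

```python
def classify_event(duration_seconds: float,
                   has_traffic_trigger: bool,
                   long_event_seconds: float = 600) -> str:
    """Label a saturation event using duration and trigger presence."""
    # Long event with no traffic explanation: treat as a leak candidate.
    if duration_seconds >= long_event_seconds and not has_traffic_trigger:
        return "possible leak"
    # Any event with a matching traffic trigger: treat as a burst.
    if has_traffic_trigger:
        return "traffic burst"
    # Short and unexplained: leave for a human to read.
    return "unclassified"


# Durations taken from the worked example (6m 12s and 11m 30s).
print(classify_event(372, True))    # traffic burst
print(classify_event(690, False))   # possible leak
```

The label is a prompt for investigation, not a diagnosis; the peak and the shape of the climb still need to be read alongside it.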

Sibling cards

Card	Why pair it with this alert	What the combination tells you
Connection Pool Saturation %	The live gauge this alert watches.	The gauge shows current headroom; this card shows when headroom ran out.
Connections In Use	The raw concurrent-connection count.	Connections climbing with no query growth point to a leak, not load.
Queries per Second (live)	The query throughput driving connections.	Saturation with flat QPS means connections are being held open without doing work.
ClickHouse Pool Saturation vs Traffic Burst	The cross-channel view tying saturation to storefront traffic.	Saturation during a traffic burst indicates genuine demand; saturation without one indicates an internal problem.
Query Error Rate %	Errors from queries that did start.	Saturation plus rising errors suggests the pool is refusing some clients while others time out mid-query.
Failed Queries (24h)	The 24-hour failure tally.	Spikes here aligned with saturation windows confirm the pool is the cause.
ClickHouse Health Score	The composite that weights pool saturation.	A sustained saturation event drags the composite below 70.

Reconciling against the source

Where to look in ClickHouse’s own tooling:

Check live connection counts against the ceiling in clickhouse-client:
SELECT metric, value FROM system.metrics WHERE metric IN ('TCPConnection','HTTPConnection','MySQLConnection','PostgreSQLConnection');
SELECT value FROM system.server_settings WHERE name = 'max_connections';
See which clients are running queries in system.processes (one row per running query; idle connections do not appear), grouped by client_name or user, to narrow down a leak.
On ClickHouse Cloud, max_connections is managed per service tier; the same system.metrics query works in the SQL console, and the managed monitoring view shows connection utilisation over time.

Why our number may legitimately differ from a manual check:

Reason	Direction	Why
Snapshot timing	Either	Connection gauges move every connect/disconnect. A manual query a few seconds after the alert peak will read lower as the burst clears.
Sustained-minute gate	Card lower (fewer events)	The card only records breaches held for a full minute; a manual `watch` will catch sub-minute spikes the card intentionally ignores.
Protocol mix	Card may be higher	The card sums TCP, HTTP, MySQL, and PostgreSQL connection gauges; checking only `TCPConnection` understates the total.
max_connections source	Either	The card reads the live `system.server_settings` value; a stale config file on disk may show a different ceiling than what the server is actually running.

Cross-connector reconciliation:

Card	Expected relationship	What causes divergence
ClickHouse Pool Saturation vs Traffic Burst	Saturation events should align with storefront traffic peaks.	Saturation with no matching traffic burst points to an internal cause (leak, retry loop, dashboard storm) rather than genuine demand.

Known limitations / FAQs

My query latency looks fine but this card is red. How can both be true?
Easily, and it is the most important thing to understand about pool saturation. When the pool is full, ClickHouse refuses new connections before any query runs. Those refused clients never produce a query, so they never appear in latency or query-error metrics. The queries that did get in run normally. So latency stays green while a portion of your clients are locked out entirely. This card is the only place the lockout is visible.

Why does it need a full minute above 90%? I want to catch every spike.
Connection counts are inherently spiky: every client connect and disconnect moves the gauge, and a healthy server touches high utilisation briefly all the time. Firing on every momentary spike would bury you in noise. The one-minute sustain gate means the pool was genuinely starved of headroom, not just briefly busy. If you need finer sensitivity, the threshold is configurable in the Sensitivity tab.

Should I just raise max_connections to make the alerts stop?
Only as a stopgap. Raising the ceiling adds headroom and will silence storm-driven alerts, but it does nothing for a connection leak: a leaking client will eventually exhaust any ceiling, just more slowly. Diagnose first. If one client is holding hundreds of idle connections (the connection gauges sit far above the number of running queries in system.processes), fix the client’s pooling or set an idle timeout. If it is genuine concurrent demand, then raising the ceiling (and the underlying instance size) is the right call.

What is the difference between this and Connection Pool Saturation %?
Connection Pool Saturation % is the live gauge: where utilisation sits right now. This card is the incident log: the list of times utilisation crossed 90% and stayed there. Use the gauge for current state and this card for history and root-cause timing.

Does this count connections from all protocols?
Yes. The detector sums the native TCP protocol plus HTTP, and the MySQL and PostgreSQL wire-protocol gauges if those interfaces are enabled. They all draw from the same max_connections ceiling, so counting only one protocol would understate true saturation.

On ClickHouse Cloud I do not set max_connections myself. Does this still apply?
Yes. ClickHouse Cloud sets max_connections per service tier, and you can still saturate it with too many concurrent clients or a leak. The card reads the live limit and live connection counts the same way, and the same diagnosis (find the offending client, starting with system.processes, then fix pooling or scale the service) applies.

Tracked live in Vortex IQ Nerve Centre

Connection Pool at >90% Saturation is one of hundreds of KPI pulses Vortex IQ tracks across ClickHouse and 70+ other ecommerce connectors. Nerve Centre runs the detection layer; Vortex Mind investigates the cause when something moves; Ask Viq lets you interrogate any number in plain English. Start for free or book a demo to see this metric running on your own data.