At a glance
Connection Pool Saturation % is the share of available client connection slots currently in use on the ClickHouse instance. For a platform team, this is “how close are we to refusing new queries?” ClickHouse caps concurrent connections via max_connections (and the HTTP/native listener backlog). When the pool fills, new client connections are queued or rejected, so dashboards stall, ingest workers retry, and downstream services see timeouts even though CPU and disk look fine. At 90% saturation you are one traffic burst away from refused connections.
| What it tracks | The ratio of currently held connections to the configured connection ceiling, expressed as a percentage. Pulled from system.metrics (TCPConnection, HTTPConnection, MySQLConnection, PostgreSQLConnection, InterserverConnection) against the server’s max_connections setting. |
| Data source | Connection Pool Saturation % for the selected period, computed live from system.metrics connection gauges divided by the max_connections value read from system.server_settings. |
| Metric basis | Live connection count, not query count. A single connection can run many queries; a connection held open by an idle client still occupies a slot. This card measures slots, not work. |
| Aggregation window | Real-time gauge, sampled every minute (RT/1m). The headline shows the latest sample; the sparkline shows the 1-minute trend. |
| Time window | RT/1m (real-time, 1-minute sampling) |
| Alert trigger | > 90%. Sustained saturation above 90% pages the platform on-call because connection refusals are imminent. |
| What counts | All active client-facing connections (native TCP on 9000, HTTP on 8123, plus MySQL/PostgreSQL wire-protocol listeners if enabled) and interserver connections. |
| What does NOT count | Closed/idle-reaped connections, background merge threads, and replication fetches that do not occupy a client connection slot. |
| Roles | owner, engineering, operations |
Calculation
The engine reads the current connection gauges from system.metrics, sums them, and divides the total by the configured ceiling, max_connections (default 1024 on self-managed builds, often tuned higher on ClickHouse Cloud services). The result is expressed as a percentage. The card refreshes the sample every 60 seconds. On ClickHouse Cloud the ceiling is set by the service tier rather than a directly editable setting, so the engine reads the effective limit reported by the service. See the At a glance summary for what the metric tracks and the worked example below for a typical reading.
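The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the engine's actual code: the gauge names are the system.metrics names listed in At a glance, while `CONNECTION_GAUGES` and `saturation_pct` are hypothetical names introduced here.

```python
from typing import Mapping

# Gauge names as listed for system.metrics in this article.
CONNECTION_GAUGES = (
    "TCPConnection",
    "HTTPConnection",
    "MySQLConnection",
    "PostgreSQLConnection",
    "InterserverConnection",
)


def saturation_pct(gauges: Mapping[str, int], max_connections: int) -> float:
    """Return held connections as a percentage of the configured ceiling.

    Hypothetical helper: gauges maps metric name to its current value,
    and any metric not in CONNECTION_GAUGES is ignored.
    """
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    in_use = sum(gauges.get(name, 0) for name in CONNECTION_GAUGES)
    return 100.0 * in_use / max_connections
```

Listeners that are not enabled contribute zero to the numerator, so the same function covers native-only and mixed-protocol deployments.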
Worked example
A DBA team runs a 3-node ClickHouse cluster backing a real-time analytics product. max_connections is set to 1024 per node. The application uses a connection pool of 200 per app instance, with 6 app instances, plus a fleet of BI dashboards that each hold a long-lived HTTP connection. Snapshot taken on 14 Apr 26 at 09:42 BST during the morning reporting peak.
| Connection type | Live count | Notes |
|---|---|---|
| TCPConnection (native) | 612 | App pool plus ingest workers |
| HTTPConnection | 318 | BI dashboards, ad-hoc analysts |
| InterserverConnection | 21 | Replication and distributed query fan-out |
| Total in use | 951 | |
| max_connections | 1024 | |
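The arithmetic behind this snapshot can be checked with a short script. The figures are the ones in the table; the variable names are illustrative only.

```python
# Live counts from the worked-example table.
live = {"TCPConnection": 612, "HTTPConnection": 318, "InterserverConnection": 21}
max_connections = 1024

in_use = sum(live.values())                              # total slots held
free_slots = max_connections - in_use                    # slots left before refusals
saturation = round(100 * in_use / max_connections, 1)    # percentage, one decimal
breaches_alert = saturation > 90                         # alert threshold from At a glance

print(f"{in_use}/{max_connections} = {saturation}% ({free_slots} slots free)")
# prints "951/1024 = 92.9% (73 slots free)"
```

The 73 free slots are the number to watch: a refresh wave that opens more than that many new connections hits the ceiling.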
- The headline is a leading indicator, not a failure yet. At 92.9% (951 of 1024 slots) the server is still serving every connection. But the next dashboard refresh wave (BI tools tend to refresh on the hour) will push demand past 1024, at which point native clients get DB::Exception: Too many simultaneous queries / connections and HTTP clients get connection resets. The team has minutes, not hours.
- Idle dashboard connections are the cheapest win. 318 HTTP connections for a team of 40 analysts means roughly 8 long-lived connections per analyst, most idle. Lowering the BI tool’s pool size or enabling idle-connection reaping (idle_connection_timeout) frees slots without touching the application.
- Pool saturation rarely tracks CPU. Check Memory Usage % and Queries per Second (live) alongside this card. If QPS is flat but saturation is climbing, the problem is connection leakage (clients opening connections and not returning them to the pool), not load. If QPS is spiking too, it is genuine demand and you should scale the connection ceiling or add a node.
- To add headroom, either (a) raise max_connections if RAM allows (each connection has a modest memory cost), or (b) shed idle connections by tightening client-side pool limits and idle timeouts, or (c) front the cluster with a connection-pooling proxy (such as chproxy) so thousands of clients share a bounded set of server connections.
Sibling cards platform teams should reference together
| Card | Why pair it with Connection Pool Saturation | What the combination tells you |
|---|---|---|
| Connections In Use | The raw numerator behind this percentage. | Absolute count plus ceiling tells you exactly how many free slots remain, not just the ratio. |
| Connection Pool at >90% Saturation | The alert-list companion that records each breach. | A single spike is noise; repeated breaches in the alert list mean a structural capacity problem. |
| Queries per Second (live) | Demand context for the saturation. | Saturation rising with QPS equals genuine load; saturation rising with flat QPS equals connection leakage. |
| Memory Usage % | Each connection costs memory; raising the ceiling has a memory cost. | Tells you whether you have headroom to raise max_connections safely. |
| Query Latency p95 (ms) | The downstream symptom when the pool is contended. | Latency climbing alongside saturation means clients are queuing for connection slots. |
| ClickHouse Health Score | The composite that weights saturation as a capacity input. | Sustained saturation drags the overall health score down. |
| ClickHouse Pool Saturation vs Traffic Burst | The cross-channel view tying saturation to storefront traffic. | Confirms whether a saturation spike lines up with a real demand burst or a runaway client. |
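The pairing with Queries per Second (live) amounts to a simple decision rule, sketched below. The function name, its inputs, and the 5% "flat" tolerance are assumptions made for illustration; the cards themselves do not expose this logic.

```python
def classify_saturation_rise(sat_delta_pct: float, qps_delta_pct: float,
                             flat_band_pct: float = 5.0) -> str:
    """Label a window by comparing the change in saturation with the change in QPS.

    sat_delta_pct: change in saturation over the window, in percentage points.
    qps_delta_pct: relative change in QPS over the window, in percent.
    flat_band_pct: hypothetical tolerance below which a change counts as flat.
    """
    if sat_delta_pct <= flat_band_pct:
        return "stable"                      # saturation is not meaningfully rising
    if qps_delta_pct > flat_band_pct:
        return "genuine load"                # saturation and QPS rising together
    return "possible connection leak"        # saturation rising, QPS flat
```

For example, `classify_saturation_rise(12, 1)` returns `"possible connection leak"`, the case where clients are holding slots without doing more work.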
Reconciling against the source
Where to look in ClickHouse’s own tooling:
- system.metrics for the live connection gauges. Run SELECT metric, value FROM system.metrics WHERE metric LIKE '%Connection%' to see every connection counter the server exposes.
- system.server_settings to confirm the effective max_connections ceiling: SELECT name, value, changed FROM system.server_settings WHERE name = 'max_connections'.
- SHOW PROCESSLIST or system.processes to see what each live connection is actually doing right now.
- ClickHouse Cloud console (managed service): the Metrics tab surfaces connection counts per service; the ceiling is governed by the service tier rather than a user-editable setting.

Why our number may legitimately differ from a direct query:
| Reason | Direction | Why |
|---|---|---|
| Sampling lag | Brief gaps | The card samples every 60 seconds; a system.metrics query you run by hand reflects the exact instant, which may differ from the last sample. |
| Per-node vs cluster | Variable | On a multi-node cluster the card reports the worst-case node by default; a single-node query reflects only that node. |
| Ceiling source on Cloud | Variable | On ClickHouse Cloud max_connections is not always directly readable; the engine uses the service’s effective limit, which the console may display differently. |
| Interserver connections | Our number slightly higher | The card includes InterserverConnection in the numerator; some manual queries count only client-facing listeners. |
| Card | Expected relationship | What causes divergence |
|---|---|---|
| ClickHouse Pool Saturation vs Traffic Burst | Saturation spikes should line up with storefront traffic bursts. | Saturation high with flat traffic means an internal client leak, not shopper demand. |
| Storefront traffic / order-rate cards | A genuine demand surge raises both saturation and order rate together. | Saturation alone, with no order surge, points at a dashboard storm or runaway BI job. |
Known limitations / FAQs
My CPU and disk look fine but this card is red. How can the server be saturated?
Connection saturation is independent of compute. The pool measures slots, not work. A few hundred idle BI dashboard connections can fill the pool while CPU sits at 10%. The fix is not more compute; it is fewer held connections (tighten client pools, enable idle reaping) or a higher ceiling.
What is the difference between connection saturation and concurrent-query limits?
max_connections caps open connections; max_concurrent_queries caps queries running at once. You can hit either independently. A client can hold a connection without running a query (idle), or one connection can submit many queries. This card tracks the connection ceiling; concurrency limits surface as query-side errors instead.
How do I safely raise max_connections?
Each connection carries a memory cost (thread stack plus buffers). Before raising the ceiling, check Memory Usage %. On self-managed builds, edit max_connections in the server config and reload; on ClickHouse Cloud the ceiling is tied to the service tier, so you scale the service rather than the setting. A connection-pooling proxy (chproxy) is often a better answer than a higher ceiling because it bounds server connections regardless of client count.
Does this card cover the HTTP interface as well as native?
Yes. The numerator sums TCPConnection (native, port 9000), HTTPConnection (port 8123), and the MySQL/PostgreSQL wire-protocol listeners if you have them enabled, plus interserver connections. If your fleet is HTTP-heavy (most BI tools), the HTTPConnection gauge usually dominates.
On ClickHouse Cloud I cannot find max_connections. What is the denominator?
ClickHouse Cloud manages the connection ceiling per service tier, so it is not always a directly editable setting. The card uses the effective limit reported by the service. If you need more headroom on Cloud, scale the service up rather than editing a config value.
The alert fired once at 91% then cleared. Should I worry?
A single brief spike to 91% that clears on its own is usually a refresh wave, not a problem. The alert is tuned to sustained saturation above 90% for a full minute. Use the Connection Pool at >90% Saturation alert list to see whether breaches are isolated or recurring; recurring breaches mean you are running too close to the ceiling and should add headroom.
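One way to separate isolated spikes from recurring breaches in a series of 1-minute samples is to count runs of consecutive samples above the threshold. The sketch below is an assumption about how "sustained" could be evaluated, not the alert's actual implementation; `count_breaches` and the `min_consecutive` default of 2 are hypothetical.

```python
from typing import Sequence


def count_breaches(samples: Sequence[float], threshold: float = 90.0,
                   min_consecutive: int = 2) -> int:
    """Count runs of at least min_consecutive samples strictly above threshold."""
    breaches = 0
    run = 0
    for value in samples:
        if value > threshold:
            run += 1
            if run == min_consecutive:   # count each qualifying run once
                breaches += 1
        else:
            run = 0                      # a sample at or below threshold ends the run
    return breaches
```

With these assumptions, `count_breaches([85, 91, 88, 86])` returns 0 (one sample above 90% that clears), while a series with two separate multi-sample runs returns 2 and suggests the ceiling is too close.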
Why does the multi-node cluster show one number when nodes differ?
By default the card reports the worst-case (highest-saturation) node, because the cluster refuses connections when any single node fills. To see per-node detail, query system.metrics on each node directly or use the cluster breakdown in the Cloud console.
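The worst-case roll-up can be illustrated as follows. This sketch assumes every node shares the same max_connections, as in the worked example; `worst_node` and the node names and counts in the usage line are hypothetical.

```python
from typing import Mapping, Tuple


def worst_node(per_node_in_use: Mapping[str, int],
               max_connections: int) -> Tuple[str, float]:
    """Return (node, saturation %) for the node closest to its ceiling.

    Assumes one shared ceiling for all nodes; per-node ceilings would
    need a ratio computed per node before taking the maximum.
    """
    if not per_node_in_use:
        raise ValueError("at least one node is required")
    node = max(per_node_in_use, key=per_node_in_use.get)
    return node, 100.0 * per_node_in_use[node] / max_connections
```

For instance, `worst_node({"node-1": 951, "node-2": 640, "node-3": 702}, 1024)` picks `node-1`, so the headline would show that node's saturation even though the other two have ample headroom.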