At a glance
The alert feed for moments when the ClickHouse connection pool runs hot: concurrent connections sit above 90% of the configured max_connections ceiling for a sustained minute. When the pool saturates, new client connections are refused (you see Too many simultaneous queries or connection timeouts), so dashboards stall and ingest jobs back up. This card lists each saturation event with its start time, peak utilisation, and duration, so the on-call DBA can see whether the cause was a query storm, a connection leak, or genuine capacity demand.
| Data source | Concurrent connection count from system.metrics (the TCPConnection and HTTPConnection gauges) and system.asynchronous_metrics, compared against the server’s max_connections setting. Each breach of the 90% ratio sustained for one minute is recorded as one alert row. |
| What it tracks | Saturation events, not a continuous gauge. The live percentage lives on Connection Pool Saturation %; this card is the incident history of when that gauge crossed the danger line. |
| Why it matters | At >90% the pool has almost no headroom. The next connection burst (a deploy, a BI tool refreshing every panel at once, a retry loop) tips it to 100% and ClickHouse begins refusing connections. Refused connections are invisible to ClickHouse’s own query metrics because the query never starts, so this card is often the only place the event is visible. |
| Time window | RT (real-time; the detector evaluates the live ratio every cycle). |
| Alert trigger | >90% sustained 1m. Utilisation must hold above 90% of max_connections for a full minute to fire, which filters out harmless one-second spikes. |
| Roles | dba, platform, sre |
Calculation
The detector computes pool utilisation as live concurrent connections over the configured ceiling, saturation_pct = 100 × concurrent connections / max_connections, with the connection counts read from system.metrics.
An alert row is recorded when saturation_pct > 90 holds continuously for 60 seconds. The “sustained 1 minute” gate is deliberate: connection counts are spiky by nature (every client connect/disconnect moves the gauge), and a momentary touch of 91% is not an incident. A full minute above the line means the pool is genuinely starved of headroom, not just briefly busy. Each row records the start time, the peak utilisation reached, and the duration until utilisation fell back below 90%.
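The sustained-minute gate described above can be sketched in a few lines of Python. This is a minimal illustration, not the detector's actual implementation: the names `SaturationEvent` and `detect_events`, the sampling interval, and the choice to ignore an event still open at the end of the input are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class SaturationEvent:
    start: float       # timestamp (seconds) of the first sample above the threshold
    peak_pct: float    # highest utilisation seen during the event
    duration_s: float  # seconds from start until utilisation fell back to the threshold or below


def saturation_pct(connections: int, max_connections: int) -> float:
    """Pool utilisation as a percentage of the configured ceiling."""
    return 100.0 * connections / max_connections


def detect_events(
    samples: Iterable[Tuple[float, int]],
    max_connections: int,
    threshold_pct: float = 90.0,
    sustain_s: float = 60.0,
) -> List[SaturationEvent]:
    """Return one event per run of samples held above the threshold for at least sustain_s.

    `samples` is a time-ordered iterable of (timestamp_seconds, concurrent_connections).
    A run that is still above the threshold when the samples end is not emitted.
    """
    events: List[SaturationEvent] = []
    start: Optional[float] = None
    peak = 0.0
    for ts, conns in samples:
        pct = saturation_pct(conns, max_connections)
        if pct > threshold_pct:
            if start is None:
                start, peak = ts, pct      # a new breach begins
            else:
                peak = max(peak, pct)      # breach continues, track the peak
        elif start is not None:
            if ts - start >= sustain_s:    # only sustained breaches become events
                events.append(SaturationEvent(start, peak, ts - start))
            start = None                   # short spikes are dropped silently
    return events
```

Fed a ten-second spike followed by an eighty-second breach, the sketch yields a single event for the longer breach and drops the spike, which is the filtering behaviour the gate is meant to provide.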
Worked example
A platform team runs a ClickHouse instance with max_connections = 1000 serving both an ingest pipeline and a fleet of BI dashboards. Snapshot of the alert feed on 22 Apr 26.
| Started | Peak utilisation | Duration | Trigger |
|---|---|---|---|
| 22 Apr 26 08:31 BST | 97% (974 / 1000) | 6m 12s | Weekday 08:30 dashboard refresh storm |
| 21 Apr 26 14:02 BST | 93% (931 / 1000) | 2m 40s | BI tool retry loop after a slow query |
| 19 Apr 26 23:48 BST | 99% (994 / 1000) | 11m 30s | Connection leak in a reporting service |
- The 08:31 event is a recurring pattern. It lands every weekday at the same minute. This is the classic “everyone opens their dashboard at the start of the day” storm: dozens of BI panels each opening a connection simultaneously. It self-resolves in minutes but each occurrence risks refused connections at the peak.
- The 19 Apr event is the dangerous one. It ran for 11.5 minutes and peaked at 99%. A duration that long with no obvious traffic trigger points to a connection leak: a service opening connections and not closing them. Left unchecked, the next one hits 100% and starts refusing connections.
- None of these show up in query latency. A refused connection never becomes a query, so Query Latency p95 (ms) can look perfectly healthy while clients are being turned away. This card is the only place the impact is visible.
Raising max_connections adds headroom but does not stop a genuine leak from eventually exhausting any ceiling.
Three takeaways:
- Saturation is invisible in query metrics. Refused connections never become queries, so latency and error-rate cards can look green while clients are locked out. Trust this card as the canonical view of pool starvation.
- Duration distinguishes traffic from leaks. A short spike at a predictable time is a refresh storm; a long, slowly climbing event with no traffic trigger is a leak. Read the duration column, not just the peak.
- Headroom is the real metric. A pool living at 88% is one burst away from refusing connections. Pair this alert with Connection Pool Saturation % and keep steady-state utilisation well below 70%.
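The headroom point can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name `headroom` and its return keys are hypothetical, and it assumes the alert line is a strict "greater than 90%" comparison as described in the Calculation section.

```python
def headroom(connections: int, max_connections: int, alert_pct: float = 90.0) -> dict:
    """How many more connections fit before the alert line and before the ceiling."""
    # Highest connection count that still sits at or below the alert line.
    alert_line = int(max_connections * alert_pct // 100)
    return {
        "utilisation_pct": 100.0 * connections / max_connections,
        "until_alert_line": max(0, alert_line - connections),
        "until_refusal": max(0, max_connections - connections),
    }


# A pool living at 88% of a 1000-connection ceiling:
print(headroom(880, 1000))
# A pool at 70%:
print(headroom(700, 1000))
```

With a ceiling of 1000, a pool at 880 connections has room for only 20 more before crossing the alert line and 120 before the ceiling, while a pool at 700 has 200 and 300 respectively.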
Sibling cards
| Card | Why pair it with this alert | What the combination tells you |
|---|---|---|
| Connection Pool Saturation % | The live gauge this alert watches. | The gauge shows current headroom; this card shows when headroom ran out. |
| Connections In Use | The raw concurrent-connection count. | Connections climbing with no query growth equals a leak, not load. |
| Queries per Second (live) | The query throughput driving connections. | Saturation with flat QPS equals connections held open without doing work. |
| ClickHouse Pool Saturation vs Traffic Burst | The cross-channel view tying saturation to storefront traffic. | Saturation during a traffic burst equals genuine demand; saturation without one equals an internal problem. |
| Query Error Rate % | Errors from queries that did start. | Saturation plus rising errors equals the pool refusing some clients while others time out mid-query. |
| Failed Queries (24h) | The 24-hour failure tally. | Spikes here aligned with saturation windows confirm the pool is the cause. |
| ClickHouse Health Score | The composite that weights pool saturation. | A sustained saturation event drags the composite below 70. |
Reconciling against the source
Where to look in ClickHouse’s own tooling:
Check live connection counts against the ceiling in clickhouse-client by querying system.metrics.
See who is connected and what they are running in system.processes (one row per active query/connection), grouped by client_name or user to find a leak.
On ClickHouse Cloud, max_connections is managed per service tier; the same system.metrics query works in the SQL console, and the managed monitoring view shows connection utilisation over time.
Why our number may legitimately differ from a manual check:
| Reason | Direction | Why |
|---|---|---|
| Snapshot timing | Either | Connection gauges move every connect/disconnect. A manual query a few seconds after the alert peak will read lower as the burst clears. |
| Sustained-minute gate | Card lower (fewer events) | The card only records breaches held for a full minute; a manual watch will catch sub-minute spikes the card intentionally ignores. |
| Protocol mix | Card may be higher | The card sums TCP, HTTP, MySQL, and PostgreSQL connection gauges; checking only TCPConnection understates the total. |
| max_connections source | Either | The card reads the live system.server_settings value; a stale config file on disk may show a different ceiling than what the server is actually running. |
| Card | Expected relationship | What causes divergence |
|---|---|---|
| ClickHouse Pool Saturation vs Traffic Burst | Saturation events should align with storefront traffic peaks. | Saturation with no matching traffic burst points to an internal cause (leak, retry loop, dashboard storm) rather than genuine demand. |
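A manual reconciliation along these lines might look like the sketch below. The two SQL strings are examples of what one could run by hand in clickhouse-client and are not executed by the code; the metric names and the system.server_settings lookup should be verified against your server version. The helper `manual_saturation_pct` and the sample gauge values are hypothetical.

```python
from typing import Dict, Iterable, Optional

# Example queries for a manual check (assumed shape; not executed here).
CONNECTIONS_SQL = """
SELECT metric, value
FROM system.metrics
WHERE metric IN ('TCPConnection', 'HTTPConnection',
                 'MySQLConnection', 'PostgreSQLConnection')
"""
CEILING_SQL = "SELECT value FROM system.server_settings WHERE name = 'max_connections'"


def manual_saturation_pct(
    gauges: Dict[str, int],
    max_connections: int,
    protocols: Optional[Iterable[str]] = None,
) -> float:
    """Utilisation from a dict of connection gauges.

    If `protocols` is given, only those gauges are summed, which shows how
    checking a single protocol understates the total.
    """
    names = list(protocols) if protocols is not None else list(gauges)
    total = sum(gauges.get(name, 0) for name in names)
    return 100.0 * total / max_connections


# Hypothetical readings copied from the first query's output.
gauges = {"TCPConnection": 600, "HTTPConnection": 250,
          "MySQLConnection": 60, "PostgreSQLConnection": 21}
all_protocols = manual_saturation_pct(gauges, 1000)
tcp_only = manual_saturation_pct(gauges, 1000, ["TCPConnection"])
```

In this made-up reading the full sum puts the pool above the 90% line while the TCP-only figure sits far below it, which is the protocol-mix divergence listed in the table.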
Known limitations / FAQs
My query latency looks fine but this card is red. How can both be true?
Easily, and it is the most important thing to understand about pool saturation. When the pool is full, ClickHouse refuses new connections before any query runs. Those refused clients never produce a query, so they never appear in latency or query-error metrics. The queries that did get in run normally. So latency stays green while a portion of your clients are locked out entirely. This card is the only place the lockout is visible.
Why does it need a full minute above 90%? I want to catch every spike.
Connection counts are inherently spiky: every client connect and disconnect moves the gauge, and a healthy server touches high utilisation briefly all the time. Firing on every momentary spike would bury you in noise. The one-minute sustain gate means the pool was genuinely starved of headroom, not just briefly busy. If you need finer sensitivity, the threshold is configurable in the Sensitivity tab.
Should I just raise max_connections to make the alerts stop?
Only as a stopgap. Raising the ceiling adds headroom and will silence storm-driven alerts, but it does nothing for a connection leak: a leaking client will eventually exhaust any ceiling, just more slowly. Diagnose first. If system.processes shows one client holding hundreds of idle connections, fix the client’s pooling or set an idle timeout. If it is genuine concurrent demand, then raising the ceiling (and the underlying instance size) is the right call.
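The "diagnose first" step usually comes down to counting rows per client. As a rough sketch, assuming the rows have already been fetched from system.processes into dictionaries with `user` and `client_name` keys (the sample rows and the `top_clients` helper are hypothetical), the grouping could be done like this:

```python
from collections import Counter
from typing import Dict, List, Tuple


def top_clients(rows: List[Dict[str, str]], n: int = 3) -> List[Tuple[Tuple[str, str], int]]:
    """Count rows per (user, client_name) pair and return the n largest holders."""
    counts = Counter((row["user"], row["client_name"]) for row in rows)
    return counts.most_common(n)


# Hypothetical rows standing in for a system.processes result set.
rows = (
    [{"user": "reporting", "client_name": "reporting-svc"}] * 5
    + [{"user": "bi", "client_name": "bi-tool"}] * 2
)
for (user, client), count in top_clients(rows):
    print(f"{user:<12} {client:<16} {count}")
```

A single pair dominating the count is the pattern to look for before deciding whether to fix a client or raise the ceiling.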
What is the difference between this and Connection Pool Saturation %?
Connection Pool Saturation % is the live gauge: where utilisation sits right now. This card is the incident log: the list of times utilisation crossed 90% and stayed there. Use the gauge for current state and this card for history and root-cause timing.
Does this count connections from all protocols?
Yes. The detector sums the native TCP protocol plus HTTP, and the MySQL and PostgreSQL wire-protocol gauges if those interfaces are enabled. They all draw from the same max_connections ceiling, so counting only one protocol would understate true saturation.
On ClickHouse Cloud I do not set max_connections myself. Does this still apply?
Yes. ClickHouse Cloud sets max_connections per service tier, and you can still saturate it with too many concurrent clients or a leak. The card reads the live limit and live connection counts the same way, and the same diagnosis (find the offending client in system.processes, fix pooling, or scale the service) applies.