At a glance
This card lines ClickHouse connection-pool saturation up against the storefront traffic burst that is driving it, row by row, so you can see when the database is about to become the bottleneck during exactly the windows that matter for revenue. Pool saturation alone tells you the database is busy; the traffic context tells you whether that busy-ness coincides with a sales-critical moment (a campaign send, a flash sale, a homepage feature). When saturation crosses 90% while traffic is bursting, queued connections start to wait, dashboards and analytics-backed storefront features stall, and the slowdown lands at the worst possible time. This is the join between an infrastructure metric and a commercial one.
| Field | Detail |
|---|---|
| Data source | ClickHouse connection-pool saturation (active connections against the configured pool limit, from system.metrics) joined against the storefront traffic-burst signal over the same window and presented row by row. |
| What it tracks | Pool saturation percentage per interval set side by side with the concurrent traffic level, so a DBA sees both the cause (traffic) and the effect (saturation) in one table. |
| Metric basis | Real-time connection counts from system.metrics (active vs configured maximum) correlated with the traffic-burst signal; this is a correlation card, not a single counter. |
| Why it matters | Saturation at a quiet hour is a tuning note; saturation during a traffic burst is revenue at risk, because the queries powering live storefront and analytics features queue precisely when shoppers are most active. |
| Time window | 15m (a short rolling window so a burst and its saturation are caught while they are still actionable). |
| Alert trigger | >90% during traffic burst. Saturation above 90% that co-occurs with a traffic burst flags the card amber and pages the on-call DBA. |
| Roles | dba, platform, sre |
Calculation
The engine computes pool saturation as active connections over the configured pool ceiling and aligns it to the storefront traffic signal on the same time buckets. saturation_pct is the active connection count (TCP plus HTTP) as a percentage of the instance’s configured max_connections. The traffic-burst side comes from the correlated storefront connector (the same time buckets), and the card places the two next to each other per row.
The alert does not fire on saturation alone: it fires only when saturation exceeds 90% and the traffic signal is in a burst state for the same bucket. That conjunction is the point. A pool at 95% at 03:00 with no traffic is a config note (perhaps a runaway batch job). A pool at 95% during a campaign-driven surge is a live commercial risk, because connection waits delay the queries behind storefront and analytics features while the most shoppers are present.
The 15-minute window is deliberately short. Bursts are transient, and a saturation spike that has already passed is a post-mortem, not an alert. Holding the window tight keeps the card focused on the saturation that is happening now, against the traffic that is happening now.
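The calculation and the alert conjunction described above can be sketched in a few lines. This is a minimal illustration, not the engine's actual code: the Bucket type, its field names, and the function names are hypothetical, and the only values taken from this card are the TCP-plus-HTTP sum, the max_connections ceiling, and the 90% threshold.

```python
from dataclasses import dataclass

SATURATION_THRESHOLD_PCT = 90.0  # from the card's alert trigger


@dataclass(frozen=True)
class Bucket:
    """One time bucket of the card (field names are illustrative)."""
    tcp_connections: int
    http_connections: int
    max_connections: int
    traffic_burst: bool  # burst flag supplied by the storefront connector


def saturation_pct(bucket: Bucket) -> float:
    """Active connections (TCP plus HTTP) as a percentage of max_connections."""
    active = bucket.tcp_connections + bucket.http_connections
    return 100.0 * active / bucket.max_connections


def should_page(bucket: Bucket) -> bool:
    """Page only when saturation exceeds the threshold AND a burst is in progress."""
    return saturation_pct(bucket) > SATURATION_THRESHOLD_PCT and bucket.traffic_burst


# Same pool pressure, different traffic context.
quiet_night = Bucket(tcp_connections=150, http_connections=40,
                     max_connections=200, traffic_burst=False)
campaign = Bucket(tcp_connections=150, http_connections=40,
                  max_connections=200, traffic_burst=True)

print(saturation_pct(quiet_night), should_page(quiet_night))  # 95.0 False
print(saturation_pct(campaign), should_page(campaign))        # 95.0 True
```

Both buckets sit at 95%, but only the one flagged as a burst would page, which is the behaviour the card describes.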
Worked example
A platform team runs a self-managed ClickHouse instance that powers live merchandising and analytics widgets for a Shopify storefront. A scheduled email campaign goes out at 11:00. Snapshot taken on 14 Apr 26 between 10:55 and 11:10 BST, with max_connections configured at 200.
| Bucket (BST) | Active connections | Saturation | Traffic state | Note |
|---|---|---|---|---|
| 10:55 | 96 | 48% | normal | baseline |
| 11:00 | 142 | 71% | burst start | campaign send lands |
| 11:03 | 178 | 89% | burst | climbing fast |
| 11:05 | 194 | 97% | burst | over threshold |
| 11:08 | 188 | 94% | burst | still saturated |
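The saturation column in the table above can be re-derived from the active connection counts and the ceiling of 200. The short check below does that and lists the buckets that meet the alert condition. It assumes, for illustration only, that both "burst start" and "burst" count as a burst state; the variable and function names are hypothetical.

```python
MAX_CONNECTIONS = 200
THRESHOLD_PCT = 90

# (bucket, active connections, traffic state) copied from the table above
rows = [
    ("10:55", 96, "normal"),
    ("11:00", 142, "burst start"),
    ("11:03", 178, "burst"),
    ("11:05", 194, "burst"),
    ("11:08", 188, "burst"),
]


def saturation(active: int) -> int:
    """Saturation as a whole-number percentage of the configured ceiling."""
    return round(100 * active / MAX_CONNECTIONS)


# Assumption: any state beginning with "burst" is treated as a burst.
paging = [
    bucket
    for bucket, active, state in rows
    if state.startswith("burst") and saturation(active) > THRESHOLD_PCT
]

for bucket, active, state in rows:
    print(bucket, f"{saturation(active)}%", state)
print("pages:", paging)  # pages: ['11:05', '11:08']
```

The 11:03 bucket at 89% is inside the burst but below the threshold, so only 11:05 and 11:08 satisfy the conjunction.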
- The cause is the campaign, not a leak. Saturation tracks the traffic curve closely, rising from 48% to 97% as the campaign-driven session surge hits the storefront and every session fans out into analytics-widget queries against ClickHouse.
- The pool is the bottleneck, not the queries. Individual query latency is still acceptable; the problem is that there are not enough connection slots, so new requests queue. This shows up as a small rise in Query Latency p95 (ms) from queue wait, not from heavy queries.
- This is the revenue-critical window. The campaign exists to drive sales; if the storefront widgets stall now, the campaign’s own traffic is degraded. That is why this conjunction pages, where a quiet-hour 97% would not.
Raising max_connections buys immediate headroom, but a large enough burst will still find the ceiling. The most useful operational change is to pre-scale or pre-warm ahead of scheduled campaign sends, because the timing is known in advance.
Three takeaways:
- Saturation is only a crisis in context. 97% at a quiet hour is a tuning note; 97% during a campaign burst is revenue at risk. This card supplies the context that turns a number into a decision.
- Pool exhaustion delays queries even when the queries are fine. The fix is connection headroom or fewer connections, not query tuning, when latency rises from queue wait rather than heavy work.
- Known bursts should be pre-empted. Scheduled campaigns are predictable; pre-scaling or caching ahead of the send turns a recurring amber into a non-event.
Sibling cards
| Card | Why pair it with Pool Saturation vs Traffic Burst | What the combination tells you |
|---|---|---|
| Connection Pool Saturation % | The standalone saturation gauge without the traffic context. | This card adds the “is it a burst?” question; the gauge is the raw number. |
| Connection Pool at >90% Saturation | The Nerve Centre alert that pages on sustained saturation. | The alert is the paging surface; this card explains whether the cause is commercial traffic. |
| Connections In Use | The absolute connection count behind the percentage. | Rising connections in use plus a burst confirms traffic, not a leak, is driving saturation. |
| Query Latency p95 (ms) | The latency that queue wait inflates during saturation. | p95 rising from queue wait (not heavy queries) confirms the pool, not the workload, is the limit. |
| Queries per Second (live) | The query inflow a burst produces. | QPS spiking in step with saturation ties the pool pressure to query volume. |
| ClickHouse QPS Spike vs Ecom Order Rate | The sibling cross-channel card that separates real traffic from bot storms. | A QPS spike with no order spike means the saturation is bot-driven, not revenue-critical. |
| ClickHouse Health Score | The composite that weights pool pressure. | Sustained burst-time saturation pulls the composite down. |
Reconciling against the source
Where to look in ClickHouse’s own tooling:
Read the live connection counts in clickhouse-client by selecting the TCPConnection and HTTPConnection rows from system.metrics. Compare against the configured ceiling with SELECT name, value FROM system.server_settings WHERE name = 'max_connections'. For the time-bucketed view the card uses, query system.metric_log for CurrentMetric_TCPConnection and CurrentMetric_HTTPConnection over the window. On ClickHouse Cloud, the same metrics are visible in the SQL console, and the managed monitoring view surfaces connection utilisation; the traffic-burst side of this card comes from your storefront connector, not from ClickHouse, so reconcile that half against the storefront analytics.
Why our number may legitimately differ from a manual query:
| Reason | Direction | Why |
|---|---|---|
| Snapshot timing | Slightly higher or lower | Connection counts move continuously during a burst; a single manual read can land between the peaks the card’s bucketed max captures. |
| Which connection types counted | Card may be higher | The card sums TCP (native protocol) and HTTP connections; a manual query that reads only TCPConnection undercounts an HTTP-heavy workload. |
| Per-node scope | Card matches its configured node | On a cluster, connections are per node; a manual query on a different replica reflects that replica only. |
| Traffic-side alignment | Conjunction may differ | The “burst” flag depends on the storefront connector’s window; if its time zone or window differs from your manual check, the co-occurrence can look offset. |
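Two of the rows above, snapshot timing and which connection types are counted, are easy to see with numbers. The sketch below uses made-up samples for a single bucket (the values, the ceiling of 200, and all names are hypothetical) and compares a bucketed maximum against a single manual read and against a TCP-only read.

```python
MAX_CONNECTIONS = 200

# Hypothetical samples inside one bucket: (tcp, http) connection counts.
samples = [(60, 95), (70, 118), (66, 128), (64, 110)]


def pct(active: int) -> float:
    """Connections as a percentage of the configured ceiling."""
    return 100.0 * active / MAX_CONNECTIONS


# What a bucketed max over TCP plus HTTP would report.
bucket_max = max(pct(tcp + http) for tcp, http in samples)

# One manual read taken late in the bucket, after the peak.
single_read = pct(sum(samples[-1]))

# A manual query that reads only TCPConnection and ignores HTTP.
tcp_only = max(pct(tcp) for tcp, _ in samples)

print(bucket_max)   # 97.0
print(single_read)  # 87.0
print(tcp_only)     # 35.0
```

In this illustration the bucket peaks above the 90% threshold, the late manual read lands below it, and a TCP-only read of an HTTP-heavy workload looks nowhere near saturated.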
| Card | Expected relationship | What causes divergence |
|---|---|---|
| shopify.total_revenue / bigcommerce.total_revenue | A genuine traffic burst that saturates the pool should coincide with rising sessions and orders on the storefront. | Saturation bursting with no matching storefront traffic means the load is internal (a batch job or dashboard storm), not shopper-driven; treat it as a tuning issue, not revenue risk. |
| ClickHouse QPS Spike vs Ecom Order Rate | Saturation during a real burst pairs with both a QPS spike and an order spike. | A QPS spike and saturation with flat orders points at bot traffic or a runaway dashboard, which changes the response from “add capacity” to “block the source”. |
Known limitations / FAQs
Why does this card not page when saturation hits 95% overnight?
By design. The alert fires on saturation above 90% only when it co-occurs with a traffic burst. A 95% reading at 03:00 with no storefront traffic is almost always an internal cause (a heavy batch job, a stuck dashboard tab) and is not a revenue risk, so it is surfaced as context rather than a page. If you also want to be paged on saturation regardless of traffic, use Connection Pool at >90% Saturation.
Is high saturation the same as the database being slow?
Not directly. Saturation measures how full the connection pool is. The queries themselves may still run quickly; the symptom of a full pool is that new requests queue for a connection slot, which adds wait time before the query even starts. That is why a saturation amber can coincide with a modest p95 rise from queue wait rather than from heavy query work.
The traffic side looks delayed compared to the saturation side. Why?
The two halves come from different systems. ClickHouse connection metrics are real time; the storefront traffic signal arrives through the storefront connector, which may have its own refresh cadence and time-zone alignment. Small offsets between the two curves are normal. The card aligns them on shared buckets, but a difference in window boundaries can make one side appear to lead the other.
My pool is saturated but max_connections looks high. What is happening?
Either the burst is genuinely large enough to fill even a high ceiling, or connections are not being returned to the pool promptly (long-running queries or a client that holds connections open). Check Top 10 Slowest Queries: a few very long queries each hold a slot for their full duration and can saturate a large pool with surprisingly little concurrency.
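The arithmetic behind that last point is Little's law: in steady state, the average number of slots in use equals the arrival rate multiplied by the average time each query holds its slot. The figures below are invented for illustration, including the ceiling of 1000, and the function name is hypothetical.

```python
def slots_in_use(queries_per_second: float, hold_seconds: float) -> float:
    """Little's law: average occupancy = arrival rate x average time held."""
    return queries_per_second * hold_seconds


MAX_CONNECTIONS = 1000  # hypothetical "high" ceiling

# Many quick widget queries: 400 per second, each holding a slot for 0.25 s.
fast = slots_in_use(queries_per_second=400, hold_seconds=0.25)

# A handful of very long queries: 5 per second, each holding a slot for 200 s.
slow = slots_in_use(queries_per_second=5, hold_seconds=200)

print(fast, fast / MAX_CONNECTIONS)  # 100.0 0.1
print(slow, slow / MAX_CONNECTIONS)  # 1000 1.0
```

Under these assumed numbers, 400 fast queries per second occupy a tenth of the pool, while 5 slow queries per second fill it completely.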
Should I just keep raising max_connections?
Headroom helps, but a large enough burst will always find the ceiling, and every connection still consumes server resources. The more durable fix for storefront-driven bursts is to stop each shopper session from translating directly into a ClickHouse connection: cache the widget queries so a surge in sessions does not become a surge in connections. Raise the ceiling for known campaign windows, but pair it with caching.
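One way to picture the caching advice is a small time-to-live cache in front of the widget query. This is a sketch under stated assumptions, not a recommendation of a specific library: TTLCache, run_widget_query, the "top_sellers" key, and the 30-second TTL are all hypothetical, and the class is not thread-safe.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Tiny time-to-live cache; a sketch, not a production cache."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_load(self, key: str, load: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]        # fresh entry: no ClickHouse connection used
        value = load()           # miss or expired: one backend query
        self._store[key] = (now, value)
        return value


backend_calls = 0


def run_widget_query() -> list:
    """Stand-in for the real ClickHouse query behind a storefront widget."""
    global backend_calls
    backend_calls += 1
    return ["sku-1", "sku-2"]


cache = TTLCache(ttl_seconds=30)
for _ in range(500):             # 500 shopper sessions hit the same widget
    cache.get_or_load("top_sellers", run_widget_query)

print(backend_calls)             # 1
```

With the cache in place, 500 sessions inside the TTL produce a single backend query, so a surge in sessions does not translate one-to-one into connections. The trade-off is that widget data can be up to one TTL stale.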
Does this work on ClickHouse Cloud where I do not manage connections directly?
Yes. Cloud still exposes connection metrics in system.metrics and the monitoring view, so the saturation side reads the same way. The difference is the lever: on Cloud you scale the instance or rely on managed autoscaling rather than hand-editing max_connections. The traffic-burst correlation is identical because it comes from your storefront connector.
What counts as a “traffic burst”?
The traffic-burst signal comes from the correlated storefront connector and represents a short-window surge above the recent baseline (for example a campaign send, a flash sale, or a homepage feature driving a session spike). It is the same burst concept used by the other cross-channel ClickHouse cards, which is what lets you read pool pressure, QPS, and order rate against a single shared notion of “is this a busy moment for the business?”.
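The exact rule the storefront connector uses is not specified here. As a purely illustrative reading of "a short-window surge above the recent baseline", the sketch below flags a burst when the current bucket exceeds a multiple of the recent mean. The function name, the factor of 2.0, and the session counts are assumptions, not the connector's actual logic.

```python
from statistics import mean
from typing import Sequence


def is_burst(recent_sessions: Sequence[int], current_sessions: int,
             factor: float = 2.0) -> bool:
    """True when the current bucket exceeds `factor` times the recent mean.

    Illustrative only: the real connector may use a different baseline or rule.
    """
    if not recent_sessions:
        return False  # no baseline yet, so nothing to compare against
    return current_sessions > factor * mean(recent_sessions)


baseline = [400, 420, 380, 400]   # hypothetical sessions per bucket, mean 400

print(is_burst(baseline, 450))    # False
print(is_burst(baseline, 1300))   # True
```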