At a glance
The share of the cluster’s HTTP connection capacity currently in use, expressed as a percentage. Every client request (search, indexing, health check) arrives over an HTTP connection on the REST layer. When open connections approach the configured ceiling, new clients are refused at the door before any query even runs. This is a leading indicator of a client-side connection storm, a leaking client pool, or a traffic burst the cluster’s front door cannot accept, and it bites well before CPU or heap do.
| API basis | Node HTTP stats, GET /_nodes/stats/http (http.current_open per node), measured against the connection ceiling. The relevant cap is http.max_open where set, otherwise the OS/file-descriptor limit and any load-balancer pool size; http.max_content_length is unrelated. Saturation = current_open / capacity. |
| Metric basis | A ratio, not a raw count. The card takes the busiest node’s open-connection fraction so a single saturated coordinating node is not hidden by a fleet average. |
| Aggregation window | Real-time, evaluated on a 1m rolling basis (RT/1m) so a one-second spike does not flap the gauge. |
| Alert threshold | > 90%. At 90% the cluster is within a hair of refusing connections; the gauge turns red and the on-call SRE is paged. |
| Why a gauge | Saturation is a bounded 0 to 100% value with a clear danger zone, so it renders as a gauge rather than a trend line. The needle in the red band is the signal. |
| What counts | Open HTTP/REST connections on each node’s client-facing HTTP layer, including keep-alive connections held idle by clients. |
| What does NOT count | The inter-node transport layer (port 9300/9301), which carries cluster-internal traffic and is tracked separately, and search/write thread-pool queues (those are downstream of the connection, not the connection itself). |
| Time window | RT/1m (real-time, smoothed over a 1-minute window) |
| Alert trigger | > 90%, the front door is nearly full and new clients will start being refused. |
| Roles | platform, sre, dba |
Calculation
For each node the engine reads http.current_open from GET /_nodes/stats/http and divides it by that node’s effective connection capacity:
saturation % = max over nodes of (http.current_open / connection_capacity) × 100
connection_capacity is the lowest binding ceiling in the path: an explicit http.max_open if configured, otherwise the process file-descriptor limit (often the real cap on Linux), and in front of the cluster the connection-pool size of any load balancer or proxy. The card reports the worst-case node because connection exhaustion is almost always uneven: coordinating nodes and whichever node the load balancer favours saturate first.
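The denominator choice and the worst-node rule described above can be sketched in a few lines. This is a minimal illustration, not the card's actual implementation: the function names, the http_max_open and lb_pool_size parameters, and the sample numbers are assumptions made for the example. Only the response shape (nodes, name, http.current_open) follows the node stats API.

```python
from typing import Mapping, Optional, Tuple


def connection_capacity(fd_limit: int,
                        http_max_open: Optional[int] = None,
                        lb_pool_size: Optional[int] = None) -> int:
    """Return the lowest binding ceiling among the limits that are actually set."""
    candidates = [fd_limit, http_max_open, lb_pool_size]
    return min(c for c in candidates if c is not None)


def worst_node_saturation(nodes_stats: Mapping,
                          capacity_by_node: Mapping[str, int]) -> Tuple[str, float]:
    """Return (node_name, saturation_percent) for the busiest node."""
    per_node = {}
    for node in nodes_stats["nodes"].values():
        name = node["name"]
        per_node[name] = 100.0 * node["http"]["current_open"] / capacity_by_node[name]
    worst = max(per_node, key=per_node.get)
    return worst, per_node[worst]


# Illustrative fragment with made-up numbers, shaped like GET /_nodes/stats/http.
sample = {
    "nodes": {
        "a1": {"name": "node-a", "http": {"current_open": 450, "total_opened": 9000}},
        "b2": {"name": "node-b", "http": {"current_open": 120, "total_opened": 4000}},
    }
}
cap = connection_capacity(fd_limit=65536, lb_pool_size=500)  # hypothetical LB pool
name, pct = worst_node_saturation(sample, {"node-a": cap, "node-b": cap})
print(f"{name}: {pct:.1f}%")  # node-a: 90.0%
```

Taking the minimum of the configured ceilings keeps the percentage tied to whichever limit would actually refuse a client first.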
A 1-minute smoothing window is applied before the gauge updates so that brief connection churn (a deploy that briefly opens and closes pools) does not flap the needle into the red. The > 90% alert is deliberately set below 100% because at full saturation the symptom is already user-visible: clients receive connection-refused or timeout errors rather than slow responses, which is harder to diagnose than a gauge that warned you at 90%.
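To see why smoothing stops a short spike from paging anyone, here is a small sketch. The text does not say which smoothing function the card uses, so a plain rolling mean over 60 seconds is assumed; the SmoothedGauge class and the 10-second sampling interval are hypothetical.

```python
from collections import deque
from typing import Deque, Tuple


class SmoothedGauge:
    """Rolling mean of saturation samples over a fixed time window."""

    def __init__(self, window_seconds: float = 60.0, threshold: float = 90.0):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._samples: Deque[Tuple[float, float]] = deque()  # (timestamp, percent)

    def add(self, timestamp: float, percent: float) -> float:
        """Record a sample, drop samples older than the window, return the mean."""
        self._samples.append((timestamp, percent))
        cutoff = timestamp - self.window_seconds
        while self._samples[0][0] <= cutoff:
            self._samples.popleft()
        return sum(p for _, p in self._samples) / len(self._samples)

    def is_alerting(self, smoothed_percent: float) -> bool:
        """The alert condition is strictly greater than the threshold (> 90%)."""
        return smoothed_percent > self.threshold


# One sample every 10 s: a single spike to 98% among steady 40% readings.
gauge = SmoothedGauge()
readings = [40, 40, 40, 98, 40, 40]
smoothed = [gauge.add(t * 10.0, r) for t, r in enumerate(readings)]
spike_alerts = any(gauge.is_alerting(s) for s in smoothed)

# A sustained 95% reading keeps the rolling mean above the threshold.
sustained = SmoothedGauge()
for t in range(12):
    last = sustained.add(t * 10.0, 95.0)
sustained_alerts = sustained.is_alerting(last)

print(spike_alerts, sustained_alerts)  # False True
```

The isolated spike lifts the mean to 54.5% at most, well short of the red band, while the sustained reading alerts as intended.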
Worked example
A platform team runs a 4-node Elasticsearch cluster behind an application that powers on-site search for a homeware retailer. The connection ceiling per node is the OS file-descriptor limit of 65,536, but the application’s HTTP client pool is sized at 200 connections per app instance across 30 app instances, so 6,000 client connections is the realistic working maximum. On 22 May 26 at 19:40, during an evening promo, the HTTP Connection Saturation gauge climbs from a steady 35% to 93% and trips red. Pulling GET /_nodes/stats/http:
| node | http.current_open | role |
|---|---|---|
| es-coord-1 | 5,580 | coordinating (LB-favoured) |
| es-data-1 | 410 | data |
| es-data-2 | 405 | data |
| es-data-3 | 398 | data |
es-coord-1 alone is holding 5,580 of the app’s 6,000-connection budget. The load balancer is pinning almost all client traffic to one coordinating node instead of spreading it.
As an immediate mitigation, the team removes es-coord-1 from the load-balancer pool for 30 seconds so connections redistribute, dropping the gauge to 58%. Structurally, they switch the LB algorithm to true least-connections and set the application client’s idle-connection TTL to 60 seconds so leaked keep-alives are reclaimed. By 19:55 the gauge sits at a healthy 41% and is evenly spread across all four nodes.
Three takeaways:
- Saturation is a front-door metric, not a workload metric. The cluster had ample CPU and heap throughout. The failure was purely about accepting connections, which is exactly why this card pages before the resource cards do.
- The worst-case node is the truth. A fleet average of (5,580+410+405+398)/4 ≈ 1,698 would have looked calm. Reporting the busiest node exposed the lopsided load balancer.
- Connection refusal is a worse user experience than slowness. A saturated front door returns hard errors, which shoppers read as “broken”, whereas a slow query at least returns results. Catching it at 90% buys time to redistribute before any client is refused.
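The arithmetic behind the worked example can be reproduced directly from the table. The sketch below assumes the 6,000-connection working maximum is the denominator for both figures; expressing the fleet average against that same budget is an illustration, not something the card itself reports.

```python
# http.current_open per node, copied from the worked-example table.
current_open = {
    "es-coord-1": 5580,
    "es-data-1": 410,
    "es-data-2": 405,
    "es-data-3": 398,
}
working_capacity = 200 * 30  # 200 connections per app instance x 30 instances

worst_node = max(current_open, key=current_open.get)
worst_pct = 100 * current_open[worst_node] / working_capacity
fleet_average = sum(current_open.values()) / len(current_open)
average_pct = 100 * fleet_average / working_capacity

print(f"{worst_node}: {worst_pct:.0f}%")  # es-coord-1: 93%
print(f"fleet average: {fleet_average:.0f} connections ({average_pct:.0f}%)")
# fleet average: 1698 connections (28%)
```

An average-based gauge would have read about 28% while one node sat at 93%, which is the gap the worst-node rule is there to close.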
Sibling cards
| Card | Why pair it with HTTP Connection Saturation | What the combination tells you |
|---|---|---|
| HTTP Connections In Use | The raw count behind the percentage. | The gauge tells you “how full”; the count tells you “which node and how many” so you can act. |
| Search Queries per Second (live) | The traffic that opens the connections. | Rising QPS with rising saturation is a real burst; flat QPS with rising saturation is a leaking client pool. |
| Search Error Rate % | The downstream symptom once the door is full. | Saturation at 100% plus a spiking error rate equals connection-refused errors reaching clients. |
| Search Latency p95 (ms) | The other thing clients feel under load. | High saturation with high p95 means the cluster is both full at the door and slow inside. |
| JVM Heap Used % | Rules in or out a resource cause. | High saturation with calm heap confirms a connection problem, not a workload one. |
| Circuit Breaker Trips (24h) | The cluster’s own overload defence. | Saturation plus breaker trips means the cluster is shedding load to protect itself. |
| ES Search Pool Saturation vs Ecom Burst | The cross-channel framing against storefront traffic. | Correlates this gauge with a live ecommerce traffic spike to size revenue risk. |
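The pairing with Search Queries per Second can be turned into a rough triage rule. This is a sketch under assumptions: the function name, the use of percentage changes over a shared period, and the 5% "flat" band are all illustrative choices, not values defined by either card.

```python
def classify_saturation_rise(saturation_delta_pct: float,
                             qps_delta_pct: float,
                             flat_band_pct: float = 5.0) -> str:
    """Label a rise in saturation using the change in search QPS over the same period.

    flat_band_pct is an illustrative tolerance, not a value from the card.
    """
    if saturation_delta_pct <= flat_band_pct:
        return "no significant rise"
    if qps_delta_pct > flat_band_pct:
        return "traffic burst"       # QPS and saturation rising together
    return "possible client leak"    # saturation rising while QPS is flat


print(classify_saturation_rise(saturation_delta_pct=40.0, qps_delta_pct=55.0))
# traffic burst
print(classify_saturation_rise(saturation_delta_pct=40.0, qps_delta_pct=1.0))
# possible client leak
```

A label like this is only a starting point for investigation; the client pool and load-balancer behaviour still need checking before acting on it.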
Reconciling against the source
Where to look in Elasticsearch itself:
- GET /_nodes/stats/http returns http.current_open and http.total_opened per node; this is the exact source. The cat equivalent for a quick scan is GET /_cat/nodes?v&h=name,http.current_open.
- GET /_nodes/_all/settings?filter_path=**.http confirms any configured http.max_open and related HTTP settings so you know the denominator.
- On the host, ss -s or lsof -p <es_pid> | wc -l shows the OS-level socket and file-descriptor count, and cat /proc/<es_pid>/limits shows the file-descriptor ceiling that is often the real cap.
Why our number may legitimately differ from a manual reading:
| Reason | Direction | Why |
|---|---|---|
| Denominator choice | Either | The card uses the lowest binding ceiling (LB pool, http.max_open, or FD limit). If you compute the percentage against a different ceiling, your number will differ. |
| Worst-node vs average | Card higher | We report the busiest node; a fleet average looks calmer when load is uneven. |
| 1-minute smoothing | Card steadier | A raw current_open you catch mid-spike can read higher than the smoothed gauge. |
| Load balancer in front | Either | A proxy or LB terminates and re-opens connections, so the cluster’s current_open may not match what you see at the edge. Check both layers. |
| Managed service limits | Either | Elastic Cloud and AWS-managed offerings impose their own per-tier connection limits that may be lower than the node FD limit. |
How this card should line up with its sibling cards:
| Card | Expected relationship | What causes divergence |
|---|---|---|
| Search Queries per Second (live) | Saturation should track QPS during genuine bursts. | Saturation rising while QPS is flat is the classic signature of a client-side connection leak. |
| Search Error Rate % | Errors should stay near zero until saturation nears 100%. | Errors climbing well below 100% saturation points at a different cause (query failures, mapping issues). |
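When reconciling by hand against the file-descriptor ceiling, the "Max open files" line of /proc/<es_pid>/limits is the value to read. The sketch below parses that line from the file's text; the function name and the trimmed sample are illustrative. Keep in mind that the descriptor limit covers every open file and socket in the process, not only HTTP connections, so a ratio against it is an approximation.

```python
from typing import Optional


def soft_open_files_limit(limits_text: str) -> Optional[int]:
    """Extract the soft 'Max open files' value from /proc/<pid>/limits text.

    Returns None if the line is missing or the limit is 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft = line.split()[3]  # columns: Max open files <soft> <hard> <units>
            return None if soft == "unlimited" else int(soft)
    return None


# Trimmed example of the file's layout. On a host you would instead read
# open(f"/proc/{pid}/limits").read() for the Elasticsearch process.
example = """\
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max open files            65536                65536                files
"""
fd_limit = soft_open_files_limit(example)
open_connections = 5580  # e.g. http.current_open read from the node stats
print(f"{open_connections} of {fd_limit} ({100 * open_connections / fd_limit:.1f}%)")
# 5580 of 65536 (8.5%)
```

Against the raw descriptor limit the same node looks nearly empty, which shows how much the reading depends on the denominator chosen.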
Known limitations / FAQs
The gauge is at 92% but CPU and heap are low. Is that a problem?
Yes, and it is exactly the problem this card exists to catch. Connection saturation is independent of workload: the cluster can be nearly idle internally yet unable to accept new clients because the connection slots are full (often from leaked keep-alive connections). At 100% new clients are refused outright. Treat a red gauge as urgent even when the resource cards look calm.
Why does the card show the busiest node instead of an average?
Because connection exhaustion is almost always uneven. Coordinating nodes and whichever node a load balancer favours saturate first while the rest sit idle. A fleet average would hide a single node at 100% behind three nodes at 10%. We report the worst-case node so the gauge fires when any single front door is about to refuse clients.
What is the difference between this and the inter-node transport layer?
HTTP/REST connections (the ones this card tracks) are how external clients talk to the cluster, typically on port 9200. The transport layer (port 9300) carries cluster-internal traffic between nodes: shard data, cluster-state publishing, search fan-out. Saturating the HTTP layer refuses clients; saturating the transport layer degrades the cluster internally. They are separate ceilings and separate problems.
Saturation keeps creeping up over days even though traffic is flat. Why?
That is the signature of a client-side connection leak: an application HTTP client that opens keep-alive connections but never trims idle ones, so the open count ratchets upward until it hits the ceiling. Fix it on the client by setting a sane idle-connection TTL and a bounded pool size, and confirm with http.total_opened rising far faster than expected for the traffic. Restarting the offending app instance is the quick mitigation.
Can I just raise the connection limit to make the alert go away?
Raising http.max_open or the OS file-descriptor limit treats the symptom, not the cause, and on a leak it only delays exhaustion. Raise the ceiling only when you have confirmed legitimate growth in concurrent clients. For a leak, fix the client pool. For uneven load, fix the load balancer. The limit should reflect real, healthy demand plus headroom, not be inflated to silence a warning.
Does a managed service (Elastic Cloud, AWS) change how I read this?
The metric means the same thing, but the binding ceiling may be the provider’s per-tier connection limit rather than the node file-descriptor limit, and that limit can be lower than you expect. On managed tiers, check the provider’s documented connection cap for your instance size and treat that as the denominator. Scaling up an instance class is sometimes the only way to raise the limit on a managed plan.
The gauge is red but no clients are reporting errors. False alarm?
Not necessarily. The 90% alert is intentionally early so you can act before the door fills. At 90% you still have ~10% headroom, so clients are not yet refused; the gauge is warning you that one more traffic step would tip it over. Use the window to redistribute load or trim leaked connections rather than waiting for the first connection-refused error.