Search Error Rate Spike (>1% in 5m), Elasticsearch

Card class: Hero • Category: Nerve Centre

At a glance

An alert card that fires when the share of search requests returning an error climbs above 1% and stays there for 5 minutes. This is the storefront-facing failure signal: when search errors spike, real users are getting empty or broken search result pages right now. A 1% error rate sounds small, but on a busy catalogue it means hundreds of failed searches an hour, and search is often the highest-intent path on an ecommerce site. The card answers one question directly: are shoppers’ searches failing, and how badly?


What it tracks	The proportion of search requests that fail (return a shard-failure, a timeout, a circuit-breaking rejection, or an HTTP 5xx/429 from the search endpoint) out of all search requests in the window.
Data source	Search totals and failure counts derived from `indices.search.query_total` and shard-failure / rejected counters in `GET /_nodes/stats/indices/search` and `GET /_nodes/stats/thread_pool`, cross-checked against search-thread-pool `rejected`. Detail: “Alerts for Search Error Rate Spike (>1% in 5m).”
Time window	`5m`. The error rate is computed over a rolling 5-minute window and must stay above threshold for the duration.
Alert trigger	`error rate > 1% sustained 5m`. The sustain avoids paging on a single bad second; a real degradation holds above 1%.
What counts as an error	Shard failures (a shard returned an error or was unavailable), search-thread-pool rejections (429), circuit-breaking exceptions, query timeouts, and 5xx responses from the search path.
What does NOT count	Zero-result searches (a query that legitimately matches nothing is a success, not an error), and slow-but-successful queries (those belong to the latency and slow-query cards).
Roles	platform, SRE, DBA, on-call

Calculation

The error rate is failed searches divided by total searches over the rolling 5-minute window, expressed as a percentage:

search_error_rate (%) = (failed_search_requests / total_search_requests) * 100
                        over a rolling 5-minute window

total_search_requests is the delta of indices.search.query_total across the window. failed_search_requests is the sum of:

Shard failures: a request where one or more shards returned an error or were unavailable (the _shards.failed count in responses, reflected in node stats). On a red or recovering cluster these climb because the missing shard cannot answer.
Thread-pool rejections: the delta of thread_pool.search.rejected. When the search queue is full, Elasticsearch returns 429 and the request is rejected, not queued forever.
Circuit-breaking exceptions: a request rejected because answering it would exceed a memory breaker limit.
Timeouts and 5xx: requests that exceeded their timeout or failed at the HTTP layer.
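The ratio above can be sketched in a few lines. This is a minimal illustration, not the engine's implementation: the `SearchCounters` class and its field names are hypothetical, and the sample values are chosen to reproduce the figures in the worked example on this card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchCounters:
    """Cumulative counters sampled at one instant (hypothetical field names)."""
    query_total: int         # indices.search.query_total
    shard_failures: int
    pool_rejections: int     # thread_pool.search.rejected
    breaker_exceptions: int
    timeouts_and_5xx: int

    def failed(self) -> int:
        # Sum of the four failure components listed above.
        return (self.shard_failures + self.pool_rejections
                + self.breaker_exceptions + self.timeouts_and_5xx)


def search_error_rate(start: SearchCounters, end: SearchCounters) -> float:
    """Error rate (%) over the window between two cumulative samples."""
    total = end.query_total - start.query_total
    failed = end.failed() - start.failed()
    if total <= 0:
        # No traffic, or a counter reset: treated here as "no signal" (an assumption).
        return 0.0
    return failed / total * 100


# Two samples taken 5 minutes apart (illustrative values).
window_start = SearchCounters(1_000_000, 5, 40, 0, 3)
window_end = SearchCounters(1_012_400, 5, 400, 0, 15)
rate = search_error_rate(window_start, window_end)
print(f"{rate:.1f}%")
```

Both numerator and denominator are deltas of cumulative counters, so the raw values in either sample mean little on their own; only the difference across the window matters.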

The engine evaluates the ratio on each poll, starts a timer when it first crosses 1%, and fires only if the rate is still above 1% after 5 continuous minutes. If the rate falls back under 1% inside the window, the timer resets. This is a sustained-degradation alert, not a single-blip alert, because search error rates are naturally noisy at the per-second level.
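The sustain rule can be sketched as a small state machine. This is an assumption-laden illustration rather than the engine's actual code: the `SustainedAlert` class, the 60-second poll interval, and the sample rates are all hypothetical.

```python
from typing import Optional

THRESHOLD_PCT = 1.0
SUSTAIN_SECONDS = 5 * 60


class SustainedAlert:
    """Fires only after the rate has stayed above threshold for the sustain period."""

    def __init__(self, threshold: float = THRESHOLD_PCT,
                 sustain: float = SUSTAIN_SECONDS) -> None:
        self.threshold = threshold
        self.sustain = sustain
        self.breach_started: Optional[float] = None

    def observe(self, now: float, rate_pct: float) -> bool:
        """Feed one poll (timestamp in seconds, rate in %); return True to fire."""
        if rate_pct <= self.threshold:
            self.breach_started = None    # fell back under 1%: reset the timer
            return False
        if self.breach_started is None:
            self.breach_started = now     # first poll above threshold
        return now - self.breach_started >= self.sustain


# One poll per minute; the dip to 0.8% at t=180s resets the timer.
rates = [0.4, 2.0, 3.1, 0.8, 2.5, 3.0, 3.2, 2.9, 3.0, 3.1]
alert = SustainedAlert()
fired = [alert.observe(now=i * 60, rate_pct=r) for i, r in enumerate(rates)]
```

In this run the alert fires only on the last poll, 5 minutes after the second breach began at t=240s; the earlier two-minute breach never pages because the dip cleared the timer.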

Worked example

A platform team runs Elasticsearch behind the search bar of a fashion retailer doing roughly 40 searches per second at peak. Snapshot taken on 19 Apr 26 at 12:05 BST, during the lunchtime traffic peak. A marketing email went out at 12:00 driving a burst of traffic. The search thread pool, sized for normal load, started rejecting requests once its queue filled. Over the 5-minute window the engine sees:

Window metric	Value
Total search requests	12,400
Shard failures	0
Search-pool rejections (429)	360
Circuit-breaking exceptions	0
Timeouts / 5xx	12
Failed total	372
Error rate	3.0%

The Nerve Centre headline reads Search Error Rate 3.0%, sustained 6m, 360 pool rejections, outlined in red, and the on-call engineer is paged. The card tells the story cleanly:

Shoppers are getting failed searches right now. 372 failures across the window means roughly 1.2 failed searches every second during the email burst. Each one is a shopper who searched and got an error page instead of products, on the highest-intent path on the site.
The cause is rejections, not bad data. Shard failures are zero, so the cluster is healthy and the data is intact. The 360 rejections are the search thread pool saying “my queue is full”. Pair with HTTP Connection Saturation %, which will be high during the burst, and with Search Queries per Second (live), which spiked at 12:00.
This is a capacity-vs-burst problem, not a correctness problem. The fix is not to debug a query; it is to absorb the burst: scale out search capacity, add a coordinating node, or put a short client-side retry-with-backoff in front of search so a rejected request is retried a moment later when the queue drains.
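A client-side retry-with-backoff might look like the sketch below. Everything here is illustrative: `SearchRejected` is a hypothetical exception standing in for however your client surfaces a 429, and the attempt count and delays are placeholder values, not recommendations.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class SearchRejected(Exception):
    """Hypothetical error raised by a client when search returns HTTP 429."""


def search_with_backoff(do_search: Callable[[], T], max_attempts: int = 3,
                        base_delay: float = 0.05, max_delay: float = 0.5,
                        sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry a rejected search with exponential backoff and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return do_search()
        except SearchRejected:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the rejection to the caller
            # Wait a random time up to an exponentially growing, capped ceiling.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")


# Demo: a fake search that is rejected twice, then succeeds.
calls = {"n": 0}


def flaky_search() -> List[str]:
    calls["n"] += 1
    if calls["n"] < 3:
        raise SearchRejected()
    return ["dress", "jacket"]


waits: List[float] = []
result = search_with_backoff(flaky_search, sleep=waits.append)
```

The jitter spreads retries out so that clients rejected in the same instant do not all return at once and refill the queue. Each rejected attempt still counts as a failure on this card, so a small attempt cap keeps the error rate honest.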

Reading the failure mix to find the cause:
  - Shard failures dominate    -> cluster health problem (check Cluster Status, Unassigned Shards)
  - Pool rejections dominate   -> capacity vs burst (check Saturation, QPS)
  - Circuit-breaking dominates  -> memory pressure (check JVM Heap, Breaker Trips)
  - Timeouts dominate          -> slow queries under load (check Slow-Query Rate, p99 latency)
This window: rejections dominate -> burst capacity, not correctness.
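The same triage can be expressed as a lookup on the largest component of the failure mix. This is a rough sketch with hypothetical key names; a real breakdown might also need a rule for near-ties between two components.

```python
from typing import Mapping, Tuple

# Diagnosis text mirrors the reading guide above; the keys are illustrative names.
DIAGNOSIS = {
    "shard_failures": "cluster health problem (check Cluster Status, Unassigned Shards)",
    "pool_rejections": "capacity vs burst (check Saturation, QPS)",
    "breaker_exceptions": "memory pressure (check JVM Heap, Breaker Trips)",
    "timeouts": "slow queries under load (check Slow-Query Rate, p99 latency)",
}


def diagnose(mix: Mapping[str, int]) -> Tuple[str, float]:
    """Return the diagnosis for the dominant failure type and its share of failures."""
    total = sum(mix.values())
    if total == 0:
        return ("no failures in window", 0.0)
    top = max(mix, key=lambda kind: mix[kind])
    return (DIAGNOSIS[top], mix[top] / total)


# Failure mix from the worked example window.
window_mix = {
    "shard_failures": 0,
    "pool_rejections": 360,
    "breaker_exceptions": 0,
    "timeouts": 12,
}
cause, share = diagnose(window_mix)
print(f"{cause} ({share:.0%} of failures)")
```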

The actionable lesson: a search error spike is always real user pain, but the failure mix tells you which lever to pull. The same 3% headline could be a dying cluster, a memory leak, or simply too much traffic, and the breakdown distinguishes them in seconds.

Sibling cards

Card	Why pair it with this alert	What the combination tells you
Search Error Rate %	The always-on KPI gauge behind this alert.	The gauge shows the steady-state rate; this alert is the paging wrapper at >1% sustained.
Search Latency p95 (ms)	Errors and latency often spike together under load.	High latency with low errors means slow-but-working; high errors means actually failing.
Search Latency p99 (ms)	The tail that crosses into timeouts.	p99 above the timeout ceiling means the slowest queries are tipping into the error count.
HTTP Connection Saturation %	Saturation drives thread-pool rejections.	High saturation plus error spike means the burst is exhausting the search pool.
JVM Heap Used %	Memory pressure causes circuit-breaking errors.	High heap plus circuit-breaking exceptions in the failure mix means errors are memory-driven.
Cluster Status (green / yellow / red)	A red cluster produces shard-failure errors.	Non-green plus shard failures means the errors come from missing shards, not capacity.
Search Queries per Second (live)	Traffic context for the error rate.	An error spike that tracks a QPS spike is load-driven; an error spike with flat QPS is a fault.

Reconciling against the source

Where to look in Elasticsearch’s own tooling:

`GET /_nodes/stats/indices/search` returns `query_total` and `query_time_in_millis`; the deltas of `query_total` give the denominator. `GET /_nodes/stats/thread_pool` returns the search pool's `rejected`, `queue`, and `completed` counters, the main source of capacity-driven errors. `GET /_cat/thread_pool/search?v&h=node_name,active,queue,rejected` gives a quick per-node rejection table. Per request, the `_shards.failed` field in a search response and the `failures` array name the exact shard errors. The node logs record `CircuitBreakingException` and `EsRejectedExecutionException` entries, and the search slow log shows the slow queries that tip into timeouts.
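For a manual reconciliation, the per-node counters from the two stats calls need to be summed into cluster-wide totals. The sketch below assumes the responses have already been fetched and parsed into dictionaries; the node IDs and counter values are made up, and the payloads are trimmed to the fields discussed here.

```python
from typing import Any, Dict, Mapping


def sum_search_counters(indices_stats: Mapping[str, Any],
                        thread_pool_stats: Mapping[str, Any]) -> Dict[str, int]:
    """Sum per-node search counters from the two node-stats responses."""
    query_total = sum(
        node["indices"]["search"]["query_total"]
        for node in indices_stats["nodes"].values()
    )
    rejected = sum(
        node["thread_pool"]["search"]["rejected"]
        for node in thread_pool_stats["nodes"].values()
    )
    return {"query_total": query_total, "rejected": rejected}


# Trimmed sample of GET /_nodes/stats/indices/search (made-up values).
indices_sample = {"nodes": {
    "node-a": {"indices": {"search": {"query_total": 700_000,
                                      "query_time_in_millis": 91_000}}},
    "node-b": {"indices": {"search": {"query_total": 312_400,
                                      "query_time_in_millis": 40_500}}},
}}
# Trimmed sample of GET /_nodes/stats/thread_pool (made-up values).
pool_sample = {"nodes": {
    "node-a": {"thread_pool": {"search": {"queue": 0, "rejected": 250,
                                          "completed": 699_750}}},
    "node-b": {"thread_pool": {"search": {"queue": 3, "rejected": 150,
                                          "completed": 312_250}}},
}}
totals = sum_search_counters(indices_sample, pool_sample)
```

These totals are lifetime counters, so two such samples taken 5 minutes apart are needed before anything comparable to the card's rate can be computed.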

On a managed service, Amazon OpenSearch Service exposes SearchRate, ThreadpoolSearchRejected, and 5xx request metrics in CloudWatch; Elastic Cloud surfaces search throughput and rejection counts in the deployment monitoring view. The managed ThreadpoolSearchRejected metric maps directly to the rejection component of this card.

Why our number may legitimately differ from a manual stats call:

Reason	Direction	Why
Counter vs rate	Card shows the delta	The stats counters (`query_total`, `rejected`) are lifetime cumulative; the card reports the increase over the 5-minute window, a much smaller number than the raw counter.
Sustain window	Card lags a raw read	A single poll can show a momentary 5% rate; the card only pages after >1% holds for 5 minutes, so a brief blip shows in the API but not as an alert.
Failure-mix scope	Card may read higher	The card sums shard failures, rejections, breaker exceptions, timeouts and 5xx; a single stats endpoint shows only one of those, so a single-source manual check can undercount.
Zero-result handling	Card reads lower than a naive count	Searches that legitimately match nothing are successes here; a crude “non-200 or empty” count would wrongly inflate the rate.
Managed-console window	Console can differ	CloudWatch aggregates per period (often 1 minute); aligning that window to the card’s rolling 5-minute view is needed to reconcile.
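To line a per-minute console export up with the card's rolling view, the 1-minute buckets can be re-aggregated into 5-minute windows. The sketch below assumes each bucket is a `(failed, total)` pair you have exported yourself; the function name and sample values are illustrative.

```python
from collections import deque
from typing import Deque, Iterable, List, Tuple

Bucket = Tuple[int, int]  # (failed, total) for one 1-minute period


def rolling_error_rate(buckets: Iterable[Bucket], window: int = 5) -> List[float]:
    """Error rate (%) for each full rolling window of 1-minute buckets."""
    recent: Deque[Bucket] = deque(maxlen=window)
    rates: List[float] = []
    for bucket in buckets:
        recent.append(bucket)
        if len(recent) == window:
            failed = sum(f for f, _ in recent)
            total = sum(t for _, t in recent)
            # Ratio of sums, not an average of per-minute percentages.
            rates.append(failed / total * 100 if total else 0.0)
    return rates


# Six 1-minute buckets (made-up values) give two rolling 5-minute windows.
per_minute = [(2, 2400), (60, 2500), (90, 2500), (80, 2480), (76, 2490), (66, 2430)]
rates = rolling_error_rate(per_minute)
peak_minute = max(f / t * 100 for f, t in per_minute)
```

Here the worst single minute reads 3.6% while the latest rolling window reads 3.0%, which is the kind of gap the table above describes. Summing failures and totals before dividing also weights busy minutes correctly, which averaging the per-minute percentages would not.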

Known limitations / FAQs

Do zero-result searches count as errors?
No. A search that matches nothing is a successful request that returned an empty result set; the shopper may be disappointed but Elasticsearch did its job. Only genuine failures (shard failures, rejections, breaker exceptions, timeouts, 5xx) count toward the error rate. If you want to track empty-result searches as a relevance problem, that is a separate concern from this operational alert.

Why is 1% the threshold? That seems strict.
Search is the highest-intent path on most stores, so even a small failure rate is disproportionately costly: a shopper who searches has already decided to buy something. On a catalogue doing 40 searches per second, 1% is roughly 1,440 failed searches an hour. The 1% line is deliberately tight because every failed search is a near-miss conversion. You can adjust the threshold per profile if your baseline is genuinely noisier.

The error rate spiked but the cluster is green. How can that be?
A green cluster means all shards are allocated, but it says nothing about capacity. The most common green-cluster error spike is thread-pool rejection: a traffic burst fills the search queue and Elasticsearch returns 429 to protect itself. The data is fine; you simply ran out of capacity to serve the burst. Check HTTP Connection Saturation % and the search-pool rejected counter.

How do I tell a capacity problem from a correctness problem?
Read the failure mix. Rejections (429) dominating means capacity versus burst, scale out or add retry-with-backoff. Shard failures dominating means a cluster-health problem, check Cluster Status and Unassigned Shards. Circuit-breaking exceptions dominating means memory pressure, check JVM Heap Used %. The headline is the same 1%+, but the cause and fix differ entirely.

My client retries failed searches automatically. Does that inflate the error rate?
It can. If the client retries a rejected request, each attempt is a separate request and a rejected attempt counts as a failure, so an aggressive retry loop can multiply the apparent error count while the user eventually succeeds. This is usually still a true signal (the cluster genuinely could not serve the first attempt), but if you see a much higher error rate than your users report as broken, an over-eager retry policy is the likely reason. Retry-with-backoff (not immediate retry) keeps the signal honest and is gentler on the cluster.

Does this card cover indexing errors too, or only search?
Only search. Indexing failures and bulk rejections are tracked separately by Bulk Rejections (24h) because they have different causes (write backpressure) and different consequences (data not yet searchable, rather than a user seeing a broken result page). Keeping read and write errors on separate cards avoids conflating a sync-pipeline problem with a storefront-search problem.

The rate is just under 1% and never quite pages, but searches are clearly failing. What do I do?
Lower the threshold for your profile in the Sensitivity tab. The 1% default suits a typical store, but if your baseline error rate is normally near zero, a sustained 0.5% is already abnormal for you and worth paging on. Tune the threshold to your own baseline rather than the generic default, and pair it with the Search Error Rate % gauge to watch the sub-threshold trend.

Tracked live in Vortex IQ Nerve Centre

Search Error Rate Spike (>1% in 5m) is one of hundreds of KPI pulses Vortex IQ tracks across Elasticsearch and 70+ other ecommerce connectors. Nerve Centre runs the detection layer; Vortex Mind investigates the cause when something moves; Ask Viq lets you interrogate any number in plain English. Start for free or book a demo to see this metric running on your own data.