At a glance
The share of search requests that fail, expressed as a percentage of total search requests over the window. A failed search is one that returns an error (non-2xx) or completes with shard failures rather than a clean result set. This is the most direct measure of “is search working for users right now”. Unlike latency, which degrades gracefully, an error rate spike is binary from the user’s point of view: their query returned nothing usable. For a storefront, a climbing search error rate maps straight onto shoppers who cannot find products.
| API basis | Search counters from GET /_nodes/stats/indices/search and per-request shard-failure data. Errors are counted from failed search requests plus searches that completed with _shards.failed > 0 (partial results); total is search.query_total delta over the window. |
| Metric basis | A ratio: failed searches divided by total searches in the window, as a percentage. Both hard failures (rejected, timed out, malformed) and partial failures (some shards failed) are counted. |
| Aggregation window | 5m rolling, so a brief blip self-clears while a sustained problem is caught quickly. |
| Alert threshold | > 1%. Above 1% of searches failing, a meaningful slice of users is affected and the gauge trips red. |
| Why a gauge | The value is a bounded percentage with a clear danger band, so it renders as a gauge; the needle crossing into red is the page-worthy signal. |
| What counts | HTTP non-2xx search responses, thread-pool rejections on the search pool, query-phase and fetch-phase failures, timeouts, and searches returning _shards.failed > 0. |
| What does NOT count | Indexing/write errors (a separate pipeline, see Bulk Rejections), client-side network failures that never reached the cluster, and zero-result searches (a successful query that simply matched nothing is not an error). |
| Time window | 5m (rolling) |
| Alert trigger | > 1%. More than one search in a hundred failing is user-visible breakage. |
| Roles | platform, sre, dba |
Calculation
The card computes a delta ratio over the five-minute window: failed searches (hard failures plus partial failures) divided by total searches, expressed as a percentage. Hard failures are searches that returned an error: a search thread-pool rejection (es_rejected_execution_exception), a timeout, or a malformed-query 4xx. Partial failures are searches that returned a 200 but with _shards.failed > 0, meaning some shards could not respond and the result set is incomplete; from a user’s perspective this is silent data loss in the results and is treated as an error here. Counting both is important because partial failures are insidious: the application sees a 200 and renders results, but those results are missing whatever the failed shards held.
The 5m window balances responsiveness against noise: a single transient failure in a low-traffic minute will not trip the gauge, but a genuine spike that affects a sustained fraction of traffic shows up within minutes. The > 1% threshold reflects that search is a primary user journey: even a low single-digit error rate means a noticeable cohort of users got a broken experience.
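A minimal sketch of the ratio described above, assuming two snapshots of cumulative counters taken five minutes apart. The `SearchCounters` type and the `hard_failures` and `partial_failures` fields are hypothetical names chosen for illustration; only `query_total` corresponds to a counter named in this document, and the card's actual implementation may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchCounters:
    """Hypothetical snapshot of cumulative counters at one instant."""
    query_total: int       # total searches (source of the denominator)
    hard_failures: int     # assumed counter: non-2xx, rejections, timeouts
    partial_failures: int  # assumed counter: 200s with _shards.failed > 0


def search_error_rate_pct(start: SearchCounters, end: SearchCounters) -> float:
    """Failed searches as a percentage of total searches between two snapshots."""
    total = end.query_total - start.query_total
    if total <= 0:
        # No traffic in the window (or a counter reset): avoid dividing by zero.
        return 0.0
    failed = (end.hard_failures - start.hard_failures) + (
        end.partial_failures - start.partial_failures
    )
    return 100.0 * failed / total


ALERT_THRESHOLD_PCT = 1.0  # default threshold for this card


def is_alerting(rate_pct: float) -> bool:
    """True when the rate is strictly above the alert threshold."""
    return rate_pct > ALERT_THRESHOLD_PCT
```

With 1,000 searches in the window, 25 hard failures and 12 partial failures, the sketch yields 3.7%, which is above the 1% threshold.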
Worked example
A platform team runs an Elasticsearch cluster behind the search bar of a fashion retailer. Baseline search error rate is ~0.02% (the odd malformed query from a bot). On 09 Apr 26 at 12:50, during the lunchtime traffic peak, the Search Error Rate gauge climbs to 3.7% and trips red.
Breaking down the failures from GET /_nodes/stats/indices/search and the cluster’s error logs:
| failure type | share of failures | signature |
|---|---|---|
| Search thread-pool rejection | 71% | es_rejected_execution_exception, search queue full |
| Partial shard failure | 24% | 200 responses with _shards.failed: 1 |
| Timeout | 5% | queries exceeding the client’s 1s timeout |
search_as_you_type field (far cheaper), and add an index.search.idle and a sensible client-side timeout-and-retry. By 13:10 the error rate is back to baseline.
- Most search-error spikes are self-inflicted load, not cluster faults. A new feature, an expensive query pattern, or a traffic burst fills the fixed-size search thread pool, and the cluster sheds load by rejecting. The fix is usually on the query/client side, not the cluster.
- Partial shard failures are silent and dangerous. A 200 with _shards.failed > 0 looks fine to the application but returns incomplete results. Counting these in the error rate surfaces a failure mode that latency and HTTP-status monitoring miss.
- Read it alongside latency and saturation. Error rate is the outcome; latency and connection saturation are the leading indicators. A rising p95 that crosses into rejections is the typical path to a search-error spike.
Sibling cards
| Card | Why pair it with Search Error Rate | What the combination tells you |
|---|---|---|
| Search Latency p95 (ms) | The leading indicator before errors begin. | Rising p95 that tips into rejections is the standard route to a search-error spike. |
| Search Latency p99 (ms) | The tail that times out first. | A p99 blowout often becomes the timeout portion of the error rate. |
| Search Queries per Second (live) | The load that fills the search pool. | An error spike that tracks a QPS spike is load-driven; one that does not is a query or cluster fault. |
| HTTP Connection Saturation % | The front door that refuses clients when full. | High saturation plus errors means clients are refused before queries even run. |
| Circuit Breaker Trips (24h) | The memory-protection mechanism that rejects queries. | Breaker trips plus search errors means heavy queries are being rejected to avoid OOM. |
| JVM Heap Used % | Heap pressure causes rejections and breaker trips. | High heap plus search errors points at memory-bound query failures. |
| Slow-Query Rate % | Slow queries precede timeouts and rejections. | A rising slow-query rate is the early warning before errors climb. |
| Slow Searches During Checkout Window (5m) | The cross-channel revenue framing of search failure. | Correlates search errors with the checkout funnel to size revenue impact. |
Reconciling against the source
Where to look in Elasticsearch itself:
- GET /_nodes/stats/indices/search gives query_total and query_time_in_millis; combined with the search thread-pool stats it shows the denominator and the rejection signal.
- GET /_cat/thread_pool/search?v&h=node_name,active,queue,rejected is the fastest way to confirm search thread-pool rejections, the most common error cause; a non-zero and rising rejected column is the smoking gun.
- The cluster logs (or the slow log) capture per-query failures and _shards.failed details; the application or proxy access logs hold the authoritative non-2xx HTTP rate as the client experienced it.
Why our number may legitimately differ from a manual reading:
| Reason | Direction | Why |
|---|---|---|
| Partial-failure counting | Card higher | We count 200 responses with _shards.failed > 0 as errors; a pure HTTP-status check at a proxy would not. |
| Window boundary | Either | The card’s 5-minute delta and your manual snapshot bracket different intervals. |
| Rejection accounting | Either | Thread-pool rejected is a cumulative counter; reading it raw versus as a windowed delta gives different rates. |
| Where errors are measured | Either | The cluster’s view (rejections, shard failures) can differ from the client/proxy view (which also sees network failures the cluster never saw). |
| Managed service abstraction | Either | Elastic Cloud and AWS-managed consoles may present an aggregated request-error metric at their own granularity. |
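The rejection-accounting row above can be illustrated with a short sketch that contrasts a raw reading of a cumulative counter with a windowed delta. The function names are hypothetical and the readings are invented for illustration; the restart handling is an assumption about how a reset counter might be treated, not a statement of how the card does it.

```python
def windowed_delta(previous: int, current: int) -> int:
    """Increase of a cumulative counter between two readings.

    Assumption: if the counter went backwards, the node restarted and the
    counter reset to zero, so the current value is the increase since then.
    """
    if current < previous:
        return current
    return current - previous


def rate_pct(numerator: int, denominator: int) -> float:
    """Percentage, returning 0.0 when there is no traffic."""
    return 100.0 * numerator / denominator if denominator > 0 else 0.0


# Invented readings five minutes apart (not taken from a real cluster).
rejected_then, rejected_now = 4_290, 5_000
searches_then, searches_now = 9_980_000, 10_000_000

# Raw reading: lifetime rejections over lifetime searches.
raw_pct = rate_pct(rejected_now, searches_now)

# Windowed reading: rejections in the window over searches in the window.
windowed_pct = rate_pct(
    windowed_delta(rejected_then, rejected_now),
    windowed_delta(searches_then, searches_now),
)
```

In this made-up case the raw reading is 0.05% while the windowed reading is 3.55%, which is why a manual glance at the cumulative rejected column can disagree sharply with the gauge.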
| Card | Expected relationship | What causes divergence |
|---|---|---|
| Search Latency p95 (ms) | Errors should follow a latency climb under load. | Errors spiking with calm latency points at malformed queries or shard failures, not load. |
| Search Queries per Second (live) | A load-driven error spike tracks a QPS spike. | Errors rising with flat QPS means a query pattern changed or a node degraded, not volume. |
Known limitations / FAQs
Does a zero-result search count as an error?
No. A search that runs successfully and simply matches no documents is a valid result, not a failure; it returns a 200 with an empty hits array. This card counts only searches that errored (non-2xx, rejection, timeout) or completed with _shards.failed > 0. A high zero-result rate is a relevance/merchandising concern, not a reliability one, and is tracked separately.
What is a partial shard failure and why does it count as an error?
When a search fans out to all shards of an index and one or more shards cannot respond (the node is overloaded, the shard is recovering, a circuit breaker tripped), Elasticsearch can still return a 200 with the partial results it did get, flagged by _shards.failed > 0 in the response. The application usually renders those incomplete results as if they were complete, so users silently miss whatever the failed shards held. Because that is a broken result from the user’s perspective, we count it as an error.
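As a sketch of how a client or proxy could apply the same rule, the following classifies a single search response from its HTTP status and parsed JSON body. The function names and labels are hypothetical. Requests that time out on the client side never produce a response, so the caller would need to count those separately.

```python
from typing import Any, Mapping


def is_partial_failure(body: Mapping[str, Any]) -> bool:
    """True when a search response body reports at least one failed shard."""
    return body.get("_shards", {}).get("failed", 0) > 0


def classify_search(status_code: int, body: Mapping[str, Any]) -> str:
    """Label one search response as 'hard_failure', 'partial_failure', or 'ok'."""
    if not 200 <= status_code < 300:
        return "hard_failure"
    if is_partial_failure(body):
        return "partial_failure"
    # Zero hits with every shard successful is a valid result, not an error.
    return "ok"
```

For example, a 200 response whose body contains `"_shards": {"total": 5, "successful": 4, "skipped": 0, "failed": 1}` is labelled `partial_failure` even though a status-only check would pass it.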
My error rate spiked but every failure is es_rejected_execution_exception. What does that mean?
The search thread pool is full and the cluster is shedding load by rejecting new searches to protect itself. The pool is fixed-size (roughly the node’s CPU count times 1.5 plus 1) by design, so the fix is to reduce the load reaching it: debounce or cache client queries, replace expensive query patterns (leading wildcards, deep pagination, huge aggregations) with cheaper equivalents, and add client-side timeouts with backoff. Scaling out data nodes adds search threads if the load is genuinely legitimate.
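To make the pool-size rule concrete, here is a small sketch of the default sizing, using the formula from the Elasticsearch thread pool documentation: int((allocated processors * 3) / 2) + 1. The function name is hypothetical, and a cluster that overrides the default setting will not match this calculation.

```python
def default_search_pool_size(allocated_processors: int) -> int:
    """Default search thread-pool size for a node with the given processor count."""
    if allocated_processors < 1:
        raise ValueError("allocated_processors must be at least 1")
    # Integer division mirrors the int(...) truncation in the documented formula.
    return (allocated_processors * 3) // 2 + 1
```

An 8-processor data node gets 13 search threads under this rule, so adding a second identical node roughly doubles the available search threads without changing any per-node setting.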
Errors are climbing but latency looks fine. How is that possible?
That pattern usually means the failures are not load-driven. Common causes: a deploy shipped a malformed query that 4xxs, a mapping change broke a query against a now-missing field, a specific shard or node is failing (partial failures) while the rest serve fast, or a circuit breaker is rejecting only the heavy queries. Look at the failure-type breakdown rather than assuming a capacity problem.
Can I tune the alert threshold?
Yes, the sensitivity threshold is configurable per profile. The default > 1% suits user-facing storefront search where any meaningful failure cohort matters. A purely internal analytics cluster with retrying batch clients might tolerate a higher threshold. Set it against your own baseline and the user impact of a failed search, not the generic default.
Why count both HTTP errors and shard failures together instead of separately?
Because from the user’s standpoint both produce a broken search experience: a hard error returns nothing, and a partial failure returns incomplete results the user cannot tell are incomplete. A single combined rate is the truest “search is broken for users” signal. For root-cause work you still get the breakdown by failure type; the headline gauge intentionally unifies them so nothing user-visible hides behind a clean HTTP-status number.
A retry on the client masks these errors. Should I still care?
Yes. Client retries can paper over a transient spike for the end user, but the cluster is still rejecting and re-serving requests, which amplifies load (each retry is another query against an already-strained pool) and can turn a small spike into a retry storm. The card measures the cluster’s true error rate before client retries, which is the honest signal of cluster health; a high rate that users do not feel today is a fragility waiting to tip over under more load.
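The amplification effect can be sketched with simple arithmetic, together with one common way to soften it: capped exponential backoff with random jitter. Both functions are illustrative and hypothetical. The arithmetic assumes every attempt fails independently with the same probability, which is a simplification, since failures on a strained pool tend to be correlated.

```python
import random
from typing import List, Optional


def expected_attempts(failure_probability: float, max_retries: int) -> float:
    """Average queries sent per user request when failed attempts are retried.

    Assumes independent attempts: attempt k happens only if the previous k failed.
    """
    return sum(failure_probability ** k for k in range(max_retries + 1))


def backoff_delays(
    max_retries: int,
    base_seconds: float = 0.1,
    cap_seconds: float = 2.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Jittered delays: each is uniform between 0 and a capped exponential bound."""
    rng = rng or random.Random()
    return [
        rng.uniform(0.0, min(cap_seconds, base_seconds * 2 ** attempt))
        for attempt in range(max_retries)
    ]
```

Under these assumptions, a 50% failure probability with up to three retries means each user request costs 1.875 queries on average, nearly doubling load on a pool that is already rejecting. Spreading retries out with jitter keeps clients from retrying in lockstep.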