At a glance
An alert card that fires when the Elasticsearch cluster health status leaves green and stays at yellow or red for a sustained window. This is the single most Elasticsearch-defining signal on the board. green means every primary and every replica shard is allocated. yellow means all primaries are allocated but one or more replicas are missing, so you are running without redundancy. red means at least one primary shard is unallocated, which means data on that shard is unavailable to reads and writes right now. A red cluster is a page-the-on-call event because part of your index is simply not answering.
| What it tracks | The cluster-level status field returned by the Elasticsearch cluster health API, surfaced as an alert when it is not green. The card lists the affected indices, the count of unassigned shards, and the duration the cluster has been off-green. |
| Data source | GET /_cluster/health (and GET /_cluster/health?level=indices for the per-index breakdown). The status field is the canonical traffic-light value Elasticsearch computes from shard allocation state. Detail: “Elasticsearch-distinctive, RED = data unavailable on affected indexes. Page on-call.” |
| Time window | RT (real-time, polled on the standard fast cadence for alert cards). |
| Alert trigger | status in (yellow, red) sustained for 5 minutes. The 5-minute sustain is deliberate: brief yellow flaps during a rolling restart or a routine relocation are normal and should not page anyone. |
| Severity split | yellow is a warning (redundancy lost, data still served). red is critical (data unavailable, page on-call immediately). The card renders red status outlined in red. |
| What does NOT trigger it | A green cluster with high heap, high disk, or slow queries. Those are separate cards. This card is purely the shard-allocation traffic light. |
| Roles | platform, SRE, DBA, on-call |
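The trigger and severity rows above can be read as a small state machine. The following is a minimal sketch of that reading, not the card's actual implementation: the class name `SustainTracker`, the `observe` method, and the choice to treat yellow and red as one continuous off-green window are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

SUSTAIN_SECONDS = 5 * 60  # the 5-minute sustain from the "Alert trigger" row


@dataclass
class SustainTracker:
    """Hypothetical helper: tracks how long the cluster has been off-green."""

    off_green_since: Optional[float] = None

    def observe(self, status: str, now: float) -> Optional[str]:
        """Feed one polled status; return 'warning', 'critical', or None."""
        if status == "green":
            # Any green sample resets the window, so short flaps never alert.
            self.off_green_since = None
            return None
        if self.off_green_since is None:
            self.off_green_since = now
        if now - self.off_green_since < SUSTAIN_SECONDS:
            return None  # still inside the sustain window
        # Severity split: red is critical, yellow is a warning.
        return "critical" if status == "red" else "warning"
```

Feeding it a yellow flap that clears within the window produces no alert, while five continuous minutes of non-green produces `warning` or `critical` depending on the colour at that moment.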
Calculation
Elasticsearch computes the cluster status itself; Vortex IQ does not derive it. The engine polls GET /_cluster/health and reads the status string, which Elasticsearch sets according to these rules:
- green: all primary shards and all replica shards are allocated.
- yellow: all primary shards are allocated, but at least one replica is unallocated. Reads and writes still succeed; you have simply lost redundancy for the affected indices.
- red: at least one primary shard is unallocated. Any document that lives on that shard cannot be read or written until the shard recovers.
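For readers who prefer code to prose, the three rules can be restated as a tiny function. This is purely illustrative: Elasticsearch computes the status itself and the card only reads it, so the function name `cluster_status` and its two counters are hypothetical and exist only to show the precedence of the rules.

```python
def cluster_status(unallocated_primaries: int, unallocated_replicas: int) -> str:
    """Restate the traffic-light rules; not how the card obtains the value."""
    if unallocated_primaries > 0:
        return "red"     # some data cannot be read or written
    if unallocated_replicas > 0:
        return "yellow"  # data is served, redundancy is lost
    return "green"       # every primary and replica is allocated
```

Note the ordering: a missing primary wins over any number of missing replicas, which is why one red index outweighs many yellow ones.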
For the per-index drill, the engine also calls GET /_cluster/health?level=indices and surfaces the indices whose own status is yellow or red, along with unassigned_shards per index. That is what turns “the cluster is red” into “the products-v7 primary is unallocated”.
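A minimal sketch of that per-index filtering, assuming the response has already been fetched and parsed into a dict. The `indices` map with per-index `status` and `unassigned_shards` fields follows the cluster health API response; the function name `non_green_indices` and the red-first ordering are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def non_green_indices(health: Dict) -> List[Tuple[str, str, int]]:
    """Return (index, status, unassigned_shards) for every non-green index."""
    rows = [
        (name, info["status"], info.get("unassigned_shards", 0))
        for name, info in health.get("indices", {}).items()
        if info["status"] != "green"
    ]
    # Red indices first, then yellow, then alphabetical within each colour.
    order = {"red": 0, "yellow": 1}
    return sorted(rows, key=lambda row: (order.get(row[1], 2), row[0]))
```

Sorting red ahead of yellow keeps the index that is actually losing data at the top of the list.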
Worked example
A platform team runs a 6-node Elasticsearch cluster (3 data, 3 master-eligible) backing product search and a logging pipeline for a mid-size retailer. Snapshot taken on 14 Apr 26 at 02:18 BST, during an overnight maintenance window. A data node (es-data-02) ran out of disk after a log-index rollover went wrong and was evicted from the cluster. The cluster health API now reports a red status, and the per-index breakdown (?level=indices) shows:
| Index | Status | Unassigned shards | Notes |
|---|---|---|---|
| products-v7 | red | 2 | One primary + one replica unallocated; the primary lived on es-data-02 and has no in-sync copy elsewhere |
| logs-2026.04.14 | yellow | 8 | Primaries intact on the surviving nodes; replicas gone |
| categories-v3 | yellow | 4 | Replicas gone, primaries fine |
- products-v7 is red, so product search is partly broken. Any query hitting the missing primary’s shard returns partial results or a shard-failure error. Storefront search for the SKUs on that shard is degraded. Pair with Unassigned Shards to see the 14 unassigned shards and with Active Node Count, which dropped from 6 to 5 at 02:12.
- The yellow indices are not the emergency. logs-2026.04.14 and categories-v3 lost replicas only; their data is still fully served. They will go green again automatically once es-data-02 rejoins and shards re-replicate.
- Root cause is disk, not the shard layer. Storage Usage % on es-data-02 had crossed the flood-stage watermark before the node dropped. The fix is to free disk or add capacity, after which the unallocated primary recovers from its translog.
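The arithmetic in this example can be checked with a short roll-up: the cluster colour is the worst colour of any index, and the unassigned total is the sum of the per-index counts. The sketch below uses only the three rows from the table; the function name `roll_up` and the `SEVERITY` ranking are illustrative assumptions.

```python
from typing import Dict, Tuple

SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def roll_up(indices: Dict[str, Tuple[str, int]]) -> Tuple[str, int]:
    """Return (worst status, total unassigned shards) across all indices."""
    worst = max(
        (status for status, _ in indices.values()),
        key=SEVERITY.__getitem__,
        default="green",
    )
    total = sum(count for _, count in indices.values())
    return worst, total


example = {
    "products-v7": ("red", 2),
    "logs-2026.04.14": ("yellow", 8),
    "categories-v3": ("yellow", 4),
}
print(roll_up(example))  # ('red', 14)
```

The result matches the narrative: one red index is enough to make the cluster red, and 2 + 8 + 4 gives the 14 unassigned shards shown on the Unassigned Shards card.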
Sibling cards
| Card | Why pair it with Cluster Not Green | What the combination tells you |
|---|---|---|
| Cluster Status (green / yellow / red) | The always-on KPI version of the same status field. | This alert is the paging wrapper around that KPI; the KPI shows the steady-state colour. |
| Unassigned Shards | The direct cause of yellow/red. | Any unassigned shard explains the colour; zero unassigned with non-green usually means shards are still initialising, which is transient. |
| Active Node Count | A lost node is the most common trigger. | Node count dropping at the same instant the cluster went red points straight at the missing node. |
| Storage Usage % | A full disk evicts nodes and blocks shard allocation. | A watermark breach just before red means disk is the root cause. |
| Initializing / Relocating Shards | Recovery from red passes through initialising shards. | High initialising count after a red event means the cluster is healing; falling to zero means recovery is done. |
| Elasticsearch Health Score | The composite that weights cluster status heavily. | A red cluster alone drags the health score well below the 70 alert line. |
| Pending Cluster Tasks | Master-node backlog during mass reallocation. | A red event plus a high pending-task queue means the master is busy reassigning shards. |
Reconciling against the source
Where to look in Elasticsearch’s own tooling:
- GET /_cluster/health returns the authoritative status field. This is the exact value the card reads.
- GET /_cluster/health?level=indices breaks the status down per index so you can see which index is responsible.
- GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason lists every shard and the reason any are unassigned, in human-readable rows.
- GET /_cluster/allocation/explain explains in plain terms why a specific shard cannot be allocated.

On a managed service, the same value appears in the console: Elastic Cloud surfaces it under the deployment Health view, and AWS OpenSearch Service / managed Elasticsearch offerings expose it as the ClusterStatus.green / .yellow / .red CloudWatch metrics. The console may lag the live API by a minute or two because it samples on its own schedule.
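When reconciling by hand, the _cat/shards output is the quickest list to scan. The sketch below is one way to pull out just the unassigned rows from that text; it assumes the five-column layout requested with h=index,shard,prirep,state,unassigned.reason and a header row from ?v. The function name and the returned dict keys are hypothetical.

```python
from typing import Dict, List


def unassigned_from_cat_shards(text: str) -> List[Dict]:
    """Parse _cat/shards text (with ?v header) and keep the UNASSIGNED rows."""
    lines = [line for line in text.splitlines() if line.strip()]
    found = []
    for line in lines[1:]:  # lines[0] is the header emitted by ?v
        parts = line.split()
        if len(parts) < 4 or parts[3] != "UNASSIGNED":
            continue
        found.append(
            {
                "index": parts[0],
                "shard": int(parts[1]),
                "kind": "primary" if parts[2] == "p" else "replica",
                # The reason column is empty for assigned shards.
                "reason": parts[4] if len(parts) > 4 else "",
            }
        )
    return found
```

Any row whose kind is primary is what makes the cluster red; rows with only replicas explain a yellow status.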
Why our number may legitimately differ from a manual _cluster/health call:
| Reason | Direction | Why |
|---|---|---|
| Sustain window | Card lags a raw API call | A bare _cluster/health shows yellow the instant a replica drops; the card only fires after 5 minutes of sustained non-green, so a brief flap shows in the API but not as an alert. |
| Poll cadence | Up to one poll interval | The card samples on a fixed interval; a status that flips green between polls may never be captured. |
| Per-index drill timing | Index list can lag the colour | The headline colour and the per-index breakdown are two API calls; during fast recovery the colour can update before the index list refreshes. |
| Managed-console sampling | Console can lag both | Elastic Cloud and CloudWatch sample on their own cadence, so the vendor console can show green while the live API still reports yellow for a minute or two. |
Known limitations / FAQs
Why did the cluster go yellow during a routine node restart, and is that a problem?
When you stop a node for a rolling restart, the replicas that lived on it become unallocated, so the cluster goes yellow until those replicas re-replicate elsewhere or the node returns. This is expected and harmless: all primaries are still allocated, so reads and writes succeed. The 5-minute sustain window is designed to swallow short restarts; if you do slow rolling restarts, consider raising the sustain window or using cluster.routing.allocation.enable: primaries during the maintenance so replicas are not reshuffled needlessly.
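As a rough illustration of the allocation setting mentioned above, the request body for PUT /_cluster/settings could be built as follows. This is a sketch under assumptions: the helper name is hypothetical, and using the persistent block (rather than transient) is a choice made here, not something this page prescribes.

```python
import json
from typing import Optional


def allocation_settings_body(mode: Optional[str]) -> str:
    """Build a PUT /_cluster/settings body.

    mode="primaries" restricts allocation to primaries during maintenance;
    mode=None sends null, which resets the setting to its default.
    """
    return json.dumps(
        {"persistent": {"cluster.routing.allocation.enable": mode}}
    )


before_restart = allocation_settings_body("primaries")
after_restart = allocation_settings_body(None)
```

Sending the second body once the node has rejoined matters: if the setting is left at primaries, replicas are never reallocated and the cluster stays yellow.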
The cluster is yellow but everything works fine. Why page anyone?
Yellow does not page on-call; only red does in the default configuration. Yellow is a warning that you have lost redundancy: if a second node fails while you are yellow, you risk going red. Treat yellow as “fix soon”, red as “fix now”. A cluster that sits yellow for days usually has a real allocation problem (a disk watermark, a stuck shard, or index.number_of_replicas set higher than the number of available nodes).
Can a single-node cluster ever be green?
No, not for any index with replicas. A replica shard can never be allocated to the same node as its primary, so a single-node cluster with the default 1 replica is permanently yellow. This is by design, not a fault. For a deliberate single-node setup, set index.number_of_replicas: 0 and the cluster can be green, but then you have no redundancy at all.
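The constraint reduces to a one-line check: an index needs at least one data node per shard copy. The sketch below encodes only that rule; the function name is hypothetical, and it deliberately ignores other reasons a replica might stay unallocated, such as disk watermarks or allocation filters.

```python
def can_reach_green(data_nodes: int, number_of_replicas: int) -> bool:
    """True if there are enough data nodes to place a primary and all replicas.

    A replica never shares a node with its primary (or with another copy of
    the same shard), so each shard needs number_of_replicas + 1 nodes.
    """
    return data_nodes >= number_of_replicas + 1
```

For example, `can_reach_green(1, 1)` is false (the permanently yellow single-node case) and `can_reach_green(1, 0)` is true.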
What is the difference between this alert and the Unassigned Shards card?
This card reads the cluster status traffic light directly. The Unassigned Shards card counts the shards behind that colour. They move together: any unassigned replica makes the cluster yellow, any unassigned primary makes it red. Use this card to know “is data unavailable?”, use Unassigned Shards to know “how many shards and which ones?”.
The cluster went red but search still returns results. How?
Search against a red cluster returns partial results: shards that are allocated answer, the unallocated primary’s shard returns a shard-failure, and Elasticsearch merges what it can unless you set allow_partial_search_results: false. So the storefront may look mostly fine while silently dropping the SKUs on the missing shard. That silent partiality is exactly why red is a paging event even when the site “looks okay”.
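A client can detect this silent partiality from the `_shards` section of the search response, which reports `total`, `successful`, and `failed` counts. The following sketch assumes the response has been parsed into a dict; the function name `is_partial` is hypothetical, and the check is intentionally conservative.

```python
from typing import Dict


def is_partial(search_response: Dict) -> bool:
    """True if fewer shards answered than the search targeted."""
    shards = search_response.get("_shards", {})
    total = shards.get("total", 0)
    successful = shards.get("successful", 0)
    failed = shards.get("failed", 0)
    # Either signal alone is enough to treat the result set as incomplete.
    return failed > 0 or successful < total
```

An application that cannot tolerate missing SKUs could surface a warning, or retry later, whenever this returns true.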
Does this card cover index-level red, or only cluster-level?
The headline reflects the cluster status, which is the worst of any single index. The per-index drill (sourced from ?level=indices) names the specific red or yellow indices so you are not left guessing. A single red index makes the whole cluster red even if 500 other indices are green.
How long does recovery from red usually take?
It depends entirely on the cause. If a node rejoins and its primary recovers from the translog, recovery can be seconds to a few minutes. If a primary has to be rebuilt from a replica on another node, it scales with shard size and network throughput, often tens of minutes for large shards. Watch Initializing / Relocating Shards trend to zero as the signal that recovery is finishing.
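If you are polling the health API yourself while waiting, a simple completion check is sketched below. It relies on the `status`, `initializing_shards`, `relocating_shards`, and `unassigned_shards` fields of the cluster health response; the function name and the decision to require all four conditions are assumptions for illustration.

```python
from typing import Dict


def recovery_finished(health: Dict) -> bool:
    """True once the cluster is green and no shards are still moving."""
    return (
        health.get("status") == "green"
        and health.get("initializing_shards", 0) == 0
        and health.get("relocating_shards", 0) == 0
        and health.get("unassigned_shards", 0) == 0
    )
```

Requiring zero relocating shards is stricter than green alone, since a green cluster can still be rebalancing after a node rejoins.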