At a glance
A Redis Cluster divides the keyspace into exactly 16384 hash slots, and every slot must be owned by a reachable primary for the whole keyspace to be serveable. This card is the live gauge of how many of those slots are currently healthy and assigned, read straight from cluster_slots_ok in CLUSTER INFO. A reading of 16384 is full coverage; anything below means some keys are unreachable and operations on those slots fail. For a platform or SRE team this is the heartbeat of cluster availability: one number that says “is my entire keyspace serveable right now?”
| Data source | cluster_slots_ok from CLUSTER INFO, read across all reachable nodes. <16384 means some keys unreachable (operations on those slots fail). |
| Metric basis | Slot-ownership health, not key count or memory. Each of the 16384 slots is either owned by a reachable primary (counted) or not (missing). |
| Full-coverage value | 16384. A healthy cluster always reads exactly 16384 with cluster_state:ok. |
| Aggregation window | RT (real-time). The gauge re-reads CLUSTER INFO on every poll cycle. |
| Alert trigger | <16384. Any reading below full coverage means part of the keyspace is dark; this is the same condition the Cluster Slot Coverage Gap alert pages on. |
| What does NOT count toward coverage | (1) Slots whose primary is down with no promoted replica; (2) slots in fail/pfail state. Slots in MIGRATING/IMPORTING during a reshard are still served and still counted. |
| Topology scope | All shards in the cluster the connector targets. On managed services (AWS ElastiCache cluster mode, Azure Cache for Redis Enterprise, Redis Cloud) the same CLUSTER INFO view is read through the configured endpoint. |
| Standalone instances | Not applicable. A non-cluster instance has no hash slots and reads n/a. |
| Time window | RT (real-time, re-evaluated on every poll) |
| Alert trigger | <16384 |
| Roles | owner, engineering, operations |
Calculation
The card issues CLUSTER INFO and reads the cluster_slots_ok line directly.
cluster_slots_ok is Redis’s own count of slots that are both assigned to a primary and whose primary is currently in an ok (reachable) state. The headline shows the raw count against 16384 and the coverage percentage. Two companion fields refine the reading: cluster_slots_pfail (slots whose owner is suspected dead by some node but not yet agreed) and cluster_slots_fail (slots whose owner is agreed dead). When everything is healthy these are both zero and cluster_slots_ok is 16384.
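The reading described above can be sketched in a few lines. This is a minimal illustration, not the card's actual implementation: it assumes the CLUSTER INFO reply is available as raw text, and the helper names `parse_cluster_info` and `coverage` as well as the sample reply are hypothetical.

```python
TOTAL_SLOTS = 16384  # fixed size of the Redis Cluster hash-slot space


def parse_cluster_info(raw: str) -> dict[str, str]:
    """Turn the 'key:value' lines of a CLUSTER INFO reply into a dict."""
    fields = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def coverage(raw: str) -> tuple[int, float]:
    """Return (cluster_slots_ok, coverage percentage) from a raw reply."""
    fields = parse_cluster_info(raw)
    slots_ok = int(fields["cluster_slots_ok"])
    return slots_ok, 100.0 * slots_ok / TOTAL_SLOTS


# Illustrative reply, not captured from a real cluster.
SAMPLE = """cluster_state:fail
cluster_slots_assigned:16384
cluster_slots_ok:10922
cluster_slots_pfail:0
cluster_slots_fail:5462
"""

slots_ok, pct = coverage(SAMPLE)
print(f"{slots_ok}/{TOTAL_SLOTS} slots ok ({pct:.1f}%)")
```

The percentage is simply the count divided by 16384, so the headline and the coverage figure always move together.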
Because each node holds its own view of the cluster and a network-partitioned node can report a stale, optimistic count, Vortex IQ reads CLUSTER INFO from every reachable node and takes the lowest cluster_slots_ok it sees. That ensures a minority node cannot mask a genuine coverage shortfall with a rosy local reading.
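The "take the lowest view" rule can be illustrated as follows. This is a sketch under stated assumptions: the per-node readings are supplied as a plain mapping, the function name `pessimistic_slots_ok` and the node addresses are hypothetical, and treating "no reachable node" as zero coverage is a choice made for this example rather than documented behaviour.

```python
TOTAL_SLOTS = 16384


def pessimistic_slots_ok(node_views: dict[str, int]) -> int:
    """Return the lowest cluster_slots_ok reported by any reachable node.

    node_views maps a node address to the cluster_slots_ok that node
    reported. An empty mapping means no node was reachable, which this
    sketch treats as zero coverage (an assumption, not documented behaviour).
    """
    if not node_views:
        return 0
    return min(node_views.values())


# Hypothetical readings: one partitioned node still holds a stale view.
views = {
    "10.0.0.1:6379": 10922,
    "10.0.0.2:6379": 10922,
    "10.0.0.3:6379": 16384,  # stale, optimistic
}
print(pessimistic_slots_ok(views))
```

Taking the minimum rather than a majority vote or an average means a single pessimistic node is enough to surface a shortfall, which is the conservative choice for an availability gauge.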
Worked example
A platform team runs a Redis Cluster of three primaries (each with one replica) backing a session store and a read-through cache. Slots are split evenly: 0 to 5460, 5461 to 10922, 10923 to 16383. Snapshot taken on 09 May 26 across a 12-minute window during a rolling node upgrade.

| Time (BST) | Event | cluster_slots_ok | cluster_state | Coverage |
|---|---|---|---|---|
| 10:00 | Steady state | 16,384 | ok | 100% |
| 10:04 | Primary C taken down for upgrade | 10,922 | fail | 66.7% |
| 10:04:09 | Replica C-rep promoted to primary | 16,384 | ok | 100% |
| 10:08 | Upgraded C rejoins as replica | 16,384 | ok | 100% |
cluster_slots_ok dropped to 10,922 (two shards’ worth) and cluster_state read fail. As soon as C-rep was promoted, coverage returned to 16384.
- The dip was expected, the recovery was automatic. Because primary C had a healthy replica, the cluster promoted it within the failover timeout and coverage was restored without intervention. A planned rolling upgrade shard-by-shard should produce exactly this pattern: brief dips that self-heal.
- The size of the dip tells you how much was at risk. One shard’s worth missing (5462 slots) means a third of the keyspace was unreachable for those 9 seconds. Had two shards been down at once, the dip would be larger and the recovery slower.
- A dip that does not recover is the real incident. If coverage had stayed at 10,922, it would have meant C had no replica to promote, turning a routine upgrade into a sustained outage. The value of this gauge is watching it return to 16384 promptly.
- 16384 is the only fully healthy reading. Any other number, even 16383, means at least one slot is unreachable and some keys are erroring. There is no “nearly full coverage” that is safe; it is binary in customer terms.
- Brief dips during failover are normal; persistent shortfalls are incidents. Watch the gauge return to 16384. Speed of recovery is governed by cluster-node-timeout and whether a replica exists to promote.
- This gauge and the coverage-gap alert are the same signal, two views. This card is the continuous number; the Cluster Slot Coverage Gap alert is its threshold page. Read them together: the gauge for trend, the alert for the wake-up.
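The distinction between a self-healing dip and a sustained shortfall can be made mechanical. The sketch below replays the worked example's timeline; the function names `find_dips` and `is_incident` and the 30-second tolerance are hypothetical, and the year in "09 May 26" is read as 2026 purely so the timestamps can be constructed.

```python
from datetime import datetime, timedelta

TOTAL_SLOTS = 16384


def find_dips(samples):
    """Yield (start, end, lowest) for each run of readings below 16384.

    samples is a time-ordered list of (datetime, cluster_slots_ok).
    end is the first timestamp at which coverage is full again, or None
    if the series ends while coverage is still short.
    """
    start = lowest = None
    for ts, slots_ok in samples:
        if slots_ok < TOTAL_SLOTS:
            if start is None:
                start, lowest = ts, slots_ok
            else:
                lowest = min(lowest, slots_ok)
        elif start is not None:
            yield start, ts, lowest
            start = lowest = None
    if start is not None:
        yield start, None, lowest


def is_incident(dip, tolerance=timedelta(seconds=30)):
    """Flag a dip that never recovered or outlasted the (assumed) tolerance."""
    start, end, _ = dip
    return end is None or (end - start) > tolerance


day = datetime(2026, 5, 9)  # assumed year, see lead-in
samples = [
    (day.replace(hour=10, minute=0), 16384),
    (day.replace(hour=10, minute=4), 10922),
    (day.replace(hour=10, minute=4, second=9), 16384),
    (day.replace(hour=10, minute=8), 16384),
]
dips = list(find_dips(samples))
for start, end, lowest in dips:
    length = end - start if end else "unrecovered"
    print(lowest, length, is_incident((start, end, lowest)))
```

In practice the tolerance would be chosen relative to cluster-node-timeout, since that setting bounds how long a healthy failover should take.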
Sibling cards to read alongside this one
| Card | Why pair it with Cluster Slots Assigned | What the combination tells you |
|---|---|---|
| Cluster Slot Coverage Gap (<16384 slots assigned) | The threshold alert this gauge feeds. | Same cluster_slots_ok: this card is the live number, that one is the page. |
| Connected Replicas | Replicas are what restore coverage after a primary dies. | Full coverage but zero replicas on a shard equals one host loss away from a gap. |
| Replica Lag (seconds) | A promoted replica with high lag restores coverage but loses writes. | High lag at promotion equals coverage back, recent writes gone. |
| Redis Health Score | The executive composite that coverage dominates. | Any drop below 16384 collapses the health score; this card is the cause. |
| Instance Uptime | A reset uptime on a shard explains a coverage dip. | A recent restart on a node aligns with the dip in coverage. |
| Operations per Second (live) | Throughput tracks coverage during a dip. | OPS falling in proportion to lost slots confirms client errors on the dark range. |
Reconciling against the source
Where to look in Redis itself:
- CLUSTER INFO is the authority: redis-cli -c CLUSTER INFO shows cluster_state, cluster_slots_assigned, and cluster_slots_ok.
- CLUSTER SHARDS (Redis 7+) or CLUSTER NODES maps each slot range to its owning node, so a shortfall can be traced to a specific primary.
- CLUSTER SLOTS returns the slot-to-node assignment as a structured list; a missing range is simply absent.
- redis-cli --cluster check <host>:<port> runs Redis’s own coverage audit and prints “[OK] All 16384 slots covered” or names the uncovered slots.

Why our number may legitimately differ from a single node’s view:
| Reason | Direction | Why |
|---|---|---|
| Per-node staleness | We may show a lower count momentarily | A partitioned node reports its own optimistic cluster_slots_ok; we read all nodes and take the lowest, so we can show a dip that a majority node has not yet agreed on. |
| Failover in flight | Transient dip then recovery | During promotion the count drops then returns; a CLUSTER INFO read after recovery shows 16384, while we captured the dip. |
| Reshard in progress | No change, despite busy CLUSTER NODES | MIGRATING/IMPORTING slots are still served and still counted, so coverage stays 16384 throughout a reshard. |
| Poll cadence | We may miss a sub-poll flap | A coverage dip shorter than the poll interval can be missed by both our gauge and a manual check; only sustained or repeated dips are reliably captured. |
On managed services (AWS ElastiCache cluster mode, Azure Cache for Redis Enterprise, Redis Cloud) we read the same CLUSTER INFO through the configured endpoint, and each surfaces a “shards healthy” or “node group” health view in its own console. Reconcile our coverage count against the console’s healthy-shard count: on an evenly split three-shard cluster, one unhealthy shard corresponds to roughly 5461 missing slots, two shards to roughly 10922.
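The shard-count reconciliation above is simple arithmetic and can be sketched as below. The function name `estimate_unhealthy_shards` is hypothetical, and the estimate only holds when slots are split evenly across shards; on an uneven layout the slot-to-node map from CLUSTER SHARDS or CLUSTER NODES is the better source.

```python
TOTAL_SLOTS = 16384


def estimate_unhealthy_shards(slots_ok: int, shard_count: int) -> int:
    """Estimate how many evenly sized shards the missing slots represent."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    missing = TOTAL_SLOTS - slots_ok
    slots_per_shard = TOTAL_SLOTS / shard_count  # about 5461.3 for 3 shards
    return round(missing / slots_per_shard)


print(estimate_unhealthy_shards(16384, 3))  # nothing missing
print(estimate_unhealthy_shards(10922, 3))  # about one shard's worth missing
print(estimate_unhealthy_shards(5461, 3))   # about two shards' worth missing
```

If the estimate disagrees with the console's unhealthy-shard count, timing is the first thing to check, since the two views may have sampled different moments of the same failover.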
Known limitations / FAQs
My instance is a single standalone Redis. Why does this card read n/a?
Hash slots only exist in Redis Cluster mode. A standalone primary owns the whole keyspace implicitly and reports no cluster_slots_ok, so the gauge reads n/a and does not alert. For availability monitoring of a standalone setup, watch Connected Replicas and Instance Uptime instead.
The gauge dipped below 16384 for a few seconds during a node upgrade and recovered. Was that bad?
No, that is the expected pattern for a rolling upgrade. When you take a primary down, its slots are briefly unserved until a replica is promoted (bounded by cluster-node-timeout), so coverage dips and then returns. A self-healing dip means failover worked. The concerning case is a dip that does not recover, which means the dead primary had no replica to promote.
What is the difference between cluster_slots_assigned and cluster_slots_ok?
cluster_slots_assigned counts slots that have an owner configured (regardless of whether that owner is currently reachable); it should always be 16384 on a properly set-up cluster. cluster_slots_ok counts slots whose owner is configured and reachable. The gap between them is your coverage problem: assigned 16384 but ok 10922 means owners exist but one is down.
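The two counters separate a configuration problem from an availability problem, which a small classifier makes explicit. This is an illustrative sketch; the function name `diagnose` and the message wording are hypothetical.

```python
TOTAL_SLOTS = 16384


def diagnose(assigned: int, ok: int) -> str:
    """Classify a CLUSTER INFO reading by where the shortfall sits."""
    if assigned < TOTAL_SLOTS:
        # Some slots have no owner at all: a setup or reshard problem.
        return f"{TOTAL_SLOTS - assigned} slots have no owner configured"
    if ok < assigned:
        # Owners exist, but at least one of them is not in an ok state.
        return f"{assigned - ok} slots are assigned but their owner is not ok"
    return "full coverage"


print(diagnose(16384, 16384))
print(diagnose(16384, 10922))
print(diagnose(16000, 16000))
```

The unassigned case is checked first because it needs a different fix (assigning the slots) from the unreachable-owner case (restoring or failing over a node).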
Can coverage read 16384 while I still have a problem?
Yes, in a subtle way. cluster_slots_ok only measures slot ownership and primary reachability. A cluster can be at full coverage while a replica is missing or lagging badly, so you are at full coverage but with no resilience: the next primary loss would open a gap. Always read this gauge with Connected Replicas and Replica Lag (seconds) to confirm you can survive a failover, not just that you are healthy now.
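Reading coverage together with replica counts can be sketched as a single check. The function name `resilience_report`, the shard labels, and the shape of `replicas_by_shard` are assumptions for illustration; in practice the replica counts would come from the Connected Replicas card or from CLUSTER SHARDS / CLUSTER NODES.

```python
TOTAL_SLOTS = 16384


def resilience_report(slots_ok: int, replicas_by_shard: dict[str, int]) -> str:
    """Combine coverage with per-shard replica counts into one verdict."""
    if slots_ok < TOTAL_SLOTS:
        return "coverage gap"
    # Full coverage: now check whether every shard could survive a failover.
    unprotected = sorted(s for s, n in replicas_by_shard.items() if n == 0)
    if unprotected:
        return "full coverage, no failover target for: " + ", ".join(unprotected)
    return "full coverage, every shard has a replica"


# Hypothetical cluster: shard C has lost its only replica.
print(resilience_report(16384, {"A": 1, "B": 1, "C": 0}))
```

This sketch ignores replica lag; a fuller version would also flag shards whose only replica is far behind, since promoting it restores coverage but loses recent writes.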
During a reshard CLUSTER NODES shows slots MIGRATING. Why does coverage stay at 16384?
Migrating and importing slots are still served by their current owner throughout the move; clients are redirected with ASK/MOVED but never get CLUSTERDOWN. So coverage stays at full throughout a healthy reshard. A dip during a reshard would indicate the operation broke, which is rare and worth investigating.
We run cluster-require-full-coverage no. Does this gauge still reflect reality?
Yes. That setting changes how the cluster behaves when a slot is unserved (it keeps serving the slots it still owns rather than refusing commands cluster-wide) but it does not change cluster_slots_ok. The gauge reads the slot count directly, so a shortfall shows up whether or not cluster_state reports fail.
On ElastiCache the console says all shards healthy but the gauge dipped. Which do I trust?
Check timing and cadence. Managed-service health views often poll on a coarser interval (around 60 seconds) and smooth transient states, while we read CLUSTER INFO in real time and take the most pessimistic node view. A brief dip during an ElastiCache node replacement can be invisible in the console but real on the wire. Settle it with redis-cli --cluster check against the endpoint, which queries every node and reports actual coverage at that moment.