At a glance
Unavailable Ranges is the most serious single number on the CockroachDB board. A range is a contiguous slice of the keyspace replicated across nodes; it needs a quorum of its replicas alive to serve reads and writes. A range that has lost quorum is unavailable, which means some of your data cannot be read or written at all until quorum is restored. This is not “slow”, it is “offline”. The healthy value is exactly zero, always. Any value above zero is a hard alert, because even a single unavailable range means a portion of the database is down. This card answers the one question that matters most during an outage: “is any of my data actually inaccessible right now?”
| What it tracks | The number of ranges currently lacking quorum, and therefore unable to serve reads or writes, across the whole cluster. |
| Data source | CockroachDB-distinctive: ranges lacking quorum. Any value above zero means some data is unavailable for reads and writes. Vortex IQ reads the ranges.unavailable cluster metric and crdb_internal.kv_store_status (the per-store range health counters), corroborated by node liveness in crdb_internal.gossip_liveness. On CockroachDB Cloud the same figure is read via the Cloud metrics API and shown on the cluster Overview. |
| Time window | RT (real-time, continuously evaluated, because quorum loss must be caught instantly). |
| Alert trigger | > 0. There is no tolerance band: any unavailable range at all is a page-someone-now event. |
| Roles | DBA, platform, SRE, incident commander |
Calculation
There is no derivation or smoothing here; the card surfaces CockroachDB’s own count of ranges that cannot reach quorum.
- What quorum means. Each range is replicated (the default replication factor is 3, often 5 for critical clusters). To commit a write or serve a consistent read, a majority of a range’s replicas must be available: 2 of 3, or 3 of 5. Lose the majority and the range can no longer make progress: it is unavailable.
- How a range becomes unavailable. The usual cause is losing more nodes (or stores) than the range’s fault tolerance allows at once: with replication factor 3 you can lose one node and stay available, but lose two of the three replicas for a range simultaneously and that range goes unavailable. Disk failures, network partitions isolating a majority, and zone/region outages on a poorly-distributed cluster are the common triggers.
- The source counter. The ranges.unavailable metric is the cluster-wide count; crdb_internal.kv_store_status carries the per-store view. Because this is computed from raft replica state, it reflects true quorum loss, not transient leaseholder churn.
- Why it is real-time. Unlike capacity or latency, which trend, quorum loss is binary and urgent: data is either reachable or it is not. The card evaluates continuously so a quorum loss is visible the moment it happens.
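The majority rule described above can be sketched in a few lines. This is an illustrative calculation only, not CockroachDB's own code; the function names are hypothetical.

```python
def quorum_size(replication_factor: int) -> int:
    """Smallest majority of replicas: 2 of 3, 3 of 5."""
    return replication_factor // 2 + 1


def fault_tolerance(replication_factor: int) -> int:
    """How many replicas a range can lose at once and still keep quorum."""
    return replication_factor - quorum_size(replication_factor)


def is_unavailable(live_replicas: int, replication_factor: int) -> bool:
    """A range is unavailable once its live replicas drop below a majority."""
    return live_replicas < quorum_size(replication_factor)


if __name__ == "__main__":
    for rf in (3, 5):
        print(rf, quorum_size(rf), fault_tolerance(rf))
```

With replication factor 3, `is_unavailable(2, 3)` is false (one replica lost, quorum kept) and `is_unavailable(1, 3)` is true (two replicas lost, quorum gone).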
Worked example
A platform team runs a 6-node CockroachDB cluster (v23.2, replication factor 3) across three availability zones, two nodes per zone, backing an ecommerce order and inventory stack. At 02:14 BST on 14 Apr 26 the on-call SRE is paged: Unavailable Ranges has gone from 0 to 37. The first read tells the whole story:
| Signal | Reading | Meaning |
|---|---|---|
| Unavailable Ranges | 37 | 37 ranges have lost quorum, some data is offline. |
| Cluster Node Count | 4 of 6 expected | Two nodes are down. |
| Node liveness | nodes 5 and 6 DEAD | Both dead nodes are in the same availability zone. |
| Under-Replicated Ranges | 880 and climbing | Many ranges lost one replica but kept quorum. |
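To see why the same two dead nodes leave some ranges unavailable and others merely under-replicated, here is a minimal sketch that classifies ranges by where their replicas sit. The range IDs and replica placements are invented for illustration (they are not taken from the incident), and the sketch assumes each replica set is as large as the configured replication factor.

```python
from collections import Counter


def classify_range(replica_nodes, dead_nodes):
    """Classify one range from the nodes holding its replicas."""
    live = [n for n in replica_nodes if n not in dead_nodes]
    quorum = len(replica_nodes) // 2 + 1
    if len(live) < quorum:
        return "unavailable"        # majority lost: data is offline
    if len(live) < len(replica_nodes):
        return "under-replicated"   # a replica is gone, quorum is kept
    return "healthy"


# Hypothetical placements: range ID -> node IDs holding its replicas.
ranges = {
    101: (1, 3, 5),
    102: (2, 5, 6),
    103: (1, 2, 3),
    104: (4, 5, 6),
    105: (2, 4, 6),
}
dead = {5, 6}  # the two dead nodes from the worked example

summary = Counter(classify_range(nodes, dead) for nodes in ranges.values())

if __name__ == "__main__":
    print(dict(summary))
```

Only the ranges with two replicas on nodes 5 and 6 lose quorum; ranges with a single replica there keep serving traffic while they wait to be re-replicated.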
- Zero is the only acceptable value. Unlike latency or capacity, there is no “amber” band. Any non-zero reading means data is offline and is always an incident.
- Read it next to node count and under-replication. Unavailable plus dead nodes tells you the cause (quorum loss from node failure); under-replicated ranges climbing in parallel tells you how close the rest of the cluster is to the same fate. Pair with Cluster Node Count and Under-Replicated Ranges.
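The reading rules above can be expressed as a small triage function. This is a sketch under stated assumptions: the `Snapshot` fields and the returned messages are hypothetical and do not describe how Vortex IQ evaluates its alerts.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One reading of the three cards (field names are hypothetical)."""
    unavailable_ranges: int
    under_replicated_ranges: int
    live_nodes: int
    expected_nodes: int


def triage(s: Snapshot) -> str:
    """Zero is the only healthy value; there is no amber band for unavailability."""
    if s.unavailable_ranges > 0:
        if s.live_nodes < s.expected_nodes:
            return "incident: quorum lost, node loss is the likely cause"
        return "incident: quorum lost, no node loss visible"
    if s.under_replicated_ranges > 0:
        return "warning: quorum intact, ranges are under-replicated"
    return "ok"


if __name__ == "__main__":
    # Readings from the worked example: 37 unavailable, 880 under-replicated, 4 of 6 nodes.
    print(triage(Snapshot(37, 880, 4, 6)))
```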
Sibling cards
| Card | Why pair it with Unavailable Ranges | What the combination tells you |
|---|---|---|
| Under-Replicated Ranges | The “lost a replica but kept quorum” sibling. | Under-replicated rising toward unavailable shows how close more ranges are to going offline. |
| Cluster Node Count | Quorum loss is almost always caused by node loss. | Unavailable ranges with a dropped node count names the failure as a node/zone outage. |
| Unavailable or Under-Replicated Ranges | The combined alert that fires on either condition. | The alert-list card that pages on this card or its under-replicated sibling crossing zero. |
| Active Nodes (status=live) | The live-node liveness view. | Fewer live nodes than expected explains where the lost quorum came from. |
| Decommissioning Nodes | A botched or too-fast decommission can strip quorum. | Unavailable ranges during a decommission means the drain removed replicas too aggressively. |
| Raft Quiescent Lag (seconds) | Replication health on surviving ranges. | High raft lag alongside unavailability means the survivors are struggling to re-replicate. |
| Last Successful Backup (hours ago) | The recoverability backstop. | If unavailable ranges cannot be recovered, a fresh backup is your last line of defence. |
| CockroachDB Health Score | The composite where availability is the heaviest axis. | A single unavailable range can drag the whole health score below 70 on its own. |
Reconciling against the source
CockroachDB exposes unavailable ranges natively, so this card is a direct read and reconciliation is quick:
- DB Console. The Cluster Overview and the Replication dashboard both show the unavailable-range count prominently, and the Problem Ranges page (/#/reports/problemranges) lists the specific ranges that lack quorum, with their replica locations. This is the page to open during an incident: it names the ranges and the tables they belong to.
- Cluster metrics. The ranges.unavailable time-series in the Metrics dashboard is the same counter the card reads.
- crdb_internal tables. SELECT * FROM crdb_internal.kv_store_status; exposes per-store range health, and the system.replication_stats / crdb_internal.ranges views let you locate unavailable ranges and their replica sets in SQL. crdb_internal.gossip_liveness confirms which nodes are dead.
- cockroach debug tooling. For recovery, the cockroach debug recover family of commands operates on the same range state when quorum cannot be restored by bringing nodes back.
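For a scripted cross-check, the same counter can be summed from a node's Prometheus-format metrics output, where ranges.unavailable appears as `ranges_unavailable` with one series per store. The sketch below parses a pasted sample and makes no network call. The sample values are invented, and it assumes each range is counted by only one store and that sample lines carry no trailing timestamp.

```python
def sum_metric(prom_text: str, name: str) -> float:
    """Sum every series of one metric in Prometheus text exposition format."""
    total = 0.0
    for line in prom_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        metric_name = series.split("{", 1)[0]  # drop the {store="..."} labels
        if metric_name == name:
            total += float(value)
    return total


# Hypothetical sample, shaped like scraped metrics output.
SAMPLE = """\
# TYPE ranges_unavailable gauge
ranges_unavailable{store="1"} 20
ranges_unavailable{store="2"} 17
ranges_unavailable{store="3"} 0
ranges_underreplicated{store="1"} 880
"""

if __name__ == "__main__":
    print(sum_metric(SAMPLE, "ranges_unavailable"))
```

If the summed value disagrees with the card for more than a scrape interval, compare against the Problem Ranges page before trusting either number.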
Known limitations / FAQs
What is the difference between unavailable and under-replicated?
Under-replicated means a range has fewer replicas than configured but still has a quorum, so it keeps serving reads and writes while the cluster re-replicates it. Unavailable means the range has lost quorum and cannot serve reads or writes at all: the data is offline. Under-replication is a self-healing warning; unavailability is an outage. See Under-Replicated Ranges.
The count is above zero but only briefly. Is that still an incident?
Treat it as one until proven otherwise. A brief blip can happen during a node restart if a range momentarily loses quorum before a replica catches up, but it can equally be the first second of a real outage. Because the cost of a true unavailable range is so high (offline data), the alert has no tolerance band. Investigate every non-zero reading; close it out if it was a transient restart artefact.
Will unavailable ranges fix themselves?
Sometimes. If the cause is nodes that come back (a transient zone outage, a node restart), the ranges regain quorum the moment enough replicas return and unavailability clears automatically. If the nodes are permanently lost, the ranges stay unavailable until you either replace the nodes or run CockroachDB’s range-recovery tooling to restore quorum from the surviving replica, which can mean losing the most recent unreplicated writes.
Can a planned operation cause unavailable ranges?
Yes, if done carelessly. Decommissioning or shutting down too many nodes at once, or nodes that hold a majority of the same range’s replicas, can strip quorum. Always decommission one node at a time and let ranges re-replicate before removing the next; watch Decommissioning Nodes and Under-Replicated Ranges during the operation.
How do I find out which tables are affected?
Open the DB Console Problem Ranges page (/#/reports/problemranges), which lists each unavailable range and its start/end keys; those keys map to specific tables and indexes. You can also query crdb_internal.ranges to join range IDs to table names. Knowing whether the 40 unavailable ranges are cold archive data or your live orders table changes the urgency completely.
How do I prevent this in future?
Survivability is a placement problem. With replication factor 3 across three zones, two nodes per zone, a single zone outage can take two replicas of a range. To survive a full zone or region loss, use replication factor 5 spread across at least five failure domains, or apply replica-placement constraints so no two replicas of a range share a zone. CockroachDB’s multi-region and zone-config features exist precisely for this.
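A placement can be checked against this rule before an outage tests it. The sketch below is a simplified model: it looks only at the zone of each replica and asks whether losing the worst single zone still leaves a majority. The function name and zone labels are hypothetical.

```python
from collections import Counter


def survives_any_single_zone_loss(replica_zones) -> bool:
    """True if a majority of replicas survives the loss of any one zone."""
    quorum = len(replica_zones) // 2 + 1
    worst_loss = max(Counter(replica_zones).values())  # most replicas in one zone
    return len(replica_zones) - worst_loss >= quorum


if __name__ == "__main__":
    # Replication factor 3, one replica per zone: a zone outage costs one replica.
    print(survives_any_single_zone_loss(["az-a", "az-b", "az-c"]))
    # Replication factor 3 with two replicas sharing a zone: that zone can take quorum.
    print(survives_any_single_zone_loss(["az-a", "az-a", "az-b"]))
    # Replication factor 5 across five failure domains.
    print(survives_any_single_zone_loss(["z1", "z2", "z3", "z4", "z5"]))
```

The second case is the one behind the worked example: two replicas of a range in the same zone means one zone outage removes the majority.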