At a glance
Alerts for Unavailable or Under-Replicated Ranges: the firing list for the two most serious states a CockroachDB range can be in. This card is CockroachDB-distinctive: it watches the replication layer that no single-node database has. An unavailable range has lost quorum, so some of your data cannot be read or written right now. An under-replicated range has fewer healthy replicas than the configured replication factor: it still works, but it is one more failure away from going unavailable. Any unavailable range is a data-availability incident, full stop. Sustained under-replication is the warning that the cluster cannot self-heal fast enough.
| Field | Detail |
|---|---|
| What it tracks | CockroachDB-distinctive: quorum loss or replication gap. Any unavailable range is a data-availability incident and a potential data-loss risk. Fires on unavailable_ranges > 0 immediately, OR on under_replicated_ranges > 0 held for 5 minutes. |
| Data source | The ranges.unavailable and ranges.underreplicated time-series metrics, also surfaced on the DB Console Replication dashboard (“Ranges” and “Under-replicated/Unavailable Ranges” panels) and queryable via crdb_internal range views. |
| Metric basis | Range-state counts from the replication layer, independent of SQL-level metrics. A cluster can be answering queries on healthy ranges while a subset of ranges is unavailable. |
| Time window | Real time, evaluated continuously; the under-replicated arm requires a sustained 5-minute breach so that normal, self-healing rebalancing does not fire. Unavailable ranges are treated as urgent the moment they appear. |
| Alert trigger | unavailable_ranges > 0 (no delay) OR under_replicated_ranges > 0 sustained for 5 minutes. |
| What counts as a firing | (1) Any range reporting unavailable; (2) Under-replicated ranges that persist above zero for 5 continuous minutes. |
| What does NOT fire | (1) Brief under-replication during a rolling restart or planned rebalance that clears within 5 minutes; (2) Replica moves (the up-replicate / down-replicate churn of a healthy balancer) that never drop a range below quorum. |
| Roles | DBA, platform, SRE |
Calculation
CockroachDB splits all data into ranges, and each range is replicated (by default 3 replicas, sometimes 5 for system or critical ranges). The replicas form a Raft group that needs a majority (quorum) to serve reads and commit writes. This card watches two range-state counters.
Unavailable (ranges.unavailable) counts ranges that have lost quorum: too many replicas are on dead or partitioned nodes for the Raft group to reach a majority. While a range is unavailable, statements touching its key span block or error. This is the data-availability emergency arm, so it is treated as urgent the instant the counter is non-zero.
Under-replicated (ranges.underreplicated) counts ranges that currently have fewer live replicas than their configured replication factor (for example 2 live replicas on a range configured for 3). The range still has quorum and works, but it has lost its safety margin: one more replica failure could push it to unavailable. CockroachDB normally self-heals this by up-replicating onto a healthy node within minutes. Because brief under-replication is expected during rolling restarts, node decommissioning, and rebalancing, the under-replicated arm requires a sustained 5-minute breach before it fires. A sustained breach means the cluster cannot self-heal, usually because a node is down, the balancer is overloaded, or there is nowhere healthy to place the missing replica.
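The two definitions above reduce to simple majority arithmetic. The sketch below is illustrative only: `quorum`, `classify_range`, and `RangeState` are hypothetical names, not CockroachDB APIs, and it assumes every replica is a voting replica and reports only the most severe state that applies.

```python
from enum import Enum


class RangeState(Enum):
    HEALTHY = "healthy"
    UNDER_REPLICATED = "under-replicated"
    UNAVAILABLE = "unavailable"


def quorum(replication_factor: int) -> int:
    """Smallest majority of a Raft group with this many voting replicas."""
    return replication_factor // 2 + 1


def classify_range(live_replicas: int, replication_factor: int) -> RangeState:
    """Report the most severe state that applies to a single range."""
    if live_replicas < quorum(replication_factor):
        return RangeState.UNAVAILABLE       # majority lost: cannot serve traffic
    if live_replicas < replication_factor:
        return RangeState.UNDER_REPLICATED  # still has quorum, safety margin gone
    return RangeState.HEALTHY


if __name__ == "__main__":
    for live in (3, 2, 1):
        print(live, classify_range(live, 3).value)
```

With a replication factor of 3, two live replicas still form a majority (under-replicated), while one does not (unavailable).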
Each firing carries the unavailable count, the under-replicated count, and the nodes/stores implicated so the on-call engineer can map ranges to failed hardware.
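The trigger logic described in this section can be sketched as follows. This is a minimal illustration rather than the card's actual implementation: `Sample`, `firing_arms`, and the polling model are assumptions, and the sketch treats a breach as sustained once the counter has been above zero for at least 5 minutes of consecutive samples.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Optional

SUSTAIN = timedelta(minutes=5)  # sustain window for the under-replicated arm


@dataclass
class Sample:
    at: datetime
    unavailable: int        # value of ranges.unavailable at this poll
    under_replicated: int   # value of ranges.underreplicated at this poll


def firing_arms(samples: Iterable[Sample], sustain: timedelta = SUSTAIN) -> List[str]:
    """Return the arms firing as of the last sample (samples must be time-ordered)."""
    breach_start: Optional[datetime] = None
    arms: List[str] = []
    for s in samples:
        # Track how long under-replication has been continuously above zero.
        if s.under_replicated > 0:
            if breach_start is None:
                breach_start = s.at
        else:
            breach_start = None
        arms = []
        if s.unavailable > 0:
            arms.append("unavailable")  # no sustain delay on this arm
        if breach_start is not None and s.at - breach_start >= sustain:
            arms.append("under_replicated")
    return arms
```

A blip that clears inside the window resets `breach_start`, so a rolling restart does not fire, while a single sample with a non-zero unavailable count fires at once.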
Worked example
A platform team runs a 6-node CockroachDB cluster, replication factor 3, spread across three availability zones (two nodes per zone). Snapshot taken on 18 Apr 26 at 03:20 BST, after a zone-level network incident took two nodes offline at 03:08.
| Time (BST) | Live nodes | Under-replicated | Unavailable | State |
|---|---|---|---|---|
| 03:05 | 6 | 0 | 0 | healthy |
| 03:08 | 4 | 0 | 0 | two nodes drop (zone B) |
| 03:09 | 4 | 1,240 | 0 | ranges lose a replica each |
| 03:14 | 4 | 1,180 | 0 | self-heal stalling |
| 03:20 | 4 | 1,160 | 38 | alert fires (unavailable + sustained under-replication) |
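The "self-heal stalling" reading can be checked with quick arithmetic on the under-replicated column. The sketch below uses only the three non-zero readings from the table; the helper name `heal_rates` is hypothetical, and the time-to-clear figure assumes the latest rate simply continues, which is a rough linear extrapolation rather than a prediction.

```python
# (minutes after 03:00 BST, under-replicated count) taken from the table
readings = [(9, 1240), (14, 1180), (20, 1160)]


def heal_rates(points):
    """Ranges re-replicated per minute between consecutive readings."""
    return [
        (prev_n - n) / (t - prev_t)
        for (prev_t, prev_n), (t, n) in zip(points, points[1:])
    ]


rates = heal_rates(readings)
latest_rate = rates[-1]
# Minutes to clear the backlog if the latest rate held (assumption: linear).
eta_minutes = readings[-1][1] / latest_rate if latest_rate > 0 else float("inf")

print([round(r, 2) for r in rates])  # [12.0, 3.33]
print(round(eta_minutes))            # 348
```

The healing rate falls from 12 ranges per minute to roughly 3.3, and at that pace the remaining 1,160 ranges would take close to six hours to clear, which is why the under-replicated arm is treated as a failure to self-heal rather than a transient.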
- Treat unavailable as the priority. 38 unavailable ranges means a slice of the keyspace is unreadable and unwritable. Cross-read Unavailable Ranges for the live count and which tables those ranges back. The fastest recovery is to bring the two zone-B nodes back: restoring even one of them can hand quorum back to those 38 ranges instantly.
- Confirm the node loss. Check Cluster Node Count and Active Nodes (status=live). Both should read 4, confirming the two-node loss is the root cause rather than a metrics artefact.
- Decide on the under-replication path. If the zone-B nodes are coming back within minutes, do nothing further: the under-replicated ranges will re-replicate onto the returning nodes and the unavailable ranges recover. If the nodes are gone for good, the cluster will eventually re-replicate the under-replicated ranges onto the four healthy nodes, because those ranges still have quorum. The 38 unavailable ranges cannot heal themselves: without quorum a range cannot change its own replica set, so each one needs its lost replicas back, a manual loss-of-quorum recovery (which can lose recent writes), or a restore from backup. A range with no surviving replica can only come back from a restore, which is why Last Successful Backup (hours ago) matters at exactly this moment.
- Unavailable and under-replicated are not the same severity. Under-replicated is a warning (safety margin lost, usually self-heals). Unavailable is an active outage (data cannot be served). The card fires on either, but you triage unavailable first, always.
- Sustained under-replication means self-healing failed. A brief blip during a restart is normal and will not fire. A 5-minute-plus breach means the cluster cannot place the missing replicas: a node is down, disks are full, or placement constraints leave nowhere legal to put them.
- Backups are the floor under this card. If a range loses every replica, no amount of cluster recovery brings it back; only a restore does. The freshness of your backup is the difference between minutes of recovery and permanent data loss.
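The "unavailable first, always" rule can be expressed as a sort order when several firings are open at once. This is a sketch under assumptions: `Firing` and `triage_order` are hypothetical names, and the tie-breaking by count is an illustrative choice, not something the card prescribes.

```python
from typing import List, NamedTuple


class Firing(NamedTuple):
    cluster: str           # hypothetical label for where the firing came from
    unavailable: int
    under_replicated: int


def triage_order(firings: List[Firing]) -> List[Firing]:
    """Firings with unavailable ranges first, then by descending counts."""
    return sorted(
        firings,
        key=lambda f: (f.unavailable == 0, -f.unavailable, -f.under_replicated),
    )


if __name__ == "__main__":
    queue = [
        Firing("staging", 0, 5000),
        Firing("eu-prod", 1, 0),
        Firing("uk-prod", 38, 1160),
    ]
    for f in triage_order(queue):
        print(f.cluster, f.unavailable, f.under_replicated)
```

Note that a single unavailable range outranks thousands of under-replicated ones, matching the severity split described above.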
Sibling cards
| Card | Why pair it with Unavailable or Under-Replicated Ranges | What the combination tells you |
|---|---|---|
| Unavailable Ranges | The live count for the urgent arm of this alert. | Confirms how much of the keyspace is currently unserveable. |
| Under-Replicated Ranges | The live count for the warning arm. | A non-zero sustained value means the balancer cannot keep up. |
| Cluster Node Count | The usual root cause: a lost node. | A drop in node count immediately before the firing pinpoints the failed node. |
| Active Nodes (status=live) | The live-node headcount feeding quorum. | If live nodes fell, the replication gap is a direct consequence. |
| Raft Quiescent Lag (seconds) | Replication-health peer. | High Raft lag alongside under-replication means replicas are struggling to catch up. |
| Decommissioning Nodes | A stuck decommission is a common cause of sustained under-replication. | A long-running decommission with persistent under-replication means draining is blocked. |
| Last Successful Backup (hours ago) | The recovery floor when ranges lose all replicas. | A stale backup at the moment of an unavailable firing is the worst case: no clean restore point. |
| CockroachDB Health Score | The executive composite this alert dominates. | Any unavailable range should push the health score well below the alert line. |
Reconciling against the source
Where to look natively:
- DB Console Replication dashboard for the “Under-replicated Ranges” and “Unavailable Ranges” panels (the canonical live view).
- DB Console Problem Ranges report (Advanced Debug page) to list the specific ranges, their replicas, and which nodes hold them.
- `SELECT range_id, unavailable, under_replicated FROM crdb_internal.ranges WHERE unavailable OR under_replicated;` to enumerate the affected ranges and trace them to tables. The crdb_internal schema changes between versions, so confirm these column names on your cluster before relying on the query.
- `cockroach node status --ranges` for the per-node replica and range health view.
- CockroachDB Cloud Metrics tab plots the same `ranges.unavailable` and `ranges.underreplicated` series; the cluster page flags replication health.

Why our number may legitimately differ from the native view:
| Reason | Direction | Why |
|---|---|---|
| Sustain filter (under-replicated) | Vortex IQ fires less often | The DB Console graph shows every momentary under-replication during a rebalance; this card only fires when under-replication persists for 5 minutes. |
| Unavailable urgency | Same | Both surface unavailable ranges immediately; there is no sustain delay on that arm. |
| Aggregation timing | Brief lag | Range counts are gossiped and polled; a count can settle a poll later than the live DB Console graph. |
| Counting scope | Either way | This card counts cluster-wide range states; the Problem Ranges report can additionally show ranges with other issues (no leaseholder, raft leader/leaseholder split) that are not part of this card’s two counters. |

How this card should line up with related cards:
| Card | Expected relationship | What causes divergence |
|---|---|---|
| Cluster Node Count | A node-count drop should precede most firings. | A firing with no node loss points to disk-full stores or placement-constraint deadlock rather than a dead node. |
| CRDB Inventory Rows vs Ecom Inventory | Unavailable ranges backing inventory tables can stall stock sync. | Inventory drift appearing during an unavailable firing confirms the outage is hitting customer-facing data. |