At a glance
A yes / no readiness verdict: if the primary died this second, is there a healthy standby that could be promoted with near-zero data loss? Readiness is true only when at least one standby is connected, streaming, caught up (lag under one second), and confirmed receiving WAL. This is the card a platform team glances at before every risky deploy, every maintenance window, and every time the pager goes quiet for too long. A green primary with no promotable standby is a single point of failure dressed up as high availability.
| What it tracks | Whether the cluster can survive losing the primary right now. “Failover Readiness for the selected period.” Readiness combines standby presence, streaming state, replication lag, and WAL-receipt confirmation into one verdict. |
| Data source | On the primary: pg_stat_replication (one row per connected standby, with state, sync_state, write_lsn, flush_lsn, replay_lsn). On the standby: pg_last_wal_receive_lsn() and pg_last_wal_replay_lsn() to confirm it is applying WAL. For managed clusters (RDS Multi-AZ, Aurora, Cloud SQL HA) the engine reads the provider’s replica health and the standby’s apply lag. |
| Time window | RT (real-time, evaluated on the live polling cycle). |
| Alert trigger | no healthy standby with lag <1s. If every standby is disconnected, broken, or lagging beyond one second, readiness flips to not-ready and the card pages the on-call DBA. |
| Readiness criteria | (1) At least one standby row in pg_stat_replication; (2) its state is streaming; (3) replay lag under 1 second; (4) WAL is being received (flush_lsn advancing). All four must hold. |
| What does NOT count | Standbys in catchup or backup state (still warming up), cascading replicas behind another standby, logical-replication subscribers (they are not promotable as a physical standby), and async standbys that have fallen far behind. |
| Roles | owner, engineering, operations |
Calculation
The verdict is computed every polling cycle from the primary’s pg_stat_replication view plus a confirmation read from each standby.
For each connected standby the engine derives replay lag two ways:
- LSN distance: pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) on the primary gives the byte gap between what the primary has written and what the standby has replayed. The gap is converted to a time estimate using recent apply throughput.
- Time-based: pg_last_xact_replay_timestamp() on the standby, compared against the current time, gives the wall-clock age of the last replayed transaction. This is the figure used for the one-second threshold because it is the closest proxy for “how much committed data would we lose if we promoted right now”.
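The LSN-distance arithmetic can be reproduced outside the database. The sketch below is illustrative only: it assumes LSNs arrive as text in PostgreSQL's usual `high/low` hexadecimal form, and the function names and the `apply_bytes_per_sec` input are hypothetical, not part of the engine.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert an LSN such as '16/B374D848' to an absolute byte position.

    The part before the slash is the high 32 bits, the part after is the low 32 bits.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lsn_gap_bytes(current_lsn: str, replay_lsn: str) -> int:
    """Byte gap between two LSNs, mirroring pg_wal_lsn_diff(current, replay)."""
    return lsn_to_bytes(current_lsn) - lsn_to_bytes(replay_lsn)


def estimate_lag_seconds(gap_bytes: int, apply_bytes_per_sec: float) -> float:
    """Rough time estimate: outstanding bytes divided by recent apply throughput."""
    if gap_bytes <= 0:
        return 0.0  # standby has replayed everything the primary has written
    if apply_bytes_per_sec <= 0:
        return float("inf")  # nothing is being applied, so the gap never closes
    return gap_bytes / apply_bytes_per_sec


gap = lsn_gap_bytes("0/3000060", "0/3000000")
print(gap)                              # 96
print(estimate_lag_seconds(gap, 48.0))  # 2.0
```

The estimate is only as good as the throughput figure fed into it, which is one reason the time-based measurement is preferred for the threshold.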
A standby qualifies when it is in streaming state with time-based replay lag under one second. The card then reports:
- Ready when at least one synchronous or low-lag asynchronous standby meets every criterion.
- Not ready when no standby qualifies: none connected, the only standby is in catchup, lag exceeds one second, or WAL receipt has stalled.
A synchronous standby (sync_state = sync) guarantees zero data loss on promotion because the primary waits for it to flush before acknowledging commits. An asynchronous standby under one second of lag is “almost zero loss” but not guaranteed. The card treats both as ready, but the drill-down distinguishes them, because for a strict RPO of zero only a synchronous standby qualifies.
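The qualification rules and the sync / async distinction can be sketched as a small function. This is a minimal illustration, not the engine's implementation: the `Standby` class, its field names, and the `wal_advancing` flag (assumed to be computed elsewhere by comparing flush_lsn across polls) are hypothetical.

```python
from dataclasses import dataclass

LAG_THRESHOLD_S = 1.0  # the one-second default described above


@dataclass
class Standby:
    name: str
    state: str           # pg_stat_replication.state
    sync_state: str      # pg_stat_replication.sync_state
    replay_lag_s: float  # time-based replay lag, in seconds
    wal_advancing: bool  # assumed input: True if flush_lsn moved since the last poll


def qualifies(s: Standby, threshold_s: float = LAG_THRESHOLD_S) -> bool:
    """A standby counts only if it is streaming, under the lag threshold, and receiving WAL."""
    return s.state == "streaming" and s.replay_lag_s < threshold_s and s.wal_advancing


def readiness(standbys, threshold_s: float = LAG_THRESHOLD_S) -> dict:
    """Return the overall verdict plus the detail a drill-down would need."""
    ok = [s for s in standbys if qualifies(s, threshold_s)]
    return {
        "ready": bool(ok),
        # Strict RPO of zero needs a qualifying synchronous standby.
        "zero_loss": any(s.sync_state == "sync" for s in ok),
        "qualifying": [s.name for s in ok],
    }


# One synchronous standby keeping up, one asynchronous standby far behind.
# wal_advancing=True is an assumption for both.
snapshot = [
    Standby("db-standby-a", "streaming", "sync", 0.4, True),
    Standby("db-standby-b", "streaming", "async", 47.0, True),
]
print(readiness(snapshot))
# {'ready': True, 'zero_loss': True, 'qualifying': ['db-standby-a']}
```

Returning the list of qualifying standbys alongside the boolean keeps the single verdict simple while still letting a strict zero-RPO policy check `zero_loss` separately.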
On managed HA (RDS Multi-AZ, Aurora, Cloud SQL HA) the provider abstracts the standby, so the engine reads the provider’s reported replica health and apply lag rather than pg_stat_replication directly. The readiness logic is the same: a healthy, low-lag standby equals ready.
Worked example
A platform team runs a self-managed PostgreSQL 16 cluster: one primary, one synchronous standby in the same region, one asynchronous standby in a DR region. Snapshot taken on 22 May 26 at 16:40 BST, fifteen minutes before a planned schema migration.
| Standby | state | sync_state | Replay lag | Qualifies? |
|---|---|---|---|---|
| db-standby-a (same region) | streaming | sync | 0.2 s | Yes |
| db-standby-b (DR region) | streaming | async | 0.9 s | Yes |
During the migration, an ALTER TABLE on a 200 GB table generates a flood of WAL:
| Standby | state | sync_state | Replay lag | Qualifies? |
|---|---|---|---|---|
| db-standby-a (same region) | streaming | sync | 0.4 s | Yes |
| db-standby-b (DR region) | streaming | async | 47 s | No (lag > 1 s) |
Readiness still reads Ready here because db-standby-a continues to qualify. Contrast that with a cluster whose only standby dropped into catchup after a network blip and never recovered. The primary looks perfectly healthy. Readiness reads Not ready and pages the DBA, who discovers there has been no promotable standby for six days. Without this card, that gap would only be discovered the moment the primary actually failed, which is the most expensive possible time to learn it.
Three lessons platform teams should carry:
- A healthy primary tells you nothing about failover safety. The whole point of this card is that the primary can be perfectly green while you have zero ability to recover from its loss. Read readiness independently of primary health.
- Lag is the data-loss meter. The one-second threshold is a proxy for RPO. A standby 47 seconds behind means promoting it loses up to 47 seconds of committed transactions. For strict zero-loss requirements, only a synchronous standby counts.
- Standbys fail silently. A standby can drop into catchup, stall on WAL receipt, or fall behind a slow link without anyone noticing, because nothing user-facing breaks until you actually need it. Continuous readiness monitoring is the only reliable way to know.
Sibling cards
| Card | Why pair it with Failover Readiness | What the combination tells you |
|---|---|---|
| Replication Lag (seconds) | The raw lag number that drives the readiness verdict. | Lag creeping toward one second is your early warning that readiness is about to flip. |
| Active Streaming Replicas | The count of standbys actually streaming. | A drop here is the most common reason readiness goes not-ready: a standby disconnected. |
| WAL Lag Bytes (primary to standby) | The byte gap underlying time-based lag. | Sustained WAL-byte growth means the standby cannot keep up and readiness will degrade. |
| Replication Lag Exceeds Threshold or Standby Unreachable | The alert feed for broken replication. | An entry here usually corresponds to readiness flipping to not-ready. |
| PostgreSQL Health Score | The composite that includes replication health as a component. | Not-ready failover drags the composite down even when latency and errors look fine. |
| Last Successful Backup (hours ago) | The other half of your recovery story: PITR if failover is impossible. | No promotable standby plus stale backup equals you have no recovery path at all. |
| Database Disk Usage % | A full standby disk stops WAL apply and breaks readiness. | High standby disk plus rising lag equals the standby is about to stall. |
Reconciling against the source
Where to look in PostgreSQL:
- On the primary: SELECT application_name, state, sync_state, replay_lag, pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS bytes_behind FROM pg_stat_replication; shows every connected standby, its state, and how far behind it is.
- On the standby: SELECT pg_is_in_recovery(); should return true, and SELECT now() - pg_last_xact_replay_timestamp() AS replay_age; gives the wall-clock lag that the one-second threshold compares against.
- WAL receipt confirmation: SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(); on the standby; if receive is advancing but replay is stuck, the standby is receiving but not applying.
- Managed HA: the RDS / Aurora console shows the Multi-AZ standby and replica lag under the instance detail; Cloud SQL shows the HA failover replica health on the instance overview.
Why our verdict may legitimately differ from a manual check:
| Reason | Direction | Why |
|---|---|---|
| Sampling moment | Either way | A manual pg_stat_replication query and our poll can land seconds apart; during a deploy lag can swing across the one-second line between them. |
| Time-based vs LSN lag | Either way | We use pg_last_xact_replay_timestamp() for the threshold; an LSN-byte check can look fine on a quiet write workload (few bytes behind) yet the time figure tells the true RPO story. |
| Sync vs async treatment | We may read ready when you expect not | We count a sub-one-second async standby as ready; if your policy requires synchronous-only, the drill-down flags whether the qualifying standby is sync or async. |
| Managed abstraction | Possible gap | On RDS / Aurora we read the provider’s reported health, which can lag the actual apply state by the provider’s own publish interval. |
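A manual check against the primary can be scripted so that it is repeatable and timestamped, which makes sampling-moment differences easier to spot. The sketch below assumes any Python DB-API 2.0 cursor (for example one from psycopg2) connected to the primary; the function name and the commented connection string are hypothetical.

```python
PRIMARY_SQL = """
SELECT application_name, state, sync_state, replay_lag,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS bytes_behind
FROM pg_stat_replication
"""


def fetch_standbys(cursor):
    """Run the primary-side query and return one dict per connected standby."""
    cursor.execute(PRIMARY_SQL)
    # DB-API 2.0: the first item of each description entry is the column name.
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Usage, assuming psycopg2 is installed and the DSN points at your primary:
#
#   import psycopg2
#   conn = psycopg2.connect("dbname=postgres")
#   with conn.cursor() as cur:
#       for standby in fetch_standbys(cur):
#           print(standby)
#   conn.close()
```

An empty list from `fetch_standbys` means no standby is connected at that instant, which corresponds to the first readiness criterion failing.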
| Source | Expected relationship | What causes divergence |
|---|---|---|
| pg_stat_replication on primary | Should match our standby list and states | A standby just connected or just dropped between samples. |
| pg_last_xact_replay_timestamp() on standby | Should match our time-based lag | Clock skew between primary and standby hosts skews the comparison; keep NTP tight. |
| Provider HA console (RDS/Aurora/Cloud SQL) | Should agree on standby health | Provider publish interval lags live state by up to a minute. |
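When reproducing the time-based figure by hand, two quirks are worth handling. pg_last_xact_replay_timestamp() reports the commit time of the last replayed transaction, so on a primary with no recent writes the age keeps growing even though nothing is missing, and clock skew can make the age come out negative. The sketch below shows one common way to guard against both. It is an assumption for illustration, not a description of how the engine computes lag, and the function name is hypothetical.

```python
from datetime import datetime, timezone


def replay_age_seconds(receive_lsn, replay_lsn, last_replay_ts, now=None):
    """Wall-clock replay lag in seconds, as seen from the standby.

    Mirrors now() - pg_last_xact_replay_timestamp(), with two guards:
    - if everything received has been replayed, report 0 instead of an
      ever-growing age on an idle primary;
    - clamp negative ages (clock skew) to 0.
    """
    if last_replay_ts is None:
        return None  # no transaction has been replayed yet
    if receive_lsn is not None and receive_lsn == replay_lsn:
        return 0.0
    now = now or datetime.now(timezone.utc)
    return max(0.0, (now - last_replay_ts).total_seconds())


now = datetime(2026, 5, 22, 15, 40, 0, tzinfo=timezone.utc)
last = datetime(2026, 5, 22, 15, 39, 59, 100000, tzinfo=timezone.utc)
print(replay_age_seconds("0/3000060", "0/3000000", last, now))  # 0.9
print(replay_age_seconds("0/3000060", "0/3000060", last, now))  # 0.0
```

The equal-LSN guard only says the standby has applied what it received. It says nothing about WAL the standby never received, so it should be paired with a check that WAL receipt is still advancing.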
Known limitations / FAQs
The primary is perfectly healthy. Why is this card paging me about failover?
Because primary health and failover safety are independent. This card warns that if the primary failed right now, you have no good way to recover: no standby is connected, the only one is too far behind, or WAL apply has stalled. A green primary with no promotable standby is a single point of failure. The page is telling you to fix the standby before you need it, which is the only cheap time to do so.
What is the difference between this card and Replication Lag?
Replication Lag gives you the raw number of seconds a standby is behind. Failover Readiness turns that, plus standby presence, streaming state, and WAL receipt, into a single actionable verdict: can you fail over safely or not. Lag is the input; readiness is the decision. You can have low lag and still be not-ready if the standby just disconnected.
My standby shows lag under one second but the card says not ready. Why?
Check its state. A standby in catchup or backup state is warming up and not yet a valid promotion target even if its current LSN gap looks small. Also confirm WAL is actually being received and replayed: pg_last_wal_receive_lsn() should be advancing. A standby that is “connected” but not applying WAL is not promotable, so readiness correctly reads not-ready.
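To confirm by hand that WAL is being received and replayed, take two samples of pg_last_wal_receive_lsn() and pg_last_wal_replay_lsn() a few seconds apart and compare them. The sketch below is a hypothetical helper, assuming each sample is a `(receive_lsn, replay_lsn)` pair of LSN strings; the labels it returns are illustrative and not the card's own states.

```python
def lsn_to_int(lsn: str) -> int:
    """Parse an LSN like '0/3000060' (two hexadecimal halves) into one integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_progress(prev: tuple, curr: tuple) -> str:
    """Classify progress between two (receive_lsn, replay_lsn) samples."""
    received = lsn_to_int(curr[0]) > lsn_to_int(prev[0])
    replayed = lsn_to_int(curr[1]) > lsn_to_int(prev[1])
    if received and replayed:
        return "receiving and applying"
    if received:
        return "receiving but not applying"
    if replayed:
        return "applying backlog, nothing new received"
    # Two standby-side samples alone cannot tell a quiet primary from a stall.
    return "no movement (idle primary or stalled standby)"


print(wal_progress(("0/3000000", "0/3000000"), ("0/3000060", "0/3000000")))
# receiving but not applying
```

The last case is ambiguous on purpose: distinguishing an idle primary from a stalled standby needs the primary's current WAL position as well.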
Does a logical-replication subscriber count as a standby for failover?
No. Logical replication copies selected tables, not the whole cluster, and a subscriber is a full read-write primary in its own right, not a physical standby you can promote to take over the original’s identity and full dataset. Only physical streaming standbys count toward readiness. If your only “replica” is a logical subscriber, the card will read not-ready, which is correct.
On RDS Multi-AZ I never see the standby in pg_stat_replication. Is the card guessing?
RDS Multi-AZ hides the standby; you cannot query it or see it in pg_stat_replication from the primary. On managed HA the engine reads the provider’s reported replica health and apply lag instead. The readiness logic is identical, but the data source is the provider’s metrics rather than the PostgreSQL view. The verdict is as reliable as the provider’s own health reporting.
Why one second and not a more relaxed threshold?
One second is a strict but practical RPO target: it means promotion would lose at most about a second of committed transactions. Teams with looser recovery objectives can raise the sensitivity threshold in the Sensitivity tab so an async DR standby a few seconds behind still counts as ready. The default is deliberately conservative because most teams underestimate how much a few seconds of lost orders costs.
Can this card actually trigger a failover?
No. It is a readiness verdict, not an automation. It tells you whether a safe failover is possible; the actual promotion (pg_ctl promote, a Patroni / repmgr action, or the managed provider’s failover button) is a human or orchestrator decision. The card’s job is to make sure that when you reach for that button, the standby behind it is actually ready.