Card class: Sensitivity
Category: Autovacuum & Bloat

At a glance

WAL Lag Bytes is the volume of write-ahead log, measured in bytes, that the primary has generated but a streaming standby has not yet received or replayed. It is computed as pg_wal_lsn_diff(primary_lsn, standby_lsn): the byte distance between the primary’s current write position and where the standby has caught up to. A small, stable number is healthy. A number that climbs and keeps climbing means the standby is falling behind, and the further behind it gets, the more data you stand to lose in a failover and the longer a promotion will take to replay.
Source columns: pg_stat_replication on the primary (sent_lsn, write_lsn, flush_lsn, replay_lsn) compared against pg_current_wal_lsn(), differenced with pg_wal_lsn_diff(). The card surfaces the gap to each connected standby.
Metric basis: Byte distance in the WAL stream, not time. Bytes are the truth source; seconds-of-lag (see Replication Lag (seconds)) is derived and can read zero on an idle primary even when bytes are outstanding.
What “lag” means here: By default the card measures the send/flush gap (bytes not yet shipped to the standby). The replay gap (bytes shipped but not yet applied) is tracked separately and surfaced when it diverges from the send gap, which is the signature of a standby that is receiving fine but replaying slowly.
Aggregation window: Real-time, sampled every refresh cycle. Sustained growth across consecutive samples is the concerning pattern, not a single spike during a bulk write.
Multiple standbys: One reading per connected standby. The headline shows the worst (largest) lag across all standbys, since failover readiness is gated by your best-positioned replica.
Time window: RT (real-time, sampled every refresh cycle).
Alert trigger: > 1 GB of outstanding WAL to any standby. A gigabyte of unshipped WAL is roughly the point where a wal_keep_size overrun and slot bloat become real risks.
Roles: DBA, platform engineering, SRE.
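The headline, alert trigger, and "sustained growth" rules above can be sketched in a few lines. This is an illustrative sketch only, not the card's actual implementation: the function names, the decimal reading of "1 GB", and the choice of three consecutive increases as "sustained" are all assumptions.

```python
# Hypothetical sketch: names, sample data and thresholds are illustrative only.
ALERT_THRESHOLD_BYTES = 1_000_000_000  # "> 1 GB", assuming decimal gigabytes


def headline_lag(lag_by_standby):
    """Return (standby, lag_bytes) for the largest lag, or None if no standby is connected."""
    if not lag_by_standby:
        return None
    name = max(lag_by_standby, key=lag_by_standby.get)
    return name, lag_by_standby[name]


def is_sustained_growth(samples, min_consecutive=3):
    """True if each of the last `min_consecutive` steps between samples was an increase."""
    if len(samples) < min_consecutive + 1:
        return False
    recent = samples[-(min_consecutive + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Per-standby send lag in bytes (made-up readings).
lags = {"standby-a": 12_000_000, "standby-b": 1_400_000_000}
worst = headline_lag(lags)
alerting = worst is not None and worst[1] > ALERT_THRESHOLD_BYTES
```

A single spike followed by a drop does not satisfy `is_sustained_growth`, which mirrors the distinction drawn above between a bulk-write blip and a standby that keeps falling behind.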

Calculation

The engine queries pg_stat_replication on the primary and, for each connected standby, computes the byte gap with pg_wal_lsn_diff():
SELECT
  application_name,
  client_addr,
  state,
  pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)   AS send_lag_bytes,
  pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn)  AS flush_lag_bytes,
  pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
send_lag_bytes is what the headline reports by default (bytes the primary has not yet sent). flush_lag_bytes adds the bytes received but not yet durably written on the standby. replay_lag_bytes adds the bytes written but not yet applied. The three values are nested: replay lag is always greater than or equal to flush lag, which is greater than or equal to send lag.
When the replay gap balloons while the send gap stays small, the network is fine but the standby cannot keep up with apply, often because a long-running read query on the standby is blocking WAL replay (a hot-standby conflict).
An LSN (log sequence number) is a position in the WAL stream expressed as a hex pair such as 3A/7F2C8B40. pg_wal_lsn_diff() subtracts two LSNs and returns the byte distance, so the card is reading the literal number of bytes between two points in the log.
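To make the LSN arithmetic concrete, here is a small sketch that mimics what pg_wal_lsn_diff() does with the textual form: the part before the slash is the high 32 bits of a 64-bit position and the part after it is the low 32 bits. The helper names parse_lsn and lsn_diff are hypothetical, not PostgreSQL functions.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN such as '3A/7F2C8B40' into a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lsn_diff(lsn_a: str, lsn_b: str) -> int:
    """Byte distance lsn_a - lsn_b, analogous to pg_wal_lsn_diff(lsn_a, lsn_b)."""
    return parse_lsn(lsn_a) - parse_lsn(lsn_b)


# The primary is at 3A/7F2C8B40 and the standby has replayed up to 3A/7F2C8000.
gap = lsn_diff("3A/7F2C8B40", "3A/7F2C8000")  # 0xB40 = 2880 bytes
```

Because the two halves form one 64-bit number, the subtraction stays correct when the positions straddle a change in the high word.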

Worked example

A platform team runs a primary with two streaming standbys: standby-a (same availability zone, serves read traffic) and standby-b (cross-region, kept for disaster recovery). Snapshot taken on 14 Apr 26 at 09:40 BST during a nightly bulk reindex job.
| Standby | State | Send lag | Flush lag | Replay lag |
|---|---|---|---|---|
| standby-a | streaming | 12 MB | 14 MB | 18 MB |
| standby-b | streaming | 1.4 GB | 1.4 GB | 1.4 GB |
The card headline reads 1.4 GB (the worst standby) and the sensitivity threshold of > 1 GB has tripped on standby-b. The team reads the two rows very differently:
  1. standby-a is healthy. Twelve megabytes of send lag with eighteen megabytes of replay lag is normal churn during a bulk write. The replay gap is only marginally larger than the send gap, so apply is keeping pace. No action.
  2. standby-b has fallen behind across the board. Send, flush, and replay are all stuck at 1.4 GB and equal to each other. Because the send gap itself is large, this is not a slow-apply problem on the standby; it is a shipping problem. The cross-region link is not carrying WAL off the primary as fast as the reindex generates it.
Sizing the exposure on standby-b:
  - Outstanding WAL: 1.4 GB and rising ~120 MB/min during the reindex
  - wal_keep_size on the primary: 2 GB
  - Headroom before the primary recycles WAL the standby still needs: 0.6 GB (~5 min at current rate)
  - If no replication slot is reserving WAL and headroom is exhausted: standby-b needs a full re-sync
  - If a replication slot is in use (WAL reserved): the primary retains WAL instead, but the pg_wal volume grows
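The headroom arithmetic in the list above can be reproduced as a short sketch. It assumes decimal units (1 GB = 1000 MB), which is what the rounded figures in the example imply, and a constant growth rate; headroom_minutes is a hypothetical helper name.

```python
def headroom_minutes(wal_keep_mb, outstanding_mb, growth_mb_per_min):
    """Return (headroom_mb, minutes_left) before the primary may recycle needed WAL.

    minutes_left is None when lag is not growing, so no deadline applies.
    """
    headroom_mb = max(wal_keep_mb - outstanding_mb, 0)
    if growth_mb_per_min <= 0:
        return headroom_mb, None
    return headroom_mb, headroom_mb / growth_mb_per_min


# Figures from the worked example: wal_keep_size 2 GB, 1.4 GB outstanding, ~120 MB/min growth.
headroom_mb, minutes_left = headroom_minutes(2000, 1400, 120)
# headroom_mb == 600 (0.6 GB), minutes_left == 5.0
```

The deadline only applies when no replication slot is reserving WAL; with a reserved slot the same growth rate instead describes how fast pg_wal is filling.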
The action is time-boxed: the team either throttles the reindex, raises wal_keep_size and confirms the replication slot is reserving WAL (so the standby is not abandoned), or accepts that standby-b will need a rebuild. The DR standby being 1.4 GB behind means a failover to it right now would lose the last several minutes of writes, which is the recovery-point objective (RPO) breach this card exists to catch.
Three takeaways:
  1. Bytes, not seconds, are the honest measure. On an idle primary the seconds-of-lag card can read zero while gigabytes sit unshipped, because no new commits are arriving to timestamp. Always read WAL lag in bytes when you are sizing failover risk.
  2. The shape of the three numbers tells you where the bottleneck is. Large send gap = shipping/network problem. Small send gap but large replay gap = apply problem on the standby, usually a hot-standby query conflict.
  3. A reserved replication slot is double-edged. It guarantees the primary keeps WAL the standby still needs, which protects the standby, but if the standby never catches up the primary’s pg_wal directory grows without bound and you risk filling the data disk. Pair this card with Database Disk Usage %.
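The second takeaway (reading the shape of the numbers) can be expressed as a rough classifier. This is a sketch under stated assumptions: the 1 GB cut-off reuses the card's alert threshold for both checks, and the labels and function name are hypothetical rather than anything the card emits.

```python
GB = 1_000_000_000  # decimal gigabyte, an assumption


def classify_bottleneck(send_lag, replay_lag, threshold=GB):
    """Rough triage of where replication is stalling, from send and replay lag in bytes."""
    if send_lag > threshold:
        # WAL is not leaving the primary fast enough.
        return "shipping"
    if replay_lag - send_lag > threshold:
        # WAL arrives but the standby is slow to apply it.
        return "apply"
    return "healthy"


standby_a = classify_bottleneck(12_000_000, 18_000_000)
standby_b = classify_bottleneck(1_400_000_000, 1_400_000_000)
slow_apply = classify_bottleneck(5_000_000, 2_000_000_000)  # made-up reading
```

With the worked example's figures, standby-a comes out as healthy and standby-b as a shipping problem, matching the team's reading above.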

Sibling cards

| Card | Why pair it with WAL Lag Bytes | What the combination tells you |
|---|---|---|
| Replication Lag (seconds) | The time-based view of the same gap. | Bytes high but seconds low equals an idle primary with unshipped WAL; bytes and seconds both high equals a standby genuinely behind on live traffic. |
| Replication Lag Exceeds Threshold or Standby Unreachable | The alert-list escalation of this metric. | When WAL lag crosses threshold or a standby drops to state=BROKEN, this is where the page fires. |
| Active Streaming Replicas | The topology count. | If a standby disappears from the replica count, its WAL lag stops being reported, which can mask the problem rather than resolve it. |
| Failover Readiness | The promotion-readiness composite. | High WAL lag on every standby means no replica is safe to promote without data loss. |
| Database Disk Usage % | The disk pressure a reserved slot creates. | A reserved slot feeding a stuck standby grows pg_wal; watch disk as lag persists. |
| PostgreSQL Health Score | The executive composite that weights replication health. | Sustained WAL lag pulls the composite down even while latency and errors look fine. |

Reconciling against the source

Where to look in PostgreSQL directly:
Run SELECT application_name, pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes FROM pg_stat_replication; on the primary for the per-standby byte gap. On a standby, SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(); shows received vs replayed positions locally. Check the configured replication slots with SELECT slot_name, active, restart_lsn, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes FROM pg_replication_slots; to see how much WAL the primary is retaining for each consumer.
Why our number may legitimately differ from a raw query:
| Reason | Direction | Why |
|---|---|---|
| Sample timing | Marginal | The card samples on its refresh cycle; a query you run by hand a second later sees a different LSN, especially under heavy write load. |
| Send vs replay basis | Variable | The headline defaults to the send gap; a query selecting replay_lsn reports the larger replay gap. Compare like for like. |
| Worst-standby headline | Higher | The card shows the largest lag across all standbys; a query filtered to one standby shows only that standby’s gap. |
| Managed-service metric | Variable | On RDS / Aurora the console’s ReplicaLag CloudWatch metric is reported in seconds, not bytes, and is computed differently from pg_wal_lsn_diff. Treat it as a corroborating signal, not an exact match. |
On managed services: Amazon RDS and Aurora expose replica lag through the CloudWatch ReplicaLag / AuroraReplicaLag metrics (seconds) and the read-replica list in the RDS console; Cloud SQL surfaces replication/replica_lag in Cloud Monitoring; Azure Database for PostgreSQL exposes physical_replication_delay_in_bytes, which is the closest managed-service match to this card. Use the byte-based metric where the provider offers one.

Known limitations / FAQs

The seconds-of-lag card reads zero but WAL Lag Bytes shows 800 MB. Which is right?
Both are right; they measure different things. Seconds-of-lag is derived from the timestamp of the last replayed transaction. On an idle or low-write primary there are no fresh commits to timestamp, so the seconds value collapses to zero even though unshipped bytes exist. Bytes are the honest measure of how much data a failover would lose. When sizing recovery-point risk, trust the bytes.
My send gap is tiny but the replay gap is huge. What does that mean?
The standby is receiving WAL fine but cannot apply it fast enough. The usual cause is a hot-standby conflict: a long-running read query on the standby holds a snapshot that blocks WAL replay (PostgreSQL pauses apply rather than cancelling the query, unless max_standby_streaming_delay forces a cancellation). Check pg_stat_activity on the standby for long-running queries, and review max_standby_streaming_delay.
Why is the threshold 1 GB rather than a time?
Bytes are the basis the card actually measures, and a byte threshold behaves consistently regardless of write rate. One gigabyte is the point where typical wal_keep_size settings start to overrun and where a reserved replication slot begins materially growing the primary’s pg_wal directory. Tune it to your own wal_keep_size and disk headroom in the Sensitivity tab.
A standby vanished from the card entirely. Is lag zero now?
No, it is unknown, which is worse. pg_stat_replication only lists connected standbys, so a disconnected standby produces no row and no lag reading. The disappearance itself is the alarm. Cross-check with Active Streaming Replicas and the standby-unreachable alert.
Does a replication slot make this safe to ignore?
A reserved replication slot guarantees the primary will not recycle WAL the standby still needs, so the standby will eventually catch up rather than requiring a rebuild. But it shifts the risk: the retained WAL accumulates in the primary’s pg_wal directory and can fill the data disk if the standby never recovers. Watch disk usage whenever lag persists with a reserved slot.
Can WAL lag be negative?
Briefly, a standby’s reported LSN can appear ahead of the value sampled from the primary because of sampling skew between the two reads. The engine clamps such transient negatives to zero; a persistent negative would indicate a clock or sampling fault and is treated as a no-read.
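The negative-lag handling described in the last answer might look roughly like the sketch below. How the engine decides that a negative is "persistent" is not documented here, so the rule of three consecutive negative samples, the use of None for a no-read, and the function name are all assumptions.

```python
def normalise_lag(raw_samples, max_negative_run=3):
    """Clamp transient negative lag to 0; mark a persistent negative run as None (no-read)."""
    cleaned = []
    negative_run = 0
    for raw in raw_samples:
        if raw < 0:
            negative_run += 1
            # Short runs are treated as sampling skew; longer runs as a fault.
            cleaned.append(None if negative_run >= max_negative_run else 0)
        else:
            negative_run = 0
            cleaned.append(raw)
    return cleaned


readings = normalise_lag([100, -5, 200, -1, -1, -1, 50])
# readings == [100, 0, 200, 0, 0, None, 50]
```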
