# WAL Lag Bytes (primary -> standby), PostgreSQL

> WAL Lag Bytes (primary -> standby) for PostgreSQL instances. Tracked live in Vortex IQ Nerve Centre. How to read it, why it matters, and how to act on it.

**Card class:** [Sensitivity](/nerve-centre/overview#card-classes-explained)  •  **Category:** [Autovacuum & Bloat](/nerve-centre/connectors#connectors-by-type)

## At a glance

> **WAL Lag Bytes** is the volume of write-ahead log, measured in bytes, that the primary has generated but a streaming standby has not yet received or replayed. It is computed as `pg_wal_lsn_diff(primary_lsn, standby_lsn)`: the byte distance between the primary's current write position and where the standby has caught up to. A small, stable number is healthy. A number that climbs and keeps climbing means the standby is falling behind, and the further behind it gets, the more data you stand to lose in a failover and the longer a promotion will take to replay.

|                           |                                                                                                                                                                                                                                                                                                    |
| ------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Source columns**        | `pg_stat_replication` on the primary (`sent_lsn`, `write_lsn`, `flush_lsn`, `replay_lsn`) compared against `pg_current_wal_lsn()`, differenced with `pg_wal_lsn_diff()`. The card surfaces the gap to each connected standby.                                                                      |
| **Metric basis**          | Byte distance in the WAL stream, not time. Bytes are the truth source; seconds-of-lag (see [Replication Lag (seconds)](/nerve-centre/kpi-cards/postgresql/replication-lag-seconds)) is derived and can read zero on an idle primary even when bytes are outstanding.                               |
| **What "lag" means here** | By default the card measures the send gap (bytes not yet shipped to the standby). The replay gap (bytes shipped but not yet applied) is tracked separately and surfaced when it diverges from the send gap, which is the signature of a standby that is receiving fine but replaying slowly. |
| **Aggregation window**    | Real-time, sampled every refresh cycle. Sustained growth across consecutive samples is the concerning pattern, not a single spike during a bulk write.                                                                                                                                             |
| **Multiple standbys**     | One reading per connected standby. The headline shows the worst (largest) lag across all standbys, since failover readiness is gated by your worst-positioned replica. |
| **Time window**           | `RT` (real-time, sampled every refresh cycle).                                                                                                                                                                                                                                                     |
| **Alert trigger**         | `> 1GB` of outstanding WAL to any standby. A gigabyte of unshipped WAL is roughly the point where a `wal_keep_size` overrun and slot bloat become real risks.                                                                                                                                      |
| **Roles**                 | DBA, platform engineering, SRE.                                                                                                                                                                                                                                                                    |
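
As a rough illustration of the **Multiple standbys** and **Alert trigger** rows, the sketch below picks the worst standby and checks it against the threshold. It is written in Python purely for illustration; the function names, the sample readings, and the reading of `1GB` as 2^30 bytes are assumptions, not the engine's actual code.

```python
from typing import Dict, Optional, Tuple

MIB = 1024 ** 2
GIB = 1024 ** 3
# Assumption: the card's "> 1GB" trigger is taken to mean 2**30 bytes.
ALERT_THRESHOLD_BYTES = 1 * GIB


def headline_lag(lag_by_standby: Dict[str, int]) -> Optional[Tuple[str, int]]:
    """Return (standby, lag_bytes) for the largest lag, or None if no standby is connected."""
    if not lag_by_standby:
        # No connected standby means no reading at all, not a lag of zero.
        return None
    worst = max(lag_by_standby, key=lambda name: lag_by_standby[name])
    return worst, lag_by_standby[worst]


def alert_tripped(lag_by_standby: Dict[str, int]) -> bool:
    """True when the worst standby is more than the threshold behind."""
    worst = headline_lag(lag_by_standby)
    return worst is not None and worst[1] > ALERT_THRESHOLD_BYTES


# Hypothetical readings: one healthy local standby, one lagging remote standby.
readings = {"standby-a": 12 * MIB, "standby-b": int(1.4 * GIB)}
print(headline_lag(readings))
print(alert_tripped(readings))
```

Returning `None` for an empty input, rather than zero, mirrors the point that a missing standby produces no reading instead of a healthy one.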

## Calculation

The engine queries `pg_stat_replication` on the primary and, for each connected standby, computes the byte gap with `pg_wal_lsn_diff()`:

```sql theme={null}
SELECT
  application_name,
  client_addr,
  state,
  pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)   AS send_lag_bytes,
  pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn)  AS flush_lag_bytes,
  pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
```

`send_lag_bytes` is what the headline reports by default (bytes the primary has not yet sent). `flush_lag_bytes` adds the bytes received but not yet durably written on the standby. `replay_lag_bytes` adds the bytes written but not yet applied. The three values are nested: replay lag is always greater than or equal to flush lag, which is greater than or equal to send lag. When the replay gap balloons while the send gap stays small, the network is fine but the standby cannot keep up with apply, often because a long-running read query on the standby is blocking WAL replay (a hot-standby conflict).

An LSN (log sequence number) is a position in the WAL stream expressed as a hex pair such as `3A/7F2C8B40`. `pg_wal_lsn_diff()` subtracts two LSNs and returns the byte distance, so the card is reading the literal number of bytes between two points in the log.
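
To make the subtraction concrete, here is a minimal Python sketch of the same arithmetic outside the database. It assumes the standard LSN text format, where the part before the slash is the upper 32 bits of a 64-bit position and the part after is the lower 32 bits. The helper names `parse_lsn` and `lsn_diff` are hypothetical and are not PostgreSQL functions.

```python
def parse_lsn(lsn: str) -> int:
    """Convert an LSN such as '3A/7F2C8B40' into an absolute byte position."""
    high, low = lsn.split("/")
    # Upper 32 bits come before the slash, lower 32 bits after it.
    return (int(high, 16) << 32) | int(low, 16)


def lsn_diff(newer: str, older: str) -> int:
    """Byte distance between two LSNs, analogous to pg_wal_lsn_diff(newer, older)."""
    return parse_lsn(newer) - parse_lsn(older)


# Two positions inside the same 4 GiB range, 0x40 bytes apart.
print(lsn_diff("3A/7F2C8B40", "3A/7F2C8B00"))  # 64
# Two positions that straddle the boundary where the high part increments.
print(lsn_diff("3B/00000000", "3A/FFFFFFF0"))  # 16
```

The second call shows why the two halves cannot be subtracted independently: the position is a single 64-bit number.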

## Worked example

A platform team runs a primary with two streaming standbys: `standby-a` (same availability zone, serves read traffic) and `standby-b` (cross-region, kept for disaster recovery). Snapshot taken on 14 Apr 26 at 09:40 BST during a nightly bulk reindex job.

| Standby   | State     | Send lag | Flush lag | Replay lag |
| --------- | --------- | -------- | --------- | ---------- |
| standby-a | streaming | 12 MB    | 14 MB     | 18 MB      |
| standby-b | streaming | 1.4 GB   | 1.4 GB    | 1.4 GB     |

The card headline reads **1.4 GB** (the worst standby) and the sensitivity threshold of `> 1GB` has tripped on `standby-b`. The team reads the two rows very differently:

1. **standby-a is healthy.** Twelve megabytes of send lag with eighteen megabytes of replay lag is normal churn during a bulk write. The replay gap is only marginally larger than the send gap, so apply is keeping pace. No action.
2. **standby-b has fallen behind across the board.** Send, flush, and replay all sit at 1.4 GB and are equal to each other. Because the send gap itself is large, this is not a slow-apply problem on the standby; it is a shipping problem. The cross-region link is not carrying WAL as fast as the primary generates it.

```text theme={null}
Sizing the exposure on standby-b:
  - Outstanding WAL: 1.4 GB and rising ~120 MB/min during the reindex
  - wal_keep_size on the primary: 2 GB
  - Headroom before the primary recycles WAL the standby still needs: 0.6 GB (~5 min at current rate)
  - If the slot is non-reserved and headroom is exhausted: standby-b needs a full re-sync
  - Replication slot in use (reserved): primary will retain WAL instead, but pg_wal volume grows
```
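
The headroom line above is simple arithmetic, sketched here in Python so it can be rerun with your own figures. The function name is hypothetical, 1 GB is assumed to be 1024 MB, and the growth rate is assumed to stay constant, which it rarely does for long.

```python
from typing import Tuple


def wal_headroom(wal_keep_size_mb: float, outstanding_mb: float,
                 growth_mb_per_min: float) -> Tuple[float, float]:
    """Return (headroom_mb, minutes_until_exhausted) for a standby without a reserving slot."""
    headroom_mb = max(wal_keep_size_mb - outstanding_mb, 0.0)
    if growth_mb_per_min <= 0:
        # Lag is flat or shrinking, so the headroom is not being consumed.
        return headroom_mb, float("inf")
    return headroom_mb, headroom_mb / growth_mb_per_min


# Figures from the worked example: 2 GB kept, 1.4 GB outstanding, ~120 MB/min growth.
headroom, minutes = wal_headroom(2 * 1024, 1.4 * 1024, 120)
print(f"{headroom:.0f} MB of headroom, about {minutes:.1f} min at the current rate")
```

With the example's inputs this lands on roughly 0.6 GB and five minutes, matching the figures in the block above.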

The action is time-boxed, and the team has three options: throttle the reindex; raise `wal_keep_size` and confirm the replication slot is reserving WAL (so the standby is not abandoned); or accept that `standby-b` will need a rebuild. The DR standby being 1.4 GB behind means a failover to it right now would lose the last several minutes of writes, which is the recovery-point objective (RPO) breach this card exists to catch.

Three takeaways:

1. **Bytes, not seconds, are the honest measure.** On an idle primary the seconds-of-lag card can read zero while gigabytes sit unshipped, because no new commits are arriving to timestamp. Always read WAL lag in bytes when you are sizing failover risk.
2. **The shape of the three numbers tells you where the bottleneck is.** Large send gap = shipping/network problem. Small send gap but large replay gap = apply problem on the standby, usually a hot-standby query conflict.
3. **A reserved replication slot is double-edged.** It guarantees the primary keeps WAL the standby still needs, which protects the standby, but if the standby never catches up the primary's `pg_wal` directory grows without bound and you risk filling the data disk. Pair this card with [Database Disk Usage %](/nerve-centre/kpi-cards/postgresql/database-disk-usage).
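
The second takeaway can be sketched as a small Python classifier. This is only an illustration: the `diagnose` function, its labels, and the choice of 1 GiB as the cut-off for "large" are assumptions, and the third sample row is hypothetical rather than taken from the worked example.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def diagnose(send_lag_bytes: int, replay_lag_bytes: int,
             large_bytes: int = GIB) -> str:
    """Classify a standby by the shape of its send and replay gaps."""
    if send_lag_bytes > large_bytes:
        # WAL is not leaving the primary fast enough: shipping/network problem.
        return "shipping"
    if replay_lag_bytes > large_bytes:
        # WAL arrives but is not applied: apply problem on the standby.
        return "apply"
    return "healthy"


samples = {
    "standby-a": (12 * MIB, 18 * MIB),
    "standby-b": (int(1.4 * GIB), int(1.4 * GIB)),
    "standby-c (hypothetical)": (10 * MIB, 2 * GIB),
}
for name, (send_lag, replay_lag) in samples.items():
    print(name, diagnose(send_lag, replay_lag))
```

Checking the send gap first matters: because the gaps are nested, a large send gap always implies a large replay gap, so testing replay first would mislabel shipping problems as apply problems.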

## Sibling cards

| Card                                                                                                                                                    | Why pair it with WAL Lag Bytes                           | What the combination tells you                                                                                                                       |
| ------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- |
| [Replication Lag (seconds)](/nerve-centre/kpi-cards/postgresql/replication-lag-seconds)                                                                 | The time-based view of the same gap.                     | Bytes high but seconds low equals an idle primary with unshipped WAL; bytes and seconds both high equals a standby genuinely behind on live traffic. |
| [Replication Lag Exceeds Threshold or Standby Unreachable](/nerve-centre/kpi-cards/postgresql/replication-lag-exceeds-threshold-or-standby-unreachable) | The alert-list escalation of this metric.                | When WAL lag crosses threshold or a standby drops to `state=BROKEN`, this is where the page fires.                                                   |
| [Active Streaming Replicas](/nerve-centre/kpi-cards/postgresql/active-streaming-replicas)                                                               | The topology count.                                      | If a standby disappears from the replica count, its WAL lag stops being reported, which can mask the problem rather than resolve it.                 |
| [Failover Readiness](/nerve-centre/kpi-cards/postgresql/failover-readiness)                                                                             | The promotion-readiness composite.                       | High WAL lag on every standby means no replica is safe to promote without data loss.                                                                 |
| [Database Disk Usage %](/nerve-centre/kpi-cards/postgresql/database-disk-usage)                                                                         | The disk pressure a reserved slot creates.               | A reserved slot feeding a stuck standby grows `pg_wal`; watch disk as lag persists.                                                                  |
| [PostgreSQL Health Score](/nerve-centre/kpi-cards/postgresql/postgresql-health-score)                                                                   | The executive composite that weights replication health. | Sustained WAL lag pulls the composite down even while latency and errors look fine.                                                                  |

## Reconciling against the source

**Where to look in PostgreSQL directly:**

> Run `SELECT application_name, pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes FROM pg_stat_replication;` on the **primary** for the per-standby byte gap.
> On a **standby**, `SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();` shows received vs replayed positions locally.
> Check the configured replication slots with `SELECT slot_name, active, restart_lsn, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes FROM pg_replication_slots;` to see how much WAL the primary is retaining for each consumer.

**Why our number may legitimately differ from a raw query:**

| Reason                     | Direction | Why                                                                                                                                                                                                         |
| -------------------------- | --------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Sample timing**          | Marginal  | The card samples on its refresh cycle; a query you run by hand a second later sees a different LSN, especially under heavy write load.                                                                      |
| **Send vs replay basis**   | Variable  | The headline defaults to the send gap; a query selecting `replay_lsn` reports the larger replay gap. Compare like for like.                                                                                 |
| **Worst-standby headline** | Higher    | The card shows the largest lag across all standbys; a query filtered to one standby shows only that standby's gap.                                                                                          |
| **Managed-service metric** | Variable  | On RDS / Aurora the console's `ReplicaLag` CloudWatch metric is reported in seconds, not bytes, and is computed differently from `pg_wal_lsn_diff`. Treat it as a corroborating signal, not an exact match. |

**On managed services:** Amazon RDS and Aurora expose replica lag through the CloudWatch `ReplicaLag` (seconds) and `AuroraReplicaLag` (milliseconds) metrics and the read-replica list in the RDS console; Cloud SQL surfaces `replication/replica_lag` in Cloud Monitoring; Azure Database for PostgreSQL exposes `physical_replication_delay_in_bytes`, which is the closest managed-service match to this card. Use the byte-based metric where the provider offers one.

## Known limitations / FAQs

**The seconds-of-lag card reads zero but WAL Lag Bytes shows 800 MB. Which is right?**
Both are right; they measure different things. Seconds-of-lag is derived from the timestamp of the last replayed transaction. On an idle or low-write primary there are no fresh commits to timestamp, so the seconds value collapses to zero even though unshipped bytes exist. Bytes are the honest measure of how much data a failover would lose. When sizing recovery-point risk, trust the bytes.

**My send gap is tiny but the replay gap is huge. What does that mean?**
The standby is receiving WAL fine but cannot apply it fast enough. The usual cause is a hot-standby conflict: a long-running read query on the standby holds a snapshot that blocks WAL replay (PostgreSQL pauses apply rather than cancelling the query, unless `max_standby_streaming_delay` forces a cancellation). Check `pg_stat_activity` on the standby for long-running queries, and review `max_standby_streaming_delay`.

**Why is the threshold 1 GB rather than a time?**
Bytes are the basis the card actually measures, and a byte threshold behaves consistently regardless of write rate. One gigabyte is the point where typical `wal_keep_size` settings start to overrun and where a reserved replication slot begins materially growing the primary's `pg_wal` directory. Tune it to your own `wal_keep_size` and disk headroom in the Sensitivity tab.

**A standby vanished from the card entirely. Is lag zero now?**
No, it is unknown, which is worse. `pg_stat_replication` only lists connected standbys, so a disconnected standby produces no row and no lag reading. The disappearance itself is the alarm. Cross-check with [Active Streaming Replicas](/nerve-centre/kpi-cards/postgresql/active-streaming-replicas) and the standby-unreachable alert.

**Does a replication slot make this safe to ignore?**
A reserved replication slot guarantees the primary will not recycle WAL the standby still needs, so the standby will eventually catch up rather than requiring a rebuild. But it shifts the risk: the retained WAL accumulates in the primary's `pg_wal` directory and can fill the data disk if the standby never recovers. Watch disk usage whenever lag persists with a reserved slot.

**Can WAL lag be negative?**
Only transiently. A standby's reported LSN can momentarily appear ahead of the value sampled from the primary because of sampling skew between the two reads. The engine clamps such transient negatives to zero; a persistent negative would indicate a clock or sampling fault and is treated as a no-read.
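
One way the clamping rule could look, sketched in Python. This is not the engine's implementation: the function name and the `max_streak` cut-off for what counts as "persistent" are assumptions introduced only to make the behaviour concrete.

```python
from typing import Optional, Tuple


def normalise_lag(raw_lag_bytes: int, negative_streak: int,
                  max_streak: int = 3) -> Tuple[Optional[int], int]:
    """Return (reading, updated_negative_streak) for one sample.

    max_streak is a hypothetical cut-off for treating negatives as persistent.
    """
    if raw_lag_bytes >= 0:
        # Normal sample: pass it through and reset the streak.
        return raw_lag_bytes, 0
    negative_streak += 1
    if negative_streak >= max_streak:
        # Persistent negative: report no reading rather than a misleading zero.
        return None, negative_streak
    # Transient negative from sampling skew: clamp to zero.
    return 0, negative_streak


streak = 0
for raw in [4096, -512, 2048, -64, -64, -64]:
    reading, streak = normalise_lag(raw, streak)
    print(raw, "->", reading)
```

The caller carries the streak between samples, so a single skewed sample is absorbed while a run of them surfaces as a missing reading.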

***

### Tracked live in Vortex IQ Nerve Centre

*WAL Lag Bytes (primary -> standby)* is one of hundreds of KPI pulses Vortex IQ tracks across PostgreSQL and 70+ other ecommerce connectors. Nerve Centre runs the detection layer; Vortex Mind investigates the cause when something moves; Ask Viq lets you interrogate any number in plain English.

[Start for free](https://app.vortexiq.ai/login) or [book a demo](https://www.vortexiq.ai/contact-us) to see this metric running on your own data.
