At a glance
How far behind the primary each streaming standby is, measured in seconds. Replication lag is the gap between “data committed on the primary” and “data visible on the replica”. When lag is zero (or sub-second) your read replicas are serving fresh data and a failover would lose nothing. When lag grows, two risks appear at once: read replicas start returning stale data to your application, and your recovery point objective (RPO) widens, because a primary failure now loses every transaction the standby has not yet caught up on. For a DBA this is the single most important number on a high-availability cluster.
| Data source | pg_stat_replication on the primary (one row per connected standby) and pg_last_xact_replay_timestamp() on each standby. Vortex IQ reports the time-based lag: now() - pg_last_xact_replay_timestamp() on the standby, cross-checked against the byte lag derived from sent_lsn, replay_lsn on the primary. On RDS / Aurora the value is reconciled with the provider’s ReplicaLag / AuroraReplicaLag CloudWatch metric. |
| Metric basis | Time-based replay lag in seconds, the most operationally meaningful form. Byte lag (see WAL Lag Bytes) is tracked separately because a quiet primary can show zero byte lag but rising time lag, and vice versa. |
| Aggregation window | RT: real-time, refreshed roughly every 60 seconds. The headline shows the worst (highest) lag across all standbys; per-standby detail is available on drill-down. |
| Unit | Seconds. |
| What counts | Every standby that appears in pg_stat_replication with a recognised state (streaming, catchup, backup). Cascading replicas chained off another standby are included where reachable. |
| What does NOT count | (1) Logical replication subscribers using pg_stat_subscription are tracked on their own; this card is physical streaming replication. (2) A standby that has fully disconnected (no row in pg_stat_replication) is not “lagging”, it is missing, and surfaces on Replication Lag Exceeds Threshold or Standby Unreachable. (3) Archive-based (WAL shipping) recovery lag is reported separately. |
| Time window | RT (real-time, refreshed every 60 seconds, worst-standby headline) |
| Alert trigger | >10s. Sustained lag above 10 seconds means read replicas are serving meaningfully stale data and your RPO has widened past what most HA designs tolerate. |
| Roles | owner, engineering, operations |
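The headline and alert rules in the table above can be sketched in a few lines. This is an illustrative sketch, not Vortex IQ's implementation: the `StandbySample` type, the function names, and the `min_consecutive` parameter (how many 60-second samples count as "sustained") are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class StandbySample:
    name: str            # hypothetical label for a standby
    lag_seconds: float   # time-based replay lag for that standby


def headline_lag(samples):
    """Return (name, lag) for the worst standby, or None if none are connected."""
    if not samples:
        return None  # a missing standby is "unreachable", not "lagging"
    worst = max(samples, key=lambda s: s.lag_seconds)
    return worst.name, worst.lag_seconds


def sustained_breach(history, threshold=10.0, min_consecutive=3):
    """True when the most recent min_consecutive headline values all exceed threshold.

    min_consecutive is an assumed smoothing knob, not a documented setting.
    """
    if len(history) < min_consecutive:
        return False
    return all(value > threshold for value in history[-min_consecutive:])
```

With one sample per refresh, `sustained_breach([2, 12, 15, 20])` treats three consecutive readings above 10 seconds as a breach, while a single spike that drains on the next sample does not fire.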
Calculation
PostgreSQL exposes replication progress through log sequence numbers (LSNs), monotonic pointers into the write-ahead log. The primary records, per standby, the LSN it has sent, the LSN the standby has written, the LSN the standby has flushed to disk, and the LSN the standby has replayed into the visible database. Vortex IQ derives lag two ways and reports the more operationally meaningful one.
Time-based lag (the headline). Run on each standby: SELECT now() - pg_last_xact_replay_timestamp();
Byte lag (the cross-check). Derived on the primary from sent_lsn and replay_lsn in pg_stat_replication.
On RDS / Aurora the value is reconciled with the provider’s ReplicaLag metric so a CloudWatch alarm and the Nerve Centre card agree.
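The byte-lag cross-check comes down to LSN arithmetic. PostgreSQL prints an LSN as two hexadecimal numbers separated by a slash (the high and low 32 bits of a 64-bit WAL position), so the gap between sent_lsn and replay_lsn is a plain integer subtraction. A minimal sketch, with `parse_lsn` and `byte_lag` as hypothetical helper names:

```python
def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' into a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def byte_lag(sent_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL the primary has sent that the standby has not yet replayed.

    Clamped at zero so a momentarily inconsistent snapshot never reports
    negative lag.
    """
    return max(0, parse_lsn(sent_lsn) - parse_lsn(replay_lsn))
```

Inside the database the same subtraction is available through pg_wal_lsn_diff(); the Python form is useful when the two LSNs have already been exported as text.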
Worked example
A high-availability cluster backs an order-management system: one primary, two streaming standbys (standby-a used for read traffic, standby-b reserved as the failover target). At 21:40 on 03 Jun 26 the SRE on call gets a Nerve Centre alert: replication lag has crossed 10 seconds and is climbing.
The per-standby snapshot:
| Standby | Role | Time lag | Byte lag | State |
|---|---|---|---|---|
| standby-a | read replica | 47s | 1.9 GB | streaming |
| standby-b | failover target | 2s | 40 MB | streaming |
standby-a is lagging, and both its time lag and byte lag are high. That pattern (high-and-high on one standby, healthy on the other) rules out a primary-side cause: if the primary were the problem, both standbys would lag together. The issue is specific to standby-a.
The cause was a long-running reporting query on standby-a that conflicted with incoming WAL. With max_standby_streaming_delay set high, PostgreSQL chose to delay WAL replay rather than cancel the query, so the replica fell behind. The SRE had two levers: cancel the reporting query (replay resumes, lag drains in under a minute) or lower max_standby_streaming_delay so future long queries get cancelled instead of stalling replay. They cancelled the query; lag on standby-a dropped to 1s within 90 seconds.
Three takeaways the team recorded:
- The failover target lagging is a different severity from a read replica lagging. standby-b at 2s means a failover would lose ~2 seconds of writes, acceptable. If standby-b had been the one at 47s, the RPO exposure would be the headline and pre-empting a failover would be the priority. Always read which standby is lagging, not just the worst number.
- A long query on a hot standby can stall replay. This is PostgreSQL-specific: the replica must choose between honouring a long read query and applying conflicting WAL. max_standby_streaming_delay controls the trade-off. If your read replicas run heavy reporting, expect lag spikes unless you tune this.
- Time lag without byte lag is usually benign. Had the primary simply been idle overnight, the time lag would climb while byte lag stayed near zero. That is not an incident; it is arithmetic. The high-and-high pattern is what makes this a real event.
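The time-versus-byte reading used in this example can be written as a small decision rule. This is a sketch under stated assumptions: the `classify` function, its labels, and the `byte_floor` default (16 MiB, one default-sized WAL segment) are illustrative choices, not documented Vortex IQ settings.

```python
def classify(time_lag_s, byte_lag_bytes,
             time_threshold_s=10.0,
             byte_floor=16 * 1024 * 1024):
    """Label one standby from its time lag and byte lag.

    byte_floor is an assumed cut-off below which byte lag counts as
    "near zero"; tune it to your own workload.
    """
    if time_lag_s <= time_threshold_s:
        return "healthy"
    if byte_lag_bytes <= byte_floor:
        # Time lag is climbing but there is almost no WAL outstanding.
        return "primary idle"
    # High time lag and high byte lag: the standby cannot keep up.
    return "falling behind"


# Values from the per-standby snapshot above (1.9 GB and 40 MB as decimal bytes).
snapshot = {
    "standby-a": classify(47, 1_900_000_000),
    "standby-b": classify(2, 40_000_000),
}
```

Here standby-a lands in "falling behind" and standby-b in "healthy", matching the reading in the worked example.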
Sibling cards to read alongside
| Card | Why pair it with Replication Lag | What the combination tells you |
|---|---|---|
| WAL Lag Bytes (primary to standby) | The byte-based view of the same gap. | High time lag plus high byte lag equals a standby that genuinely cannot keep up; high time, low byte equals an idle primary. |
| Active Streaming Replicas | Counts how many standbys are connected at all. | If a standby vanishes from the count, “lag” becomes “unreachable”, a worse state. |
| Failover Readiness | Whether any standby is fresh enough to promote safely. | Lag above 10s on your failover target drops readiness; this is the RPO consequence made explicit. |
| Replication Lag Exceeds Threshold or Standby Unreachable | The alert wrapper around this metric. | The KPI shows the trend; the alert card shows the active breach and which standby. |
| Query Latency p99 (ms) | A long query on a hot standby can stall replay. | p99 spike on the replica plus rising lag equals a reporting query blocking WAL apply. |
| Database Disk Usage % | A lagging standby forces the primary to retain WAL. | Rising lag plus rising primary disk equals WAL accumulating because the standby has not consumed it. |
| Last Successful Backup (hours ago) | The other half of your recovery posture. | Lag widens RPO in real time; backup age widens it on the cold-recovery path. Read together for total exposure. |
| PostgreSQL Health Score | Replication health is a weighted component. | Sustained lag drops the composite even when latency and errors look fine. |
Reconciling against the source
Where to look in PostgreSQL’s own tooling:
On the primary, the master view is pg_stat_replication. The replay_lag interval column (PostgreSQL 10+) is the primary’s own estimate of time-based lag and is the closest native equivalent to this card. On each standby, SELECT now() - pg_last_xact_replay_timestamp(); gives the replica’s view of its own staleness, and SELECT pg_is_in_recovery(); confirms it is still a standby.
On managed services:
- Amazon RDS: the ReplicaLag CloudWatch metric (seconds) is the native equivalent for read replicas. Aurora PostgreSQL uses AuroraReplicaLag (milliseconds), which is typically far lower because Aurora replicates at the storage layer, not via WAL streaming.
- Google Cloud SQL: the database/replication/replica_lag metric in Cloud Monitoring.
- Azure Database for PostgreSQL: the physical_replication_delay_in_seconds metric.
Why our number may legitimately differ from a raw read:
| Reason | Direction | Why |
|---|---|---|
| Idle primary | Raw now() - replay_timestamp looks worse | On a quiet primary, time lag grows with no real WAL to ship. The card cross-checks byte lag and annotates “primary idle” so the headline is not misread. |
| Aurora storage replication | Aurora reads far lower | Aurora does not stream WAL between instances; AuroraReplicaLag measures storage-layer lag in ms, not the WAL-replay seconds this card models for vanilla streaming. |
| Clock skew | Variable | Time-based lag depends on the standby’s clock matching the primary’s. Significant NTP drift distorts the seconds figure; the byte cross-check is clock-independent. |
| Worst-standby headline | Card may look worse than one standby | The headline is the highest lag across all standbys; a single slow replica sets it even if the failover target is fresh. Drill down for per-standby detail. |
| Sampling cadence | Brief smoothing | The card samples every ~60s; a sub-minute spike that drains immediately may be smoothed. |
| Card | Expected relationship | What causes divergence |
|---|---|---|
| datadog.replication-lag | If Datadog’s PostgreSQL integration scrapes the same pg_stat_replication, the two should agree closely. | Different scrape intervals or Datadog using byte lag vs this card’s time lag. |
| Provider CloudWatch / Cloud Monitoring alarm | Should fire in step with this card’s >10s trigger. | Provider metric granularity (1-minute CloudWatch resolution) can lag a real-time card by up to a minute. |
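When comparing this card with a provider metric by hand, the first thing to get right is the unit: ReplicaLag is in seconds while AuroraReplicaLag is in milliseconds. A minimal sketch of that normalisation, where the `provider_lag_seconds` helper and its lookup table are hypothetical. Only metrics whose unit is stated on this page are included, so the Cloud SQL metric is deliberately left out.

```python
# Divisor that converts each provider metric into seconds.
# Units are taken from the metric descriptions on this page.
DIVISOR_TO_SECONDS = {
    "ReplicaLag": 1.0,                               # RDS, seconds
    "AuroraReplicaLag": 1000.0,                      # Aurora, milliseconds
    "physical_replication_delay_in_seconds": 1.0,    # Azure, seconds
}


def provider_lag_seconds(metric_name: str, value: float) -> float:
    """Convert a raw provider reading into seconds for comparison with the card."""
    try:
        divisor = DIVISOR_TO_SECONDS[metric_name]
    except KeyError:
        raise ValueError(f"unit not known for metric {metric_name!r}") from None
    return value / divisor
```

Raising on an unknown metric is a deliberate choice: silently assuming seconds would make an Aurora reading look a thousand times worse than it is.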
Known limitations / FAQs
My replication lag shows several seconds but byte lag is near zero. Is this a real problem?
Almost certainly not. That pattern means the primary is idle: no new WAL is being generated, so the “age of the last replayed transaction” simply grows with wall-clock time. The replica is fully caught up; there is just nothing new to apply. Vortex IQ annotates this as “primary idle” on drill-down. The lag will drop to near zero the instant write traffic resumes. Watch for the high-time-and-high-byte combination; that is the real event.
Lag spikes whenever a big report runs on my read replica. Why, and how do I stop it?
This is PostgreSQL’s standby replay-conflict behaviour. A long read query on a hot standby conflicts with WAL records that would remove rows the query still needs. With max_standby_streaming_delay set high, PostgreSQL pauses WAL replay to let the query finish, and lag grows. With it set low, PostgreSQL cancels the query (you see “canceling statement due to conflict with recovery”). Tune max_standby_streaming_delay to your tolerance, or enable hot_standby_feedback so the primary holds off vacuuming rows the replica needs, at the cost of some bloat on the primary.
What is the difference between this card and WAL Lag Bytes?
This card measures lag in seconds (time-based: how stale is the replica’s data). WAL Lag Bytes measures it in bytes (how much WAL is still queued to ship). Time lag answers “how old is my replica’s data and what is my RPO?”; byte lag answers “how much data is in flight?”. They diverge on an idle or bursty primary, which is exactly why both are tracked.
Aurora reports almost no lag but vanilla PostgreSQL with the same load lags seconds. Why?
Aurora does not use WAL streaming between instances. All Aurora replicas read from the same shared storage volume, so replication is a storage-layer operation measured in milliseconds (AuroraReplicaLag). Vanilla streaming replication ships and replays WAL over the network, which is inherently slower and load-sensitive. The two are not comparable; the card models them with the appropriate native metric per platform.
A standby disappeared entirely. Does that show as infinite lag?
No. A standby that drops out of pg_stat_replication is no longer “lagging”, it is unreachable, which is a distinct and more serious state. That condition surfaces on Replication Lag Exceeds Threshold or Standby Unreachable and drops Active Streaming Replicas. This card only reflects standbys that are still connected.
The 10-second alert default is wrong for my use case.
Tune it in the Sensitivity tab. A synchronous-replication cluster might want a sub-second threshold because any lag indicates trouble; an analytics read replica where five-minute staleness is acceptable might want 60 seconds or more. Set the threshold against the staleness your application can actually tolerate and the RPO your business has signed off on.
Does sustained lag put my primary at risk, not just the replica?
It can. If a standby falls far behind, the primary must retain the WAL the standby has not yet consumed (especially with a replication slot configured). That WAL accumulates in pg_wal and consumes primary disk. A badly lagging standby with a slot can, in the worst case, fill the primary’s disk and take down the whole cluster. Watch Database Disk Usage % on the primary whenever lag is high and a slot is in use.
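The disk risk is simple arithmetic once you know how fast retained WAL is growing. A rough sketch, assuming retained WAL grows at the primary's WAL generation rate minus the rate at which the lagging standby consumes it; the function name and all inputs are hypothetical, and real growth is rarely this linear.

```python
def seconds_until_disk_full(free_bytes, wal_rate_bytes_per_s,
                            standby_consume_rate_bytes_per_s=0.0):
    """Estimate how long until retained WAL fills the primary's free disk.

    Returns None when the standby consumes WAL at least as fast as the
    primary generates it, because retained WAL is then not growing.
    """
    net_growth = wal_rate_bytes_per_s - standby_consume_rate_bytes_per_s
    if net_growth <= 0:
        return None
    return free_bytes / net_growth


# Illustrative figures only: 100 GiB free, 10 MiB/s of WAL, a stalled standby.
eta = seconds_until_disk_full(100 * 1024**3, 10 * 1024**2)
```

With those illustrative figures the estimate is 10,240 seconds, a little under three hours, which is why a stalled standby holding a slot deserves attention well before the disk alarm fires.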