At a glance
Replication Lag (Seconds_Behind_Source) is how far, in seconds, a replica is behind the source it copies from. Zero means the replica is caught up; a rising number means writes on the source are not yet visible on the replica. Lag is the number that decides whether your read replicas are safe to read from and whether a failover would lose data. A replica 30 seconds behind serves stale catalogue, stale inventory, and stale order state, which is how customers see “out of stock” on something they just bought, or an order that “does not exist yet”.
| Field | Detail |
|---|---|
| What it tracks | The Seconds_Behind_Source value reported by the replica (the field was Seconds_Behind_Master before MySQL 8.0.22). It estimates the delay between an event being written on the source and applied on this replica. |
| Data source | SHOW REPLICA STATUS on each replica, or the equivalent performance_schema.replication_applier_status_by_worker and replication_connection_status tables. The engine reads the worst (highest) lag across all attached replicas for the headline. |
| Time window | RT (real-time, sampled every refresh). |
| Alert trigger | > 10s. When any replica’s Seconds_Behind_Source exceeds 10 seconds the card turns red and a Nerve Centre alert is raised. |
| Why it matters | Stale reads cause user-visible wrong data; deep lag means a failover would promote a replica that is missing the most recent transactions, which is data loss. Lag is both a correctness and a durability signal. |
| Reading the value | Read lag next to Replication Thread Health. Lag that is high but stable is a throughput problem; lag that is NULL means a thread has stopped, which is worse than any number. |
| Sentiment key | mysql_replication_lag_seconds |
| Roles | owner, engineering, operations |
Calculation
The card surfaces Seconds_Behind_Source directly from each replica’s status, then reports the maximum across the topology so a single lagging node cannot hide behind healthy ones.
What Seconds_Behind_Source actually measures. It is the difference between the replica’s clock and the timestamp of the event the SQL (applier) thread is currently executing. This has two well-known quirks the engine accounts for:
- It only reflects the applier thread. If the IO (receiver) thread has stopped but the applier is still chewing through its relay log, the replica can briefly report a small, falling number while in reality it has stopped receiving new data. That is why this card is always read with Replication Thread Health.
- NULL is not zero. When a replication thread is stopped, Seconds_Behind_Source reports NULL, not a large number. The engine treats NULL as “lag unknown / threads not running” and escalates rather than rendering it as caught up.
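The rules above (take the maximum, never treat NULL as zero, always check thread state) can be sketched in a few lines. This is an illustrative sketch only, not the engine's actual implementation: the function name `headline_lag` and the return shape are assumptions, and each row is assumed to be a dictionary built from one replica's SHOW REPLICA STATUS output, using the Seconds_Behind_Source, Replica_IO_Running and Replica_SQL_Running fields.

```python
from typing import Mapping, Optional, Tuple


def headline_lag(
    replicas: Mapping[str, Mapping[str, object]],
) -> Tuple[str, Optional[int], Optional[str]]:
    """Return (state, worst_lag_seconds, replica_name).

    Hypothetical helper. `replicas` maps a replica name to one row of
    SHOW REPLICA STATUS. A NULL lag (None) or a thread that is not
    running yields state "unknown", so it is never shown as caught up.
    """
    worst_name: Optional[str] = None
    worst_lag: Optional[int] = None
    for name, row in replicas.items():
        lag = row.get("Seconds_Behind_Source")
        threads_ok = (
            row.get("Replica_IO_Running") == "Yes"
            and row.get("Replica_SQL_Running") == "Yes"
        )
        # Escalate instead of rendering NULL or a stopped thread as zero.
        if lag is None or not threads_ok:
            return ("unknown", None, name)
        lag = int(lag)
        if worst_lag is None or lag > worst_lag:
            worst_name, worst_lag = name, lag
    return ("ok", worst_lag, worst_name)
```

Called with one replica at 47 seconds and another at 2 seconds, both with running threads, the sketch returns the 47-second replica as the headline. A replica reporting 0 with a stopped IO thread is reported as "unknown" rather than healthy.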
Worked example
A platform team runs one MySQL 8.0 source and two read replicas: replica-a serves the catalogue API, replica-b serves analytics. Snapshot taken on 22 Apr 26 at 09:40 BST after a bulk price-update job.
| Node | Seconds_Behind_Source | IO thread | SQL thread | Reading |
|---|---|---|---|---|
| replica-a | 47 s | Yes | Yes | Lagging badly; serving stale catalogue. |
| replica-b | 2 s | Yes | Yes | Healthy. |
- Both threads are running, so this is throughput, not breakage. Replication Thread Health is green. The replica is receiving and applying, just not fast enough to keep up with the source’s write rate.
- The trigger is the bulk price-update job. A single large transaction updating 400,000 catalogue rows landed on the source in seconds, but replica-a applies it serially and is now behind. replica-b lags less because it is on faster storage.
- The customer-visible risk is the catalogue. Because replica-a backs the catalogue API, shoppers may see the pre-update prices for up to 47 seconds. The team temporarily routes catalogue reads to the source until lag clears, then routes them back.
- The follow-up is to enable parallel apply (replica_parallel_workers) so future bulk jobs apply in parallel rather than serially.
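The temporary rerouting in this example can be expressed as a small decision function. This is a minimal sketch under stated assumptions: `choose_read_target` is a hypothetical name, the 10-second default mirrors the card's alert trigger, and how an application actually switches connections is outside its scope.

```python
from typing import Optional


def choose_read_target(
    lag_seconds: Optional[int],
    threads_running: bool,
    max_lag_seconds: int = 10,
) -> str:
    """Return "replica" when the replica is safe to read, else "source".

    Hypothetical helper. A NULL lag (None) or a stopped thread always
    falls back to the source, because the replica's freshness is unknown.
    """
    if not threads_running or lag_seconds is None:
        return "source"
    if lag_seconds > max_lag_seconds:
        return "source"
    return "replica"
```

With the snapshot above, replica-a at 47 seconds falls back to the source while replica-b at 2 seconds keeps serving reads. Once replica-a drops back under the threshold, the same check routes catalogue reads to it again.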
Three takeaways:
- Lag is a data-correctness signal, not just a performance one. A lagging replica serves stale rows. If you read inventory or pricing from replicas, lag directly causes wrong answers to customers.
- A stable high number and a NULL are different emergencies. Stable high lag is a throughput shortfall you can engineer around (parallel workers, faster storage, smaller transactions). NULL means a thread stopped, which is the Replication Thread Health emergency.
- Big transactions are the usual cause. A single huge UPDATE or DELETE serialises on the replica even when the source applied it quickly. Batch large DML into smaller chunks to keep lag flat.
Sibling cards
| Card | Why pair it with Replication Lag | What the combination tells you |
|---|---|---|
| Replication Thread Health (IO/SQL) | The thread-state check behind the lag number. | Lag NULL plus a stopped thread equals replication broken, not just slow. |
| Active Replicas | The count of attached replicas. | A drop in replica count plus rising lag equals a replica that disconnected mid-stream. |
| Binlog Backlog (MB) on Primary | The source-side view of unconsumed binlog. | A growing backlog confirms the replica is falling behind from the source’s perspective. |
| Replication Threads Stopped or Lag Exceeds Threshold | The Nerve Centre alert that fires on this condition. | The alert feed entry that pages on-call when lag breaches 10s. |
| Query Latency p99 (ms) | Replica apply competes with read queries. | A busy replica with a fat tail applies the relay log slower, deepening lag. |
| InnoDB Buffer Pool Hit Rate % | Apply speed depends on cache warmth on the replica. | A cold replica buffer pool slows apply and grows lag. |
| MySQL Inventory Rows vs Ecom Inventory Count | The downstream drift caused by stale replica reads. | Lag plus inventory drift equals the storefront reading a stale replica. |
| MySQL Health Score | The composite that weights replication health. | Sustained lag pulls the composite down. |
Reconciling against the source
Where to look in MySQL’s own tooling:
- SHOW REPLICA STATUS on each replica is the canonical source. Read the lag value and the thread-state fields together.
- Performance Schema for multi-threaded detail: performance_schema.replication_applier_status_by_worker and replication_connection_status.
- GTID delta for a clock-independent measure: compare Retrieved_Gtid_Set (received) against Executed_Gtid_Set (applied); the gap is transactions still to apply, immune to clock skew.
- Managed-service consoles: Amazon RDS exposes ReplicaLag in CloudWatch; Aurora exposes AuroraReplicaLag in milliseconds; both should track this card closely.

Why our number may legitimately differ:
| Reason | Direction | Why |
|---|---|---|
| Clock skew | Variable | Seconds_Behind_Source is timestamp-based; if source and replica clocks drift, the reported lag is off by the skew. Use the GTID gap for a clock-independent check. |
| Idle source | Reads 0 falsely | When the source receives no writes, Seconds_Behind_Source can read 0 even if the IO thread is stuck, because there is no new event to time. The engine cross-checks thread state. |
| Multi-threaded apply | Engine more accurate | The legacy single field can understate lag with parallel workers; the engine prefers the per-worker Performance Schema tables. |
| NULL handling | Engine escalates | A stopped thread reports NULL; raw tooling shows NULL, the card shows an alert rather than treating it as caught up. |
| Managed-service units | Marginal | Aurora reports lag in milliseconds, RDS in seconds; convert before comparing. |
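The unit mismatch in the last row is easy to get wrong when comparing by eye. The sketch below normalises a managed-service reading to seconds before comparison. It assumes only what the table states (AuroraReplicaLag in milliseconds, ReplicaLag in seconds); the function names and the 1-second tolerance are illustrative assumptions.

```python
def to_seconds(value: float, metric_name: str) -> float:
    """Convert a CloudWatch lag reading to seconds based on its metric name."""
    if metric_name == "AuroraReplicaLag":
        return value / 1000.0  # Aurora reports milliseconds
    if metric_name == "ReplicaLag":
        return float(value)  # RDS reports seconds
    raise ValueError(f"unknown lag metric: {metric_name}")


def roughly_equal(card_seconds: float, console_seconds: float,
                  tolerance_seconds: float = 1.0) -> bool:
    """True when the two readings agree within an assumed tolerance."""
    return abs(card_seconds - console_seconds) <= tolerance_seconds
```

For example, an AuroraReplicaLag reading of 2500 converts to 2.5 seconds, which is within tolerance of a card value of 2 seconds.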
Known limitations / FAQs
The card reads zero but I am sure the replica is behind. Why?
The most common trap: the source is idle. Seconds_Behind_Source is computed from the timestamp of the event currently being applied, so when no new writes arrive there is nothing to time and it reports 0 even if the IO thread is wedged. Always read this card with Replication Thread Health; if a thread is not running, the 0 is a lie.
The card shows an alert but no number. What is happening?
That is NULL, which means a replication thread is stopped. NULL is worse than a large number: a lagging replica is still catching up, but a stopped one will never catch up until you restart the thread. Jump to thread health and the replication-broken alert immediately.
Why does the headline show the worst replica instead of an average?
Because an average hides the failure. If one replica is caught up and one is 200 seconds behind, the average (100s) describes neither. The replica you happen to route a read to is what the customer experiences, so the safe headline is the worst case.
My lag is high but stable, not growing. Is that an emergency?
Less urgent than growing lag, but still a correctness problem. Stable lag means the replica is keeping pace with new writes but never closing the existing gap, usually because it started behind after a restore or a big transaction. It will not self-heal during steady load; you need a quiet window, smaller transactions, or parallel apply to drain it.
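One way to tell these cases apart is to compare the first and last lag samples over a window. The following is a rough sketch, not how the card itself decides: `classify_lag_trend`, the 2-second tolerance, and the choice to compare only the window's endpoints are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def classify_lag_trend(samples: Sequence[Optional[int]], tolerance: int = 2) -> str:
    """Classify time-ordered lag samples (seconds) over one window.

    Returns "stopped" if any sample is NULL (None), otherwise "growing",
    "draining" or "stable" based on the change from first to last sample.
    """
    if not samples:
        raise ValueError("need at least one sample")
    if any(s is None for s in samples):
        return "stopped"
    change = samples[-1] - samples[0]
    if change > tolerance:
        return "growing"
    if change < -tolerance:
        return "draining"
    return "stable"
```

A window such as 40, 41, 40, 39 classifies as stable: the replica keeps pace but the gap is not closing. A window that falls from 47 to 12 is draining and should clear on its own.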
Can I change the 10-second alert threshold?
Yes, per profile in the Sensitivity tab. An analytics replica can tolerate minutes of lag; a replica serving live inventory or session state should be tighter, for example 2 to 3 seconds. Set it to just above your normal busy-hour lag.
Why does a single big UPDATE cause a lag spike?
A large transaction commits atomically on the source, and the replica must also apply it as a single unit. Without parallel workers it is applied serially, in order with everything else in the relay log. A 400,000-row update that took 3 seconds on the source can take far longer to clear on a replica, spiking lag. Chunk large DML into smaller batches and enable replica_parallel_workers to mitigate.
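Chunking can be as simple as walking the primary-key range in fixed-size slices and committing each slice as its own transaction. The sketch below only generates the ranges and the parameterised statements. The table name `catalogue`, the `id` and `price` columns, the price change, and the batch size are hypothetical, and it assumes a reasonably dense integer primary key.

```python
from typing import Iterator, List, Tuple


def id_batches(min_id: int, max_id: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) primary-key ranges covering min_id..max_id."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    start = min_id
    while start <= max_id:
        end = min(start + batch_size - 1, max_id)
        yield start, end
        start = end + 1


# Hypothetical statement: table, columns and the 5% change are illustrative only.
UPDATE_SQL = "UPDATE catalogue SET price = price * 1.05 WHERE id BETWEEN %s AND %s"


def batched_statements(min_id: int, max_id: int,
                       batch_size: int) -> List[Tuple[str, Tuple[int, int]]]:
    """Pair the statement with each range; run and commit each pair separately."""
    return [(UPDATE_SQL, bounds) for bounds in id_batches(min_id, max_id, batch_size)]
```

With ids 1 to 400,000 and a batch size of 5,000, this produces 80 small transactions instead of one large one, so the replica can interleave them with other work rather than stalling on a single unit.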
Does GTID replication change how I read this card?
The card still surfaces Seconds_Behind_Source, but with GTIDs you have a better backup measure: the gap between Retrieved_Gtid_Set and Executed_Gtid_Set counts transactions, not seconds, and is immune to clock skew. When in doubt about the seconds figure, the GTID gap is the ground truth for “how many transactions behind”.
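The GTID gap can be counted by parsing both sets and subtracting the executed intervals from the retrieved ones. The sketch below is an assumption-laden illustration: it handles the classic untagged `uuid:start-end:start-end` text format only, assumes each set is already normalised (sorted, non-overlapping intervals per UUID, as MySQL prints them), and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) transaction numbers


def parse_gtid_set(text: str) -> Dict[str, List[Interval]]:
    """Parse 'uuid:1-5:7,uuid2:1-3' into {uuid: [(1, 5), (7, 7)], ...}."""
    result: Dict[str, List[Interval]] = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *ranges = part.split(":")
        intervals = result.setdefault(uuid.lower(), [])
        for r in ranges:
            start, _, end = r.partition("-")
            intervals.append((int(start), int(end or start)))
    return result


def count_missing(retrieved: List[Interval], executed: List[Interval]) -> int:
    """Count transaction numbers in `retrieved` not covered by `executed`."""
    missing = 0
    executed = sorted(executed)
    for start, end in retrieved:
        cursor = start  # first number not yet accounted for
        for e_start, e_end in executed:
            if e_end < cursor:
                continue
            if e_start > end:
                break
            if e_start > cursor:
                missing += e_start - cursor
            cursor = max(cursor, e_end + 1)
            if cursor > end:
                break
        if cursor <= end:
            missing += end - cursor + 1
    return missing


def gtid_gap(retrieved_set: str, executed_set: str) -> int:
    """Transactions received (Retrieved_Gtid_Set) but not yet applied."""
    retrieved = parse_gtid_set(retrieved_set)
    executed = parse_gtid_set(executed_set)
    return sum(
        count_missing(intervals, executed.get(uuid, []))
        for uuid, intervals in retrieved.items()
    )
```

If the replica has retrieved transactions 1 to 1000 from a source UUID and executed 1 to 958, the gap is 42 transactions, regardless of what either server's clock says.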