At a glance
The total size, in megabytes, of binary-log data that the primary has written but the slowest replica has not yet consumed. It measures replication backlog by volume rather than by time: how much data is queued waiting to ship to and be applied by replicas. A small, stable backlog is healthy. A backlog that keeps growing means a replica is falling behind faster than it can catch up, and beyond a point the primary risks purging binlogs the replica still needs, which breaks replication outright.
| Field | Detail |
|---|---|
| Status source | Computed from SHOW BINARY LOGS (file names and sizes on the primary) and the slowest replica’s read position. The backlog is the sum of binlog bytes ahead of that replica’s Read_Source_Log_Pos across the relevant files. |
| Metric basis | Bytes of binlog written but not yet read by the slowest replica, converted to MB. This is a volume measure, complementary to the time measure Seconds_Behind_Source. |
| Aggregation window | Real-time, evaluated on each sample against the current set of replicas. |
| Alert threshold | > 1GB (1024 MB) of un-consumed binlog. Above this the engine raises the card to amber in the sensitivity feed. |
| What counts | Binlog bytes between the slowest replica’s read position and the current write position on the primary, across all binlog files in that range. |
| What does NOT count | (1) Binlog already read by every replica (the backlog is measured against the slowest one); (2) binlog files already purged; (3) relay-log data already applied on the replica; (4) data still in the primary’s in-memory binlog cache before flush. |
| Common causes | A write burst on the primary outpacing a single-threaded apply on the replica; a replica with a stopped thread (backlog grows without bound); a slow network link between primary and replica; an under-provisioned replica that cannot apply as fast as the primary writes. |
| Time zone | Volume is time-zone independent; chart axes render in the merchant display time zone set in the Vortex IQ profile. |
| Time window | RT (real-time). |
| Alert trigger | > 1GB of binlog backlog. |
| Roles | dba, platform, sre |
Calculation
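The engine takes two readings each sample. From the primary, SHOW BINARY LOGS returns every binlog file and its size, and SHOW MASTER STATUS (or SHOW BINARY LOG STATUS on 8.4+) gives the current write file and position. From each replica’s SHOW REPLICA STATUS it reads Source_Log_File and Read_Source_Log_Pos (the IO thread’s fetch position). The backlog against a given replica is the number of binlog bytes between that replica’s read position and the primary’s current write position, summed across every binlog file in that range. The card reports the largest such value across replicas, converted to MB.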
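The per-replica calculation can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the engine's actual code: the function names `backlog_bytes` and `to_mb`, the file names, and the sizes are all hypothetical, and the inputs are assumed to have already been read from SHOW BINARY LOGS, the primary's status statement, and SHOW REPLICA STATUS.

```python
from typing import List, Tuple

MB = 1024 * 1024  # the card treats 1 GB as 1024 MB


def backlog_bytes(binlogs: List[Tuple[str, int]],
                  write_file: str, write_pos: int,
                  read_file: str, read_pos: int) -> int:
    """Bytes the primary has written that one replica has not yet read.

    binlogs: (file name, size in bytes) rows, in SHOW BINARY LOGS order.
    write_file/write_pos: the primary's current write file and position.
    read_file/read_pos: the replica's Source_Log_File and Read_Source_Log_Pos.
    """
    names = [name for name, _ in binlogs]
    if read_file not in names:
        # The file the replica still needs is gone from the primary.
        raise LookupError(f"{read_file} is no longer on the primary")
    start = names.index(read_file)
    end = names.index(write_file)
    if start == end:
        # Replica is reading the file the primary is currently writing.
        return write_pos - read_pos
    total = binlogs[start][1] - read_pos                       # rest of the replica's file
    total += sum(size for _, size in binlogs[start + 1:end])   # whole files in between
    total += write_pos                                         # written part of the active file
    return total


def to_mb(n_bytes: int) -> float:
    return n_bytes / MB


# Hypothetical readings: three binlog files, replica part-way through the first.
binlogs = [("binlog.000041", 512 * MB),
           ("binlog.000042", 512 * MB),
           ("binlog.000043", 300 * MB)]
gap = backlog_bytes(binlogs, "binlog.000043", 300 * MB, "binlog.000041", 412 * MB)
print(to_mb(gap))  # 100 + 512 + 300 MB
```

Running this per replica and taking the maximum gives the worst-replica figure the card reports.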
Worked example
A platform team runs a primary on MySQL 8.0 with two replicas, one of which (Replica B) is under-provisioned and applies single-threaded. A bulk catalogue re-import kicks off at 22:00 on 11 Jun 26, writing heavily for an hour. Snapshot series:

| Sample time | Primary write rate | Replica B read pos vs primary | Backlog (MB) |
|---|---|---|---|
| 21:55 | 8 MB/min | 12 MB behind | 12 |
| 22:05 | 95 MB/min | 240 MB behind | 240 |
| 22:20 | 95 MB/min | 690 MB behind | 690 |
| 22:35 | 95 MB/min | 1,180 MB behind | 1,180 |
| 22:50 | 95 MB/min | 1,610 MB behind | 1,610 |

The card crosses the > 1GB threshold by the 22:35 sample and reads 1,610 MB backlog on Replica B by 22:50, amber and climbing. Replica A (multi-threaded, well-provisioned) stays under 50 MB throughout, so the card reports B’s figure because B is the worst case.
The DBA’s read:
- The backlog is growing, not just large. A steady 1,610 MB would be a tolerated batch artefact. A backlog rising roughly 430 to 490 MB every 15 minutes means B is consuming binlog slower than the primary produces it: it is losing ground and will keep losing ground until the write burst ends or B speeds up.
- The real danger is binlog retention. If binlog_expire_logs_seconds is set low, the primary may purge a binlog file before B has read it. The instant that happens, B’s IO thread errors with 1236 (“could not find first log file”) and replication breaks, requiring a full re-clone. Backlog growth is the early warning for this.
- Two levers, short and long. Short term: ensure the primary retains binlogs long enough to cover the catch-up (raise binlog_expire_logs_seconds, and do not let disk pressure force an early purge). Long term: enable multi-threaded apply on B (replica_parallel_workers) or right-size the instance so a write burst like this drains in minutes, not hours.
- Volume and time are different lenses on the same lag. Time (Seconds_Behind_Source) tells you how stale the replica’s data is; volume (backlog MB) tells you how much work is queued and, crucially, how close you are to a purge-induced break. Watch both.
- The slope is the signal. A large but flat backlog during a known batch is tolerable. A backlog with a positive slope is a replica losing the race, and it will not fix itself until the write rate drops or apply speed rises.
- Backlog is the leading indicator of a 1236 break. The most damaging replication failure (binlog purged before the replica read it) is preceded by exactly this growth. Acting on the amber here prevents a re-clone later.
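
To make the slope concrete, the snapshot series from the worked example can be run through a short Python sketch. The helper names `slopes` and `crossing_minute` are hypothetical, and the crossing time assumes the backlog grows linearly between samples, which the card itself does not claim.

```python
# (minutes since the 21:55 sample, backlog in MB) from the snapshot series
samples = [(0, 12), (10, 240), (25, 690), (40, 1180), (55, 1610)]
THRESHOLD_MB = 1024


def slopes(points):
    """Backlog growth in MB per minute between consecutive samples."""
    return [(b1 - b0) / (t1 - t0)
            for (t0, b0), (t1, b1) in zip(points, points[1:])]


def crossing_minute(points, threshold):
    """Minute at which the backlog first exceeds the threshold,
    linearly interpolated between the two surrounding samples."""
    for (t0, b0), (t1, b1) in zip(points, points[1:]):
        if b0 <= threshold < b1:
            return t0 + (threshold - b0) * (t1 - t0) / (b1 - b0)
    return None  # threshold never crossed in this series


growth = slopes(samples)
print([round(g, 1) for g in growth])            # MB/min per interval
print(crossing_minute(samples, THRESHOLD_MB))   # minutes after 21:55
```

Every interval has a positive slope, which is the shape the list above warns about. Subtracting the slope from the 95 MB/min write rate gives a rough estimate of how fast Replica B is actually consuming binlog during the burst.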
Sibling cards
| Card | Why pair it with Binlog Backlog | What the combination tells you |
|---|---|---|
| Replication Lag (Seconds_Behind_Source) | The time-based view of the same lag. | Volume rising with time confirms genuine fall-behind; volume rising while time looks flat warns of an impending purge break. |
| Replication Threads Stopped or Lag Exceeds Threshold | The hero alert for thread health. | A stopped thread makes backlog grow without bound; this card quantifies how far behind the frozen replica has fallen. |
| Replication Thread Health (IO/SQL) | Distinguishes fetch lag from apply lag. | IO ahead of SQL means the data is fetched but not applied; backlog measured at the IO position shows fetch-side queue. |
| Active Replicas | The set the backlog is measured against. | The card reports the worst replica; this confirms how many replicas exist and which is lagging. |
| Database Disk Usage % | Binlog retention competes for disk. | High disk usage can force early binlog purge, which is what turns backlog growth into a 1236 break. |
| Queries per Second (live) | The write pressure feeding the backlog. | A QPS/write burst explains backlog growth and predicts when it will drain. |
| Last Successful Backup (hours ago) | The fallback if a replica must be re-cloned. | If backlog leads to a break, backup freshness sets the cost of rebuilding the replica. |
Reconciling against the source
Where to look in MySQL itself:

- On the primary: SHOW BINARY LOGS; for every binlog file and its size, and SHOW MASTER STATUS; (or SHOW BINARY LOG STATUS; on 8.4+) for the current write file and position.
- On each replica: SHOW REPLICA STATUS\G for Source_Log_File and Read_Source_Log_Pos (the fetch position the backlog is measured against).
- SELECT * FROM performance_schema.replication_connection_status\G for the modern view of the IO thread’s progress.
- SHOW VARIABLES LIKE 'binlog_expire_logs_seconds'; and SHOW VARIABLES LIKE 'max_binlog_size'; to understand retention and file rollover.

To compute the byte gap by hand, sum the sizes of the binlog files between the replica’s Source_Log_File/Read_Source_Log_Pos and the primary’s current file/position.
Why our number may legitimately differ from a manual calculation:
| Reason | Direction | Why |
|---|---|---|
| Read vs apply position | Card uses read pos | The card measures against the IO thread’s read position (purge safety); a calculation using the SQL apply position would show a larger gap. |
| Worst-replica selection | Card = max | The card reports the slowest replica; a check against a faster replica will show less backlog. |
| In-flight binlog cache | Card may lag a hair | Data still in the primary’s binlog cache before flush is not yet in SHOW BINARY LOGS sizes. |
| Sampling moment | Marginal | During a heavy write burst the figure moves between samples; the card shows the value at sample time. |
On Amazon RDS and Aurora, use the CloudWatch metrics ReplicaLag/AuroraReplicaLag (time-based) plus the binlog retention setting (call mysql.rds_set_configuration('binlog retention hours', N) on RDS) as the proxy. On Google Cloud SQL, replication is monitored via seconds_behind_master; binlog volume is inferred from binary-log storage growth. On all managed services the purge-safety concern still applies: ensure binlog retention exceeds your worst replica’s catch-up time.
Known limitations / FAQs
Why measure backlog in MB when I already have Seconds_Behind_Source?
They answer different questions. Seconds_Behind_Source tells you how stale the replica’s data is in time. Backlog MB tells you how much binlog volume is queued, which is what governs purge safety. A replica can be only a few seconds behind in time yet have hundreds of MB queued during a write burst, and it is the volume, not the seconds, that determines whether the primary can safely purge a binlog file. Reading both gives you the complete picture.
The backlog is large but flat. Is that a problem?
Usually not. A large, stable backlog during a known heavy-write window (a batch import, a bulk update) is the replica working through a queue at a steady rate. The dangerous shape is a growing backlog, which means the replica is consuming slower than the primary produces and will keep falling behind. Watch the slope, not just the height.
How does a growing backlog actually break replication?
If the backlog grows large enough that the primary purges a binlog file the replica has not yet read, often because binlog_expire_logs_seconds is too low or disk pressure forces an early purge, the replica’s IO thread errors with 1236 (“could not find first log file”). Replication then stops dead and the replica must be re-cloned. Backlog growth is the early warning; raising binlog retention and speeding up apply prevents the break.
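The purge condition can be checked mechanically by comparing the file a replica is reading with the files the primary still holds. The sketch below is illustrative only: the function `purge_risk` and its three labels are hypothetical, and the "at-risk" rule (replica still reading the oldest retained file) is a simple heuristic assumed here, not something the card is documented to compute.

```python
def purge_risk(primary_files, replica_read_file):
    """Classify one replica against the primary's retained binlogs.

    primary_files: file names from SHOW BINARY LOGS, oldest first.
    replica_read_file: the replica's Source_Log_File.
    """
    if replica_read_file not in primary_files:
        # The file the replica needs has already been purged:
        # its IO thread can be expected to fail with error 1236.
        return "broken"
    oldest, active = primary_files[0], primary_files[-1]
    if replica_read_file == oldest and oldest != active:
        # The replica is on the file that is next in line to be purged.
        return "at-risk"
    return "ok"


files = ["binlog.000041", "binlog.000042", "binlog.000043"]
for needed in ("binlog.000040", "binlog.000041", "binlog.000043"):
    print(needed, purge_risk(files, needed))
```

A replica that reports "at-risk" while its backlog is still growing is the case where raising binlog retention is most urgent.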
Why does the card report the worst replica rather than an average?
Because the slowest replica is the one that matters for both risk concerns: it gates how aggressively the primary can purge binlogs (you must retain anything the slowest replica has not read), and it is the worst failover candidate. An average would hide a single badly lagging replica behind several healthy ones. The card surfaces the worst case so you act on the real risk.
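The effect is easy to see with the 22:50 figures from the worked example. In this small Python sketch, Replica A's 50 MB is an assumed value (the example only says it stays under 50 MB), and the replica names are hypothetical labels.

```python
THRESHOLD_MB = 1024

# Backlog per replica in MB; replica_a uses the quoted upper bound of 50 MB.
backlog_mb = {"replica_a": 50, "replica_b": 1610}

worst_name = max(backlog_mb, key=backlog_mb.get)
worst = backlog_mb[worst_name]
average = sum(backlog_mb.values()) / len(backlog_mb)

print(worst_name, worst, worst > THRESHOLD_MB)   # what the card reports
print(average, average > THRESHOLD_MB)           # what an average would report
```

With these numbers the average sits below the 1GB threshold while the worst replica is well above it, so an average-based card would stay quiet during exactly the situation it exists to flag.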
Can I reduce backlog by speeding up the replica?
Yes, that is the primary long-term lever. Enable multi-threaded apply with replica_parallel_workers (and replica_parallel_type = LOGICAL_CLOCK) so the replica applies independent transactions in parallel instead of single-threaded. Right-sizing the replica’s CPU and IO, and ensuring it is not also serving heavy read traffic during write bursts, also help. On the primary side, breaking very large transactions into smaller batches reduces the apply stalls that let backlog build.
Does a stopped replica thread show up here?
Yes, and dramatically. If a replica’s IO or SQL thread stops, it stops consuming binlog entirely, so the backlog against it grows without bound until someone intervenes. This card will climb steadily; cross-reference Replication Threads Stopped or Lag Exceeds Threshold, which fires the hero alert on the stopped thread itself.
Can I change the 1GB threshold?
Yes, it is configurable per profile in the Sensitivity tab. Instances that routinely run large overnight batches may have a higher healthy backlog and want the threshold raised so the card only fires on genuine fall-behind. Set it above your normal batch peak but well below the point where binlog retention would risk a purge break.