At a glance
An alert card that fires when any replica-set member falls more than 10 seconds behind the primary, or reports a stateStr of RECOVERING. Lag is the gap between the primary’s latest write and the secondary’s last applied write, derived from rs.status() optimes. A RECOVERING member is one that cannot currently serve reads or be elected primary because it is catching up or rebuilding. Either condition means your replica set has lost redundancy: a secondary that is too far behind cannot be safely promoted on failover, and a recovering member is effectively out of the set. For a platform team this is a durability and availability warning: the cluster is one more failure away from trouble.
| What it tracks | Active alerts where any member’s replication lag exceeds 10 seconds, or any member reports stateStr = RECOVERING. Each entry lists the member, its state, and its current lag. |
| Data source | rs.status() member documents: optimeDate per member (lag = primary optime minus member optime) and stateStr. |
| Time window | RT (real-time, evaluated on every live poll). |
| Alert trigger | any member lag >10s OR stateStr=RECOVERING. |
| Roles | DBA, platform, SRE |
Calculation
The card reads rs.status() and evaluates two independent conditions across every member:
Lag is computed by taking the primary’s latest operation time (optimeDate) and subtracting each secondary’s last applied operation time. The result is how many seconds of writes the secondary has not yet replayed. A healthy secondary on a healthy network sits at well under a second of lag; sustained lag above 10 seconds means the secondary cannot keep up with the primary’s write rate, or the link between them is degraded.
State is read directly from each member’s stateStr. The normal healthy states are PRIMARY and SECONDARY. RECOVERING is a transient-but-concerning state: the member is applying its oplog to catch up after a restart, a network partition, or because it fell so far behind that it had to resync. While RECOVERING, the member does not serve reads and cannot be elected primary, so it does not count as a healthy failover candidate.
Either condition alone raises the alert because both reduce the effective health of the set. A laggy secondary risks data loss on failover (promoting it loses the un-replicated writes); a recovering member shrinks the pool of nodes available to take over. The card lists which members are affected so you can tell whether you are one failure from degraded or one failure from down.
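The two checks can be sketched against an rs.status()-shaped document. This is a minimal illustration, not the card's own code: evaluateReplicaSet, LAG_THRESHOLD_SECONDS, and the sample member names and timestamps are all made up for the example. Only the name, stateStr, and optimeDate fields are taken from rs.status() output.

```javascript
// Illustrative names; the card's real implementation is not shown in this document.
const LAG_THRESHOLD_SECONDS = 10;

function evaluateReplicaSet(status, thresholdSeconds = LAG_THRESHOLD_SECONDS) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  return status.members.map((m) => {
    // Lag is only computed for SECONDARY members, and only when a primary is visible.
    const lagSeconds =
      primary && m.stateStr === "SECONDARY"
        ? (primary.optimeDate.getTime() - m.optimeDate.getTime()) / 1000
        : null;
    const reasons = [];
    if (lagSeconds !== null && lagSeconds > thresholdSeconds) reasons.push("lag");
    if (m.stateStr === "RECOVERING") reasons.push("state");
    return { name: m.name, stateStr: m.stateStr, lagSeconds, reasons };
  });
}

// Made-up sample: node-b is 12 seconds behind, node-c is RECOVERING.
const sample = {
  members: [
    { name: "node-a:27017", stateStr: "PRIMARY", optimeDate: new Date("2026-01-01T00:00:40Z") },
    { name: "node-b:27017", stateStr: "SECONDARY", optimeDate: new Date("2026-01-01T00:00:28Z") },
    { name: "node-c:27017", stateStr: "RECOVERING", optimeDate: new Date("2026-01-01T00:00:00Z") },
  ],
};

// Either condition alone is enough to list the member.
const alerting = evaluateReplicaSet(sample).filter((m) => m.reasons.length > 0);
console.log(alerting);
```

In mongosh, passing rs.status() in place of the sample should work, since optimeDate is returned as a date value there.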
Worked example
A platform team runs a three-node replica set (one primary, two secondaries) behind an order service. Snapshot taken on 09 Jun 26 at 02:55 BST, during an overnight bulk-import job.

| Member | stateStr | optime lag vs primary | Health |
|---|---|---|---|
| mongo-prod-01 | PRIMARY | 0s | healthy |
| mongo-prod-02 | SECONDARY | 34s | alerting (lag) |
| mongo-prod-03 | RECOVERING | n/a | alerting (state) |
mongo-prod-02 is 34 seconds behind, and mongo-prod-03 is RECOVERING. With the primary healthy but both secondaries impaired, the set has effectively lost its redundancy: if the primary failed right now, neither secondary could be safely and immediately promoted.
What the team reads from this:
- The lag on mongo-prod-02 is import-driven. The 34-second lag started when the overnight bulk-import job began hammering the primary with writes. The secondary is replaying the oplog as fast as its disk allows but cannot keep pace with the import burst. This usually self-resolves once the import finishes, but while it persists the secondary is not a safe failover target.
- mongo-prod-03 being RECOVERING is the bigger concern. A member enters RECOVERING after a restart or after falling so far behind that it must catch up from the oplog (or resync entirely if the oplog has rolled over). The team needs to confirm it is making forward progress and not stuck; a member stuck in RECOVERING because the primary’s oplog no longer covers its last optime requires an initial sync to recover.
- The set is one failure from degraded service. With only the primary fully healthy, a primary failure now would either fail over to a still-catching-up node (risking lost writes) or, worse, leave no eligible voting majority to elect a new primary. This is the durability warning the card exists to raise.
- Lag and RECOVERING are both redundancy losses, treated as one alert. They have different causes but the same consequence: a member that cannot safely take over. The card fires on either so you never miss a degraded set.
- Bulk writes are the most common lag driver. Imports, migrations, and large batch updates flood the oplog faster than secondaries can replay it. Throttling the write job or running it during a low-traffic window keeps secondaries in step.
- A stuck RECOVERING member needs hands-on intervention. If the member is not advancing toward SECONDARY, check whether the primary’s oplog still covers its last optime. If not, an initial sync (resync) is required, and the oplog should be sized larger to prevent recurrence.
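The oplog-coverage check for a stuck member comes down to one comparison. The sketch below assumes you have read two timestamps by hand: the oplog first event time reported by rs.printReplicationInfo() on the sync source, and the stuck member's optimeDate from rs.status(). The helper name and the sample timestamps are hypothetical, and comparing wall-clock times is a simplification of how MongoDB itself compares optimes.

```javascript
// Hypothetical helper: can the member still catch up incrementally from the oplog?
function canCatchUpFromOplog(oplogFirstEventTime, memberLastOptime) {
  // The member needs every operation after its last applied optime. If the oldest
  // entry still in the oplog is newer than that point, there is a gap it cannot fill.
  return oplogFirstEventTime.getTime() <= memberLastOptime.getTime();
}

// Made-up timestamps for illustration.
const oplogFirstEventTime = new Date("2026-01-01T06:00:00Z");

// Member last applied at 09:30, inside the oplog window: incremental catch-up is possible.
console.log(canCatchUpFromOplog(oplogFirstEventTime, new Date("2026-01-01T09:30:00Z"))); // true

// Member last applied at 02:00, before the oplog window starts: initial sync required.
console.log(canCatchUpFromOplog(oplogFirstEventTime, new Date("2026-01-01T02:00:00Z"))); // false
```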
Sibling cards
| Card | Why pair it with Replica Set Member Lag / RECOVERING | What the combination tells you |
|---|---|---|
| Replica Lag (seconds) | The continuous lag gauge this alert thresholds. | Watch lag trend toward 10s before the alert fires; gives early warning during imports. |
| Replica Set Members (state) | The full per-member state table. | Shows every member’s stateStr at once so you can see how much of the set is healthy. |
| Elections (24h) | Lag and RECOVERING members often precede or follow elections. | Frequent elections plus lagging members equals an unstable set flapping its primary. |
| Query Error Rate Spike (>1% in 5m) | Elections during failover produce NotWritablePrimary errors. | An error spike that coincides with a RECOVERING member is failover-driven, not a code bug. |
| Operations per Second (live) | High write throughput drives secondary lag. | An ops spike on the primary that coincides with rising lag means the secondaries cannot keep pace. |
| Chunks Pending Migration | On sharded clusters, each shard is its own replica set. | Pending migrations plus member lag equals a shard whose replica set is struggling under balancer load. |
| MongoDB Health Score | The composite that weights replica-set health. | Any lagging or recovering member pulls the overall score down. |
Reconciling against the source
Where to look in MongoDB’s own tooling:
Run rs.status() in mongosh connected to any member. The members array gives each node’s stateStr and optimeDate; subtract a secondary’s optimeDate from the primary’s to get its lag. Run rs.printSecondaryReplicationInfo() for a human-readable per-secondary lag summary. Inspect the oplog window with rs.printReplicationInfo() to confirm the oplog still covers a recovering member’s last optime; if not, an initial sync is required. On MongoDB Atlas, the Metrics tab exposes Replication Lag and Replication Oplog Window charts per node, and Atlas raises its own native replica-set health alerts.
Why our number may legitimately differ from a manual reading:
| Reason | Direction | Why |
|---|---|---|
| Sample timing | Either | Lag moves second to second under write load; our poll and your rs.status() run are taken at slightly different instants. |
| Clock skew | Either | Lag is computed from optime timestamps; if node clocks drift, the computed lag can read slightly high or low. Keep NTP in sync. |
| Which node you query | Either | rs.status() reports the querying node’s view of the set; a partitioned node may report stale state for members it cannot reach. |
| State transience | Our card may show RECOVERING briefly | A member can flick through RECOVERING for seconds during a normal restart; the card surfaces it, which is correct, even though it may clear on its own. |
| Card | Expected relationship | What causes divergence |
|---|---|---|
| shopify.total_revenue / bigcommerce.total_revenue | An election triggered by an unhealthy set can cause a brief order-failure dip. | A revenue dip coinciding with a RECOVERING member and an election equals failover impact on the storefront. |
| Application read-preference behaviour | Apps reading from secondaries may serve stale data while lag is high. | If users report stale reads during a lag alert, the app is reading from a lagging secondary; pin critical reads to the primary. |
Known limitations / FAQs
A secondary briefly went RECOVERING during a routine restart and the alert fired. Is that a false alarm?
Not a false alarm, but expected behaviour. A member legitimately passes through RECOVERING while it applies its oplog to catch up after a restart. The card surfaces it because, during that window, the member genuinely cannot serve reads or be safely promoted. If it clears back to SECONDARY within a minute or two, no action is needed beyond noting it. A member stuck in RECOVERING is the real problem.
My lag spikes every night during a batch import. How do I stop the alert?
The lag is real: your secondaries cannot replay the oplog as fast as the import writes it. Options, in order of preference: throttle the import (smaller batches, pauses between them), run it during a lower-traffic window, or increase secondary I/O capacity. As a last resort you can widen the sensitivity threshold, but that hides a genuine redundancy gap during the import.
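The first option, throttling, can be sketched as a loop that writes small batches and pauses between them. Everything here is illustrative: throttledImport, insertBatch, and the batchSize and pauseMs defaults are assumptions for the example, and the right values depend on how your secondaries' lag responds.

```javascript
// Small promise-based delay helper.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// insertBatch is any async function that writes one batch of documents.
async function throttledImport(docs, insertBatch, { batchSize = 500, pauseMs = 200 } = {}) {
  let batches = 0;
  for (let i = 0; i < docs.length; i += batchSize) {
    await insertBatch(docs.slice(i, i + batchSize));
    batches += 1;
    // Pause between batches (not after the last) so secondaries can replay the oplog.
    if (i + batchSize < docs.length) await sleep(pauseMs);
  }
  return batches;
}
```

With the official Node.js driver, insertBatch could be something like `(batch) => collection.insertMany(batch)`; watch the lag gauge while the job runs and lengthen the pause if lag keeps climbing.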
What is the difference between a lagging secondary and a RECOVERING one?
A lagging SECONDARY is still applying the oplog and serving reads; it is just behind. A RECOVERING member is not serving reads at all; it is catching up before it can rejoin as a healthy secondary. Lag is a matter of degree; RECOVERING is a binary “this node is out of service right now”. Both reduce failover safety, which is why the card fires on either.
Why is 10 seconds the threshold?
Sub-second lag is normal and healthy. Single-digit seconds of lag is tolerable during write bursts. Beyond about 10 seconds, the secondary is far enough behind that promoting it on failover would lose a meaningful number of writes, and stale reads from that secondary become noticeable. 10 seconds is the line where lag stops being routine and starts threatening durability. It is configurable per profile in the Sensitivity tab.
A member is stuck in RECOVERING and not catching up. What now?
Check whether the primary’s oplog still covers the member’s last applied optime, using rs.printReplicationInfo(). If the oplog has rolled past that point, the member cannot catch up incrementally and needs an initial sync (a full resync from another member). To prevent recurrence, size the oplog larger so it covers longer outages, and investigate why the member fell so far behind in the first place.
Does high lag mean I will lose data?
Only if the primary fails while a secondary is lagging and that secondary is promoted before catching up. With majority write concern (w: majority, the implicit default since MongoDB 5.0), acknowledged writes are safe because they are confirmed on a majority before the client is told they succeeded. Lag is most dangerous for writes made with weaker write concern, or for reads from a lagging secondary returning stale data. Use w: majority for durability-critical writes and pin critical reads to the primary.
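As a rough sketch of that last recommendation, the two settings can be passed per operation. This assumes the official Node.js driver's Collection API (insertOne and findOne with an options argument); the orders collection handle, the function names, and the document shape are hypothetical.

```javascript
// Per-operation options; these could also be set on the client or collection instead.
const DURABLE_WRITE = { writeConcern: { w: "majority" } };
const PRIMARY_READ = { readPreference: "primary" };

async function placeOrder(orders, order) {
  // Acknowledged only once a majority of the set's voting, data-bearing members have the write.
  return orders.insertOne(order, DURABLE_WRITE);
}

async function loadOrder(orders, orderId) {
  // Reading from the primary avoids stale results from a lagging secondary.
  return orders.findOne({ _id: orderId }, PRIMARY_READ);
}
```

Setting these per call keeps the stricter guarantees limited to the operations that need them, at the cost of remembering to pass the options each time.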
Does this card work on a sharded cluster?
Yes. Each shard in a sharded cluster is itself a replica set, and the card evaluates the lag and state of every member across every shard. A RECOVERING member or a laggy secondary on any single shard raises the alert for that shard, so you can pinpoint which shard’s replica set is impaired rather than treating the whole cluster as one unit.
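The per-shard behaviour can be pictured as running the same two checks once per shard and keeping the results separate. The sketch below is an assumption about shape only: impairedShards, the shard names, and the input map of shard name to rs.status()-like document are invented for illustration.

```javascript
// Returns { shardName: [affected member names] } for shards with at least one impaired member.
function impairedShards(statusByShard, thresholdSeconds = 10) {
  const result = {};
  for (const [shard, status] of Object.entries(statusByShard)) {
    const primary = status.members.find((m) => m.stateStr === "PRIMARY");
    const affected = status.members
      .filter((m) => {
        if (m.stateStr === "RECOVERING") return true;
        if (m.stateStr !== "SECONDARY" || !primary) return false;
        // Date subtraction yields milliseconds.
        return (primary.optimeDate - m.optimeDate) / 1000 > thresholdSeconds;
      })
      .map((m) => m.name);
    if (affected.length > 0) result[shard] = affected;
  }
  return result;
}

// Made-up data: shardA is healthy; shardB has a 25-second laggard and a RECOVERING member.
const at = (s) => new Date(Date.UTC(2026, 0, 1, 0, 0, s));
const statusByShard = {
  shardA: { members: [
    { name: "a1:27017", stateStr: "PRIMARY", optimeDate: at(30) },
    { name: "a2:27017", stateStr: "SECONDARY", optimeDate: at(29) },
  ] },
  shardB: { members: [
    { name: "b1:27017", stateStr: "PRIMARY", optimeDate: at(30) },
    { name: "b2:27017", stateStr: "SECONDARY", optimeDate: at(5) },
    { name: "b3:27017", stateStr: "RECOVERING", optimeDate: at(0) },
  ] },
};

const impaired = impairedShards(statusByShard);
console.log(impaired);
```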