At a glance
Replication Thread Health (IO/SQL) is the binary “is replication actually running?” check. A MySQL replica runs two threads: the IO (receiver) thread that pulls binlog events from the source into a local relay log, and the SQL (applier) thread that replays those events into the replica’s data. Both must report Yes. If either is No, replication is broken: the replica stops receiving or stops applying changes, and from that moment it drifts further from the source with every write. This is the card that catches replication failure the instant it happens, before lag has time to balloon.
| What it tracks | The running state of both replication threads on each replica: Replica_IO_Running and Replica_SQL_Running (named Slave_IO_Running / Slave_SQL_Running before MySQL 8.0.22). Both must be Yes for the card to read healthy. |
| Data source | SHOW REPLICA STATUS on each replica, or performance_schema.replication_connection_status (IO thread) and replication_applier_status (SQL thread). The engine reports the worst node. |
| Time window | RT (real-time, sampled every refresh so a stopped thread surfaces within one cycle). |
| Alert trigger | IO_THREAD or SQL_THREAD stopped. If either thread on any replica is not Yes, the card turns red and a Nerve Centre alert is raised. |
| Why it matters | This is MySQL-distinctive: either thread stopping means replication is broken, full stop. A stopped IO thread means no new events arrive; a stopped SQL thread means events arrive but are never applied. Both lead to silent, growing data divergence and an unsafe failover target. |
| Reading the value | Read with Replication Lag. A stopped thread makes lag report NULL; the thread card tells you which thread and usually why (it carries the last error). |
| Sentiment key | mysql_replication_thread_health |
| Roles | owner, engineering, operations |
Calculation
The card is a logical AND across both threads on every replica. There is no averaging and no percentile: replication is either running or it is not.
- Connecting is not Yes. The IO thread can sit in a Connecting state when it is trying and failing to reach the source (bad credentials, network, or the source is down). The engine treats anything other than Yes as not healthy, so a flapping Connecting thread raises the alert rather than being read as a transient.
- The error fields carry the cause. When the SQL thread stops, Last_SQL_Error and Last_SQL_Errno explain why (a duplicate-key collision, a missing table, a statement the replica cannot apply). When the IO thread stops, Last_IO_Error explains the connection problem. The engine surfaces these so the alert is actionable, not just “broken”.
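The rule above can be sketched in a few lines of Python. This is a minimal illustration, not the engine's implementation: it assumes each replica's SHOW REPLICA STATUS row has already been fetched into a dict, and the function names (`replica_problems`, `card_state`) and the sample error text are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Field names differ by server version; accept both spellings.
IO_FIELDS = ("Replica_IO_Running", "Slave_IO_Running")
SQL_FIELDS = ("Replica_SQL_Running", "Slave_SQL_Running")


def _first(row: Dict[str, str], names: Tuple[str, ...]) -> Optional[str]:
    """Return the first of the given fields present in the row, else None."""
    for name in names:
        if name in row:
            return row[name]
    return None


def replica_problems(row: Dict[str, str]) -> List[str]:
    """List the threads on one replica that are not reporting Yes."""
    problems = []
    # Anything other than "Yes" fails, so "Connecting" counts as stopped.
    if _first(row, IO_FIELDS) != "Yes":
        problems.append("IO_THREAD: " + (row.get("Last_IO_Error") or "no error recorded"))
    if _first(row, SQL_FIELDS) != "Yes":
        problems.append("SQL_THREAD: " + (row.get("Last_SQL_Error") or "no error recorded"))
    return problems


def card_state(replicas: Dict[str, Dict[str, str]]) -> Tuple[bool, Dict[str, List[str]]]:
    """Logical AND across every replica: (healthy, problems keyed by node)."""
    broken = {}
    for node, row in replicas.items():
        problems = replica_problems(row)
        if problems:
            broken[node] = problems
    return (not broken, broken)


# Illustrative rows only; the error text is a placeholder.
snapshot = {
    "replica-a": {"Replica_IO_Running": "Yes", "Replica_SQL_Running": "No",
                  "Last_SQL_Error": "placeholder error text"},
    "replica-b": {"Replica_IO_Running": "Yes", "Replica_SQL_Running": "Yes"},
}
healthy, broken = card_state(snapshot)
```

With this snapshot, `healthy` is False and `broken` names only replica-a, which mirrors the worst-node behaviour: one bad thread on one replica is enough to fail the whole check.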
Worked example
A platform team runs a MySQL 8.0 source with two read replicas. Snapshot taken on 18 Apr 26 at 02:15 BST, shortly after a nightly maintenance job ran a manual DELETE directly on replica-a (a mistake: writes should only happen on the source).
| Node | Replica_IO_Running | Replica_SQL_Running | Last_SQL_Errno | Reading |
|---|---|---|---|---|
| replica-a | Yes | No | 1062 | SQL thread stopped on a duplicate-key error. |
| replica-b | Yes | Yes | 0 | Healthy. |
- The IO thread is fine, the SQL thread is stopped. replica-a is still receiving events into its relay log, but it has stopped applying them. The replica will now fall further behind without bound until the SQL thread is restarted, because nothing is being applied (the lag metric itself reports NULL while a thread is stopped).
- Last_SQL_Errno is 1062, a duplicate-key error. The manual DELETE removed a row, then the binlog from the source tried to apply a DELETE for a row that no longer existed, or a later INSERT collided. The replica’s data and the source’s binlog have diverged.
- The decision is correctness over speed. Skipping the offending transaction (SET GLOBAL sql_slave_skip_counter) would restart replication but leave the replica permanently inconsistent with the source. The team instead rebuilds replica-a from a fresh source snapshot so it is guaranteed consistent, and routes its catalogue traffic to replica-b in the meantime.
Once the rebuild completes, both threads on replica-a report Yes, and the card clears. The follow-up is to lock down replica write access so application credentials cannot run DML directly against a replica.
Three takeaways:
- A stopped thread is a different emergency from high lag. Lag self-heals when the workload eases; a stopped thread never self-heals. It will sit broken until a human intervenes, and the divergence grows the whole time.
- The error number tells you whether you can recover or must rebuild. Transient errors (lost connection, source restart) usually resolve on a thread restart. Data-collision errors (1062 duplicate key, 1032 row not found) mean the replica has diverged, and the safe fix is a rebuild, not a skip.
- Never write to a replica. The single most common cause of a stopped SQL thread is a stray write directly on the replica. Enforce read_only / super_read_only and scope application credentials so only the source accepts DML.
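The second takeaway amounts to a small triage rule. The sketch below is one hedged way to write it down, using only the two error numbers named above; the function name `suggested_action` and the action labels are hypothetical, and any real runbook would cover more error codes.

```python
# Error numbers the text above names as data collisions.
DATA_COLLISION_ERRNOS = {
    1062: "duplicate key",
    1032: "row not found",
}


def suggested_action(thread: str, errno: int) -> str:
    """Map a stopped thread and its last error number to a first response.

    thread is "IO" or "SQL"; the return value is a label, not a command.
    """
    if thread == "SQL" and errno in DATA_COLLISION_ERRNOS:
        # The replica has diverged: skipping would leave it inconsistent.
        return "rebuild"
    if thread == "IO":
        # Usually a connection problem: fix the cause, then restart the thread.
        return "check-connection-and-restart"
    # Anything else needs a human to read the error before acting.
    return "investigate"
```

For the worked example, `suggested_action("SQL", 1062)` returns `"rebuild"`, matching the team's decision not to skip the transaction.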
Sibling cards
| Card | Why pair it with Replication Thread Health | What the combination tells you |
|---|---|---|
| Replication Lag (Seconds_Behind_Source) | Lag reports NULL the moment a thread stops. | Thread broken plus lag NULL equals replication is dead, not slow. |
| Active Replicas | The count of attached replicas. | A replica with a stopped IO thread may drop off the count entirely. |
| Binlog Backlog (MB) on Primary | The source-side binlog the IO thread consumes. | A stopped IO thread means the source’s binlog backlog grows because it is not being consumed. |
| Replication Threads Stopped or Lag Exceeds Threshold | The Nerve Centre alert feed for this exact condition. | The paging entry that wakes on-call when a thread stops. |
| MySQL Health Score | The composite that weights replication health heavily. | A broken thread should visibly drop the composite. |
| Last Successful Backup (hours ago) | Recovery depends on a recent, consistent backup. | A broken replica plus a stale backup equals limited recovery options. |
| MySQL Inventory Rows vs Ecom Inventory Count | The downstream drift if a replica serves storefront reads. | A stopped applier means the storefront reads ever-staler inventory. |
| Query Error Rate % | Apps may error when they hit a broken or stale replica. | Thread broken plus rising query errors equals the app is failing reads against the dead replica. |
Reconciling against the source
Where to look in MySQL’s own tooling:
- SHOW REPLICA STATUS\G on each replica is the canonical source. The fields that matter: Replica_IO_Running, Replica_SQL_Running, Last_IO_Error, Last_SQL_Error, and Last_SQL_Errno.
- Performance Schema for the structured view: performance_schema.replication_connection_status (IO/receiver) and performance_schema.replication_applier_status_by_worker (SQL/applier, per worker).
- The replica’s error log for the full stack around the failure, often with more context than the single Last_SQL_Error line.
- Managed-service consoles: Amazon RDS surfaces replication state and will raise a Replication event; Aurora reports replica state in the cluster members view. These should agree with this card.
Why our number may legitimately differ:
| Reason | Direction | Why |
|---|---|---|
| Connecting state | Engine stricter | The engine treats Connecting (IO thread retrying) as not healthy; a quick glance at the UI might read it as “trying”, but it is not Yes. |
| Multi-worker applier | Engine more precise | The single Replica_SQL_Running field can mask a single failed worker; the engine reads per-worker status from Performance Schema. |
| Auto-restart windows | Brief disagreement | Some managed services auto-restart a stopped thread; the card may show broken for one refresh before the platform recovers it. |
| Field naming | None if mapped | Pre-8.0.22 uses Slave_* names; the engine maps both, but a manual query on an old server uses the old field names. |
| Worst-node headline | Variable | The card reports the worst replica; a per-node manual check on a healthy replica will disagree with the broken-node headline. |
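The multi-worker row is easier to see with an example. The sketch below assumes the rows of performance_schema.replication_applier_status_by_worker have already been fetched into dicts keyed by column name (WORKER_ID, SERVICE_STATE, LAST_ERROR_NUMBER); the function names and sample rows are illustrative, and treating an empty result as unhealthy is an assumption of this sketch rather than documented engine behaviour.

```python
from typing import Dict, List


def failed_workers(worker_rows: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Return applier workers whose SERVICE_STATE is anything but ON."""
    return [row for row in worker_rows if row.get("SERVICE_STATE") != "ON"]


def applier_healthy(worker_rows: List[Dict[str, object]]) -> bool:
    """True only if at least one worker row exists and every worker is ON.

    Assumption: no rows at all is treated as not healthy in this sketch.
    """
    return bool(worker_rows) and not failed_workers(worker_rows)


# Illustrative rows: one of three workers has stopped on an error.
workers = [
    {"WORKER_ID": 1, "SERVICE_STATE": "ON", "LAST_ERROR_NUMBER": 0},
    {"WORKER_ID": 2, "SERVICE_STATE": "OFF", "LAST_ERROR_NUMBER": 1062},
    {"WORKER_ID": 3, "SERVICE_STATE": "ON", "LAST_ERROR_NUMBER": 0},
]
stopped_ids = [row["WORKER_ID"] for row in failed_workers(workers)]
```

Here `stopped_ids` holds only worker 2, so a per-worker check can point at the failing worker and its error number directly.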
Known limitations / FAQs
The card says broken but the replica seems to be serving reads fine. How?
A replica with a stopped SQL thread still answers queries; it just answers them from increasingly stale data. The IO thread may even still be running, filling the relay log. Everything looks alive until someone notices the data is old. That is precisely why this card exists: replication can be broken while the replica appears healthy.
Which is worse, a stopped IO thread or a stopped SQL thread?
Both are emergencies, but they fail differently. A stopped IO thread means no new events reach the replica at all (usually a connection, credential, or source-availability problem, often transient). A stopped SQL thread means events are arriving but cannot be applied (usually a data collision, which means the replica has diverged and may need a rebuild). The IO thread is more often recoverable with a restart; the SQL thread more often signals real divergence.
Can I just skip the bad transaction to get replication running again?
You can (sql_slave_skip_counter or injecting an empty GTID), and sometimes it is right for a known-benign event. But skipping a data-collision error leaves the replica permanently inconsistent with the source: every future read is potentially wrong, and the replica is unsafe to promote. For collision errors the safe path is a rebuild from a fresh source snapshot. Skip only when you understand exactly what you are dropping.
Why did my SQL thread stop with a duplicate-key (1062) error?
Almost always because something wrote directly to the replica. A row was inserted or deleted on the replica out of band, then the source’s binlog tried to apply a conflicting change. Enforce super_read_only on all replicas and scope application credentials so DML only ever hits the source.
The thread shows Connecting and never reaches Yes. What now?
The IO thread cannot establish a session with the source. Check, in order: network path and firewall to the source port, the replication user’s credentials and host grant, whether the source is up and accepting connections, and whether the source has purged a binlog the replica still needs (which requires re-seeding the replica). Last_IO_Error usually names the specific cause.
Does this card cover Group Replication or InnoDB Cluster?
This card reads the classic asynchronous / semi-synchronous replica threads. Group Replication exposes its member state through performance_schema.replication_group_members instead, which is a different model. If you run InnoDB Cluster, read member health from that table; the IO/SQL thread card applies to traditional source-replica topologies.
Can I change what counts as healthy?
The healthy condition (both threads Yes) is intrinsic to MySQL replication and is not configurable: there is no “partially running” that is safe. What you can configure in the Sensitivity tab is the paging behaviour and whether Connecting raises immediately or after a short grace period, useful if your network has known brief blips during source restarts.
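A grace period for Connecting can be pictured as the small decision function below. This is only a sketch of the idea: the parameter names (`seconds_in_state`, `grace_seconds`) are hypothetical and do not correspond to documented Sensitivity tab settings.

```python
def should_alert(io_state: str, seconds_in_state: float, grace_seconds: float = 0.0) -> bool:
    """Decide whether the IO thread state should raise an alert.

    "Yes" never alerts. "Connecting" alerts once it has lasted at least
    grace_seconds (0 means immediately). Any other state alerts at once.
    """
    if io_state == "Yes":
        return False
    if io_state == "Connecting":
        return seconds_in_state >= grace_seconds
    return True
```

With a 30-second grace period, `should_alert("Connecting", 5, 30)` is False and `should_alert("Connecting", 45, 30)` is True, while `should_alert("No", 0, 30)` is True straight away: the grace period only softens Connecting, never a hard stop.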