At a glance
The fraction of the last 5-minute window during which the Galera cluster was throttling writes via flow control, derived from wsrep_flow_control_paused. Galera keeps every node in sync, so when one node falls behind applying the write-set queue, it sends flow-control messages that pause writes cluster-wide until it catches up. A high value means a single slow node is acting as a brake on the entire cluster: every other node, however fast, is being held back to the speed of the slowest. For a DBA this is the early-warning signal that a node is struggling before it drops out entirely.
| Status variable | wsrep_flow_control_paused from SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused'. A float between 0 and 1: the fraction of time replication was paused since the counter was last reset. |
| Metric basis | Time-paused ratio, NOT a count. The card renders it as a percentage of the 5-minute window. It measures cluster-wide write throttling caused by flow control, not query latency. |
| Aggregation window | 5 minutes. The card reads the paused ratio over a rolling 5-minute window so a brief blip does not dominate. |
| Healthy value | At or near 0%. A well-balanced cluster spends essentially no time paused. |
| What drives it up | (1) A node with slow disk I/O applying write-sets too slowly; (2) a node mid-SST/IST as a joiner; (3) a node desynced as a donor; (4) an undersized gcs.fc_limit against a heavy write burst; (5) a long-running transaction blocking apply. |
| What does NOT drive it | Read load (reads do not replicate), async-replica lag, or router latency. Flow control is strictly about the synchronous write-set apply queue. |
| Time window | 5m (rolling 5-minute paused ratio) |
| Alert trigger | >10%. When the cluster spends more than a tenth of the window throttled, write throughput is materially degraded. |
| Roles | owner, engineering, operations |
Calculation
Galera maintains wsrep_flow_control_paused as the cumulative fraction of time replication has been paused since the value was last reset (a reset happens on FLUSH STATUS or instance restart). Because the raw variable is cumulative, the card samples it across the 5-minute window and reports the rate of pausing over that window rather than the lifetime average.
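The exact formula the card uses is not shown here, so the following is only a sketch of one way a windowed ratio could be recovered from two samples of the cumulative variable. It assumes each sample also records the seconds elapsed since the last reset (the `seconds_since_reset` field and the `Sample` type are hypothetical), so that cumulative paused time can be rebuilt as fraction times elapsed time.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    paused_fraction: float      # wsrep_flow_control_paused: cumulative, 0..1
    seconds_since_reset: float  # hypothetical: time since FLUSH STATUS or restart


def windowed_paused_ratio(start: Sample, end: Sample) -> float:
    """Fraction of the interval between two samples that was spent paused."""
    if end.seconds_since_reset < start.seconds_since_reset:
        # The counter was reset inside the window; only the post-reset part is known.
        return end.paused_fraction
    window = end.seconds_since_reset - start.seconds_since_reset
    if window <= 0:
        raise ValueError("samples must be taken at different times")
    # Rebuild cumulative paused seconds at each sample, then take the difference.
    paused_start = start.paused_fraction * start.seconds_since_reset
    paused_end = end.paused_fraction * end.seconds_since_reset
    ratio = (paused_end - paused_start) / window
    return min(1.0, max(0.0, ratio))  # clamp floating-point noise


# Illustrative numbers only: two samples taken 300 seconds apart.
start = Sample(paused_fraction=0.02, seconds_since_reset=3000)
end = Sample(paused_fraction=0.04, seconds_since_reset=3300)
print(f"lifetime average: {end.paused_fraction:.1%}")
print(f"last 5 minutes:   {windowed_paused_ratio(start, end):.1%}")
```

With these made-up samples the lifetime average reads 4% while the last 300 seconds work out to about 24%, which is why a manual read of the raw variable tends to look calmer than the card during a burst.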
The companion counter wsrep_flow_control_sent (the number of flow-control pause messages this node has sent) tells you which node is the brake, which is the first thing to check when this card alerts.
Worked example
A platform team runs a 3-node MariaDB Galera cluster. Two nodes sit on NVMe storage; a third, db-galera-03, was recently re-provisioned onto a slower network-attached volume to save cost. On 03 Jun 26 at 19:40 BST a nightly bulk-import job kicks off, generating a heavy write burst.

| Node | Storage | wsrep_flow_control_sent (last 5m) | Apply keeping up? |
|---|---|---|---|
| db-galera-01 | NVMe | 0 | Yes |
| db-galera-02 | NVMe | 0 | Yes |
| db-galera-03 | network volume | 412 | No, falling behind |
- One slow node is throttling the whole cluster. Only db-galera-03 is sending flow-control messages (412 of them). The two NVMe nodes could apply the import easily, but Galera pauses them so the slow node does not fall out of sync. The cluster is running at the speed of its weakest disk.
- The symptom is cluster-wide, the cause is local. Write latency is up on every node and the application sees slower commits everywhere, but the fix is on db-galera-03 alone: faster storage, a tuned innodb_flush_log_at_trx_commit, or a larger apply-thread count (wsrep_slave_threads).
- This is a precursor, not yet an outage. At 18% the cluster is degraded but whole. If db-galera-03 keeps slipping, it can eventually fall so far behind that it leaves the Primary Component, which would show up on Galera Cluster Size. Acting now (at the flow-control stage) avoids the harder node-loss recovery later.
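The per-node comparison in the table above can be sketched as a delta between two readings of wsrep_flow_control_sent. This is a minimal illustration, not the card's implementation: the function names are hypothetical, and the "before" counter values are invented so that the deltas match the table (0, 0 and 412).

```python
from typing import Dict, Optional


def flow_control_sent_delta(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Pause messages each node sent between two samples of wsrep_flow_control_sent."""
    deltas = {}
    for node, count in after.items():
        previous = before.get(node, 0)
        # A value lower than the previous one means the counter was reset; count from zero.
        deltas[node] = count - previous if count >= previous else count
    return deltas


def brake_node(deltas: Dict[str, int]) -> Optional[str]:
    """Node that sent the most pause messages, or None if no node sent any."""
    node, sent = max(deltas.items(), key=lambda item: item[1], default=(None, 0))
    return node if sent > 0 else None


# "before" values are invented; only the 5-minute deltas come from the table.
before = {"db-galera-01": 3, "db-galera-02": 0, "db-galera-03": 1000}
after = {"db-galera-01": 3, "db-galera-02": 0, "db-galera-03": 1412}

deltas = flow_control_sent_delta(before, after)
print(deltas)
print(brake_node(deltas))
```

Working on deltas rather than raw counters keeps an old incident from earlier in the counter's lifetime from pointing at the wrong node.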
Sibling cards to reference together
| Card | Why pair it with Flow Control Paused % | What the combination tells you |
|---|---|---|
| Galera Cluster Size | Sustained flow control can precede a node dropping out. | A rising pause % followed by a size drop means the slow node has finally left the cluster. |
| Galera Cluster Status | The outcome if a throttled node fully falls behind. | Pause % climbing toward a Non-Primary flip is the danger path. |
| Query Latency p95 (ms) | Flow control shows up as slower commits everywhere. | High pause % plus high write p95 means the cluster is throttled, not query-bound. |
| Connection Pool Saturation % | Paused writes hold connections open longer. | Pause % up plus pool saturation up means throttling is backing up the connection pool. |
| InnoDB / XtraDB Buffer Pool Hit Rate % | A cold or undersized node applies write-sets slowly. | Low hit rate on the slow node explains why it cannot keep up. |
| MariaDB Health Score | The composite that weights flow control. | Sustained pausing drags the composite down before any outage. |
| Slow-Query Rate % | Distinguishes throttling from genuinely slow queries. | High pause % with low slow-query rate confirms the cause is replication, not SQL. |
Reconciling against the source
Where to look in MariaDB’s own tooling:
- Run SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused'; for the cumulative paused ratio the card derives from.
- Run SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_sent'; and LIKE 'wsrep_flow_control_recv'; per node to find which node is sending the pause messages.
- Run SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue%'; to see the apply-queue backlog on the slow node.
- On a managed service, the provider’s Galera metrics view exposes the same flow-control series.
Why our number may legitimately differ from a manual query:
| Reason | Direction | Why |
|---|---|---|
| Cumulative vs windowed | Manual read usually lower | The raw variable is the lifetime average since reset; the card reports the 5-minute rate, which spikes higher during a burst than the long-run average. |
| Counter reset timing | Variable | FLUSH STATUS or a restart resets the cumulative counter, after which the raw variable starts near 0; the card’s windowed value is unaffected by where in the lifetime you read. |
| Which node you query | Can differ | Each node reports its own paused ratio; they are usually close but a donor node mid-SST reads higher. |
| Poll timing | Brief lag | A burst between polls is reflected on the next refresh. |

| Source | Expected relationship | What causes divergence |
|---|---|---|
| wsrep_flow_control_sent per node | The node with the highest sent count is the throttling node. | If every node sends roughly equal amounts, the bottleneck is shared (network or global write load), not a single slow node. |
| wsrep_local_recv_queue_avg | Should be elevated on the throttling node when pause % is high. | A high paused ratio with empty queues everywhere is unusual and worth deeper investigation. |
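The two cross-checks above can be combined into a rough triage rule. The sketch below is only an illustration: the function name, the `balance_ratio` and `queue_floor` cut-offs, and the queue value in the usage example are assumptions chosen for demonstration, not thresholds taken from Galera or from the card.

```python
def classify_bottleneck(sent, recv_queue_avg, balance_ratio=0.5, queue_floor=1.0):
    """Interpret per-node counters while the pause ratio is high.

    sent:           node -> wsrep_flow_control_sent over the window
    recv_queue_avg: node -> wsrep_local_recv_queue_avg
    balance_ratio, queue_floor: hypothetical cut-offs, tune for your cluster
    """
    if not sent or max(sent.values()) == 0:
        return "no flow-control messages sent"
    top_node = max(sent, key=sent.get)
    top = sent[top_node]
    # Shared bottleneck: every node sends a comparable number of pause messages.
    if len(sent) > 1 and all(count >= balance_ratio * top for count in sent.values()):
        return "shared bottleneck (network or global write load)"
    # A single brake should have a visibly backed-up receive queue.
    if recv_queue_avg.get(top_node, 0.0) < queue_floor:
        return f"{top_node} is sending pauses but its queue looks empty; investigate further"
    return f"{top_node} is the throttling node"


sent = {"db-galera-01": 0, "db-galera-02": 0, "db-galera-03": 412}
queues = {"db-galera-01": 0.0, "db-galera-02": 0.0, "db-galera-03": 37.5}  # illustrative values
print(classify_bottleneck(sent, queues))
```

Returning a short description rather than a boolean keeps the "unusual" case visible instead of folding it into either a pass or a fail.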
Known limitations / FAQs
The card shows 0% almost all the time. Is the metric working?
Yes, and 0% is the healthy reading. A balanced cluster with adequately provisioned nodes spends virtually no time in flow control. You should only see meaningful values during write bursts, joiner SST/IST, or when a node is genuinely struggling. A flat 0% means your nodes are keeping up with each other.
How do I find which node is causing the pause?
Compare wsrep_flow_control_sent across all nodes. The node sending the most flow-control messages is the brake: it is the one asking the others to slow down because its apply queue is backing up. Then check that node’s disk I/O, wsrep_local_recv_queue, and wsrep_slave_threads to understand why it cannot keep pace.
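Collecting the counter from every node can be scripted. The sketch below assumes you already hold one open Python DB-API 2.0 cursor per node (how the connections are made is left out, and the helper names are hypothetical) and that rows come back as (Variable_name, Value) tuples, which is the usual shape for tuple-style cursors.

```python
FLOW_CONTROL_QUERY = (
    "SHOW GLOBAL STATUS WHERE Variable_name IN "
    "('wsrep_flow_control_paused', 'wsrep_flow_control_sent', 'wsrep_flow_control_recv')"
)


def read_flow_control_status(cursor):
    """Return the three flow-control status variables from one node as floats."""
    cursor.execute(FLOW_CONTROL_QUERY)
    # Assumption: each row is a (Variable_name, Value) tuple with Value as text.
    return {name: float(value) for name, value in cursor.fetchall()}


def sent_by_node(cursors):
    """Map node name -> wsrep_flow_control_sent, given one open cursor per node."""
    return {
        node: int(read_flow_control_status(cursor)["wsrep_flow_control_sent"])
        for node, cursor in cursors.items()
    }
```

The query names the three variables explicitly rather than using a LIKE pattern, so any non-numeric flow-control variables a given Galera version may expose do not break the float conversion.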
Can I just turn flow control off to make the card green?
You can relax it by raising gcs.fc_limit, but effectively turning it off is dangerous. Flow control is what keeps nodes from diverging; without it a slow node falls arbitrarily far behind and either runs out of memory holding the receive queue or gets evicted from the cluster. The right fix is to make the slow node faster, not to remove the safety mechanism.
Is flow control the same as replication lag?
No, and the distinction matters. Galera is synchronous, so it does not have async-style lag; instead, when a node would lag, flow control pauses everyone so the apply queue drains. So in Galera you see flow control, not lag. Async replicas attached downstream are a different mechanism, tracked by Async Replication Lag (seconds).
A joiner node is causing 30% pause during SST. Is that an emergency?
It is expected, not an emergency, but it is worth managing. While a node performs a State Snapshot Transfer it cannot apply live write-sets, so flow control can spike. Use a non-blocking SST method (mariabackup) so the donor stays available, and schedule joins outside peak write windows. The pause should fall back to near 0% once the joiner reaches Synced.
Why a 5-minute window rather than real-time?
Flow control is bursty: a single large transaction can pause writes for a fraction of a second. A 5-minute window smooths those harmless blips while still catching a node that is persistently slow. The 10% alert threshold is set against this window so it fires on a real, ongoing brake rather than transient noise.
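The smoothing effect is easy to see with arithmetic. The sketch below uses the card's 300-second window and 10% trigger; the individual pause durations are invented for illustration and the function names are hypothetical.

```python
WINDOW_SECONDS = 300    # the card's 5-minute window
ALERT_THRESHOLD = 0.10  # the card's >10% alert trigger


def paused_ratio(pause_durations_s):
    """Fraction of the window spent paused, given individual pause lengths in seconds."""
    return min(1.0, sum(pause_durations_s) / WINDOW_SECONDS)


def should_alert(ratio):
    """True when the windowed ratio exceeds the alert trigger."""
    return ratio > ALERT_THRESHOLD


blip = [0.4]            # one large transaction pausing writes briefly
sustained = [0.6] * 90  # a slow node forcing a short pause every few seconds

for label, pauses in (("blip", blip), ("sustained", sustained)):
    ratio = paused_ratio(pauses)
    print(f"{label}: {ratio:.2%} alert={should_alert(ratio)}")
```

The single 0.4-second pause is roughly 0.13% of the window and stays silent, while 54 seconds of accumulated pausing is about 18% and crosses the threshold.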
Does this card exist for standalone MariaDB?
No. wsrep_flow_control_paused only exists when the Galera (wsrep) provider is loaded. On a standalone server there is no synchronous cluster to throttle, so the card is not applicable.