At a glance
How far behind a replica is, measured in seconds, taken from system.replicas.absolute_delay. In a replicated ClickHouse cluster, every replica replays a shared log of writes to stay in sync with its peers. absolute_delay is the age of the oldest entry in that log that the replica has not yet applied: in plain terms, how stale a read served by this replica would be. A healthy replica sits at or near zero. A lag of several seconds means recent writes have not yet landed on this node, so any query routed here returns slightly old data. This card is the time-based companion to Replication Queue Size (Top Tables): the queue tells you how much work is pending, this tells you how much time that work represents.
| Field | Detail |
|---|---|
| Data source | SELECT max(absolute_delay) FROM system.replicas (and the per-table breakdown SELECT database, table, absolute_delay FROM system.replicas ORDER BY absolute_delay DESC). The value is in seconds. |
| What it tracks | The maximum replication delay across the replicated tables on the node, in seconds. The headline surfaces the worst-lagging table so a DBA sees the staleness ceiling first. |
| Metric basis | absolute_delay from system.replicas, computed by ClickHouse as the age of the oldest unprocessed log entry relative to now. It is a wall-clock time, not a task count. |
| Why it matters | Lag is the directly user-visible consequence of replication falling behind. Reads from a lagging replica show stale data: dashboards that read the wrong replica display numbers minutes old, and read-after-write consistency breaks if the same client writes to one replica and reads from another. |
| Time window | RT (real-time; system.replicas reflects the current delay on each refresh). |
| Alert trigger | >10s. A replication delay above ten seconds flags the card amber and pages the on-call DBA. Brief sub-second lag is normal under load; a sustained multi-second delay is the signal. |
| Roles | dba, platform, sre |
Calculation
The engine reads absolute_delay from system.replicas, which ClickHouse populates per replicated table on each node, and takes the maximum with SELECT max(absolute_delay) FROM system.replicas.
absolute_delay is calculated by ClickHouse as the difference, in seconds, between the current time and the timestamp of the oldest queue entry this replica has not yet processed. When the queue is empty the delay is zero: the replica is fully caught up. When the queue holds an entry that was logged thirty seconds ago and has not been applied, the delay is roughly thirty seconds.
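The arithmetic in that definition can be sketched in a few lines of Python. This illustrates the description above only; it is not ClickHouse's internal implementation. The function name absolute_delay_seconds, the list of entry timestamps, and the clamp at zero are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def absolute_delay_seconds(unprocessed_entry_times, now):
    """Illustrative only: age of the oldest unprocessed queue entry, in whole seconds.

    `unprocessed_entry_times` is a hypothetical list of the times at which
    each still-pending entry was logged. An empty list means the replica is
    fully caught up, so the delay is zero.
    """
    if not unprocessed_entry_times:
        return 0
    oldest = min(unprocessed_entry_times)
    # Assumption: clamp at zero so a slightly slow clock never yields a negative delay.
    return max(0, int((now - oldest).total_seconds()))


now = datetime(2026, 4, 14, 16, 30, 0, tzinfo=timezone.utc)
pending = [now - timedelta(seconds=30), now - timedelta(seconds=4)]

caught_up_delay = absolute_delay_seconds([], now)
backlog_delay = absolute_delay_seconds(pending, now)
print(caught_up_delay, backlog_delay)
```

Only the oldest pending entry matters: the entry logged four seconds ago does not change the result.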
The headline uses the maximum across tables because staleness is a ceiling, not an average: one table lagging twelve seconds means reads of that table are twelve seconds stale even if every other table is current. A DBA needs to see the worst case, then drill into the per-table breakdown to find which table and why.
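A small sketch of why the headline takes the maximum rather than the mean. The rows imitate the shape of the per-table breakdown query; the table names and delay values are invented, and headline_delay is a hypothetical helper.

```python
# Hypothetical rows shaped like the per-table breakdown:
# (database, table, absolute_delay in seconds). Values are made up.
rows = [
    ("analytics", "clickstream_repl", 12),
    ("analytics", "sessions_repl", 0),
    ("billing", "invoices_repl", 1),
]


def headline_delay(rows):
    """Return (max_delay, worst_row); (0, None) when there are no replicated tables."""
    if not rows:
        return 0, None
    worst = max(rows, key=lambda r: r[2])
    return worst[2], worst


max_delay, worst = headline_delay(rows)
mean_delay = sum(r[2] for r in rows) / len(rows)
print(max_delay, worst[1], round(mean_delay, 2))
```

With these numbers the mean sits under the ten-second line while the worst table is above it, which is the case an average would hide.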
The 10-second threshold is a practical staleness line. Sub-second lag is normal and constant on a busy cluster: replicas are always a fraction behind the freshest write. Single-digit seconds during a heavy merge or a brief network blip are tolerable. A delay that holds above ten seconds means recent data is meaningfully stale on this replica, which is where dashboards and read-after-write workflows start to misbehave. Because absolute_delay is driven by the replication queue, this card and Replication Queue Size (Top Tables) should be read together: the queue is the cause, the delay is the effect.
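One way to separate a brief blip from a delay that holds above the threshold is to require several consecutive samples above the line. How long the card itself waits before alerting is not stated here, so the three-sample window below is an assumption, as is the helper name is_sustained_breach.

```python
THRESHOLD_SECONDS = 10   # the card's alert line
CONSECUTIVE_SAMPLES = 3  # assumption: how many refreshes count as "sustained"


def is_sustained_breach(samples, threshold=THRESHOLD_SECONDS, needed=CONSECUTIVE_SAMPLES):
    """True if the most recent `needed` samples are all strictly above the threshold."""
    if len(samples) < needed:
        return False
    return all(s > threshold for s in samples[-needed:])


blip = [0, 1, 14, 2, 0]          # one spike during a merge, then recovery
sustained = [3, 11, 18, 26, 34]  # delay keeps climbing

print(is_sustained_breach(blip), is_sustained_breach(sustained))
```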
Worked example
A platform team runs a three-replica ReplicatedMergeTree cluster serving an analytics product. Read replicas sit behind a load balancer that does not currently prefer in-sync nodes. Snapshot taken on 14 Apr 26 at 17:30 BST.
| Replica | Table | absolute_delay | queue_size |
|---|---|---|---|
| replica-01 (leader) | clickstream_repl | 0 s | 0 |
| replica-02 | clickstream_repl | 34 s | 280 |
| replica-03 | clickstream_repl | 2 s | 14 |
- One replica is the problem, not the cluster. The leader is at zero and replica-03 is at a healthy 2 seconds. Only replica-02 is badly behind, with a queue of 280 entries driving it. This is a single-node issue, almost always capacity (slow disk, saturated network) or a transient catch-up after a restart.
- Reads from replica-02 are 34 seconds stale. Because the load balancer fans reads across all three replicas, roughly a third of dashboard queries are hitting a node that is missing 34 seconds of recent events. Users see numbers that flicker between current and stale depending on which replica served them.
- The lag and queue agree. A 34-second delay with a 280-entry queue is internally consistent: the replica has a real backlog. If the delay were high but the queue near zero, that would point at a stuck oldest-entry rather than volume, a different diagnosis.
The team set max_replica_delay_for_distributed_queries to 30 seconds, which immediately fenced replica-02 (34 seconds behind) out of the read pool and restored consistent reads from the two healthy nodes. The root cause was disk I/O saturation on replica-02 during a large merge; once the merge completed, the queue drained and the delay returned to under a second. The durable fix was matching replica-02’s storage tier to its peers so it could keep pace.
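The effect of the 30-second limit on this snapshot can be pictured as a simple filter. This mimics the outcome on the worked-example numbers only; it is not how ClickHouse implements replica selection, and eligible_replicas is a hypothetical helper.

```python
MAX_REPLICA_DELAY_SECONDS = 30  # value chosen in the worked example

# absolute_delay per replica, taken from the snapshot table
snapshot = {"replica-01": 0, "replica-02": 34, "replica-03": 2}


def eligible_replicas(delays, max_delay):
    """Replicas whose delay does not exceed max_delay stay in the read pool."""
    return sorted(name for name, delay in delays.items() if delay <= max_delay)


read_pool = eligible_replicas(snapshot, MAX_REPLICA_DELAY_SECONDS)
print(read_pool)
```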
Three takeaways:
- Lag is staleness, and staleness is user-visible. A high absolute_delay means reads from that replica are out of date. The first move is to stop routing reads to the lagging node, not to debug the cause.
- Cross-check lag against queue size. High lag with a large queue is a volume/capacity problem; high lag with a near-empty queue is a stuck oldest-entry. They need different fixes.
- One lagging replica is a capacity or catch-up issue; all replicas lagging is a Keeper or leader problem. If every replica’s delay climbs together, look at the coordination layer or the write rate on the leader, not at any single node.
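The takeaways amount to a small decision table, sketched below against the worked-example snapshot. The cut-off for a "near-empty" queue is an assumption, and the function names are hypothetical; treat this as a triage aid, not a rule the card applies.

```python
LAG_THRESHOLD_SECONDS = 10  # the card's alert line
SMALL_QUEUE = 5             # assumption: what counts as a "near-empty" queue


def diagnose_replica(delay, queue_size):
    """Classify one replica from its absolute_delay and queue_size."""
    if delay <= LAG_THRESHOLD_SECONDS:
        return "healthy"
    if queue_size <= SMALL_QUEUE:
        return "stuck oldest entry"
    return "backlog (volume/capacity)"


def diagnose_cluster(replicas):
    """`replicas` maps replica name -> (delay, queue_size)."""
    per_replica = {name: diagnose_replica(d, q) for name, (d, q) in replicas.items()}
    lagging = sorted(n for n, verdict in per_replica.items() if verdict != "healthy")
    if lagging and len(lagging) == len(replicas):
        scope = "all replicas lagging: check Keeper and the write rate"
    elif lagging:
        scope = "per-node issue on: " + ", ".join(lagging)
    else:
        scope = "no replica above threshold"
    return per_replica, scope


snapshot = {"replica-01": (0, 0), "replica-02": (34, 280), "replica-03": (2, 14)}
verdicts, scope = diagnose_cluster(snapshot)
print(verdicts["replica-02"], "|", scope)
```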
Sibling cards
| Card | Why pair it with Replication Lag | What the combination tells you |
|---|---|---|
| Replication Queue Size (Top Tables) | The cause behind the delay. | High lag with a large queue equals a real backlog; high lag with an empty queue equals a stuck entry. |
| Active Replicas | How many replicas are participating. | Lag rising as the active-replica count drops means a node fell out and the rest are absorbing its load. |
| Merges In Progress | Heavy merges on a replica raise its delay. | High lag during a merge burst is usually transient and clears when merges finish. |
| Database Disk Usage % | A full or slow disk throttles fetches and grows lag. | High lag plus high disk usage points at storage as the bottleneck. |
| Memory Usage % | Memory pressure slows merge and fetch throughput. | High lag plus high memory equals a resource-starved replica. |
| Active Parts (Top 10 Tables) | Lagging merge replication lets parts accumulate. | High lag on merges plus rising parts equals a backlog that can lead to TOO_MANY_PARTS. |
| ClickHouse Health Score | The composite that weights replication lag. | A sustained lag breach pulls the composite down. |
Reconciling against the source
Where to look in ClickHouse’s own tooling:
- Read the delay directly from system.replicas in clickhouse-client: SELECT database, table, absolute_delay FROM system.replicas ORDER BY absolute_delay DESC.
- Cross-check the leader and log pointers with SELECT is_leader, log_pointer, log_max_index, last_queue_update FROM system.replicas. For a stuck oldest-entry, inspect system.replication_queue (num_tries, last_exception).
- On ClickHouse Cloud, replication is service-managed, so this card targets self-managed ReplicatedMergeTree clusters; Cloud surfaces replica sync state in its own monitoring view.
Why our number may legitimately differ from a manual query:
| Reason | Direction | Why |
|---|---|---|
| Snapshot timing | Higher or lower | absolute_delay is computed against now(), so it ticks upward every second a backlog persists. A query run moments later shows a different value. |
| Replica scope | Card reflects one node | The card reads the configured node; absolute_delay is per replica, so another replica may show a very different lag. Query each, or use clusterAllReplicas. |
| Max vs per-table | Card shows the worst case | The headline uses the maximum delay across tables; a manual query against one table sees only that table’s lag. |
| Clock skew | Marginal | absolute_delay depends on the node’s clock; significant NTP drift between nodes can distort the value. Keep clocks synchronised. |
How this card should line up with related cards:
| Card | Expected relationship | What causes divergence |
|---|---|---|
| ClickHouse Event Ingest vs Ecom Orders | If a read replica lags, ingested events appear delayed on that replica’s reads even though orders are flowing. | Orders steady but analytics looks behind equals a lagging read replica, not a broken pipeline. |
Known limitations / FAQs
What exactly does absolute_delay measure?
It is the age, in seconds, of the oldest replication-log entry this replica has not yet applied, measured against the current time. Put simply, it is how stale a read from this replica would be. Zero means fully caught up; ten means writes from the last ten seconds have not yet landed here.
My lag is a fraction of a second all the time. Is that bad?
No, that is healthy. On a busy cluster replicas are always a hair behind the freshest write, so sub-second lag is the normal steady state. The card flags only sustained delay above ten seconds. Persistent single-digit-millisecond or sub-second lag means replication is keeping pace.
Lag is high but the queue is nearly empty. How is that possible?
This is the stuck-oldest-entry case. absolute_delay is driven by the oldest unprocessed entry, so a single entry that cannot complete (a fetch for a part that no longer exists on the source, for example) keeps the delay climbing even though there is almost nothing else queued. Inspect system.replication_queue for an entry with a high num_tries and a repeating last_exception, and repair the underlying cause.
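A minimal sketch of that inspection, applied to rows already fetched from system.replication_queue. The rows, the part names, and the exception text are invented for illustration, and the retry cut-off of 100 is an assumption; choose one that fits your cluster's normal retry behaviour.

```python
# Hypothetical rows carrying a few columns of system.replication_queue.
# Part names and the exception message are invented.
queue_rows = [
    {"type": "GET_PART", "new_part_name": "202604_10_10_0",
     "num_tries": 1, "last_exception": ""},
    {"type": "GET_PART", "new_part_name": "202604_3_3_0",
     "num_tries": 412, "last_exception": "part not found on source replica"},
]

MIN_TRIES = 100  # assumption: retry count that marks an entry as stuck


def stuck_entries(rows, min_tries=MIN_TRIES):
    """Entries retried at least `min_tries` times that also carry an exception message."""
    return [r for r in rows if r["num_tries"] >= min_tries and r["last_exception"]]


stuck = stuck_entries(queue_rows)
print([r["new_part_name"] for r in stuck])
```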
How do I stop users seeing stale data while I fix the lag?
Fence the lagging replica out of the read path. For distributed queries set max_replica_delay_for_distributed_queries so replicas beyond a delay threshold are skipped, and use load_balancing = 'in_order' (or first_or_random) so reads follow the replica order in the cluster configuration, with the known-healthy nodes listed first. This makes reads consistent immediately, buying time to address the underlying capacity or stuck-entry issue.
All my replicas are lagging at once. What does that point to?
A shared cause. The two usual suspects are the coordination layer (Keeper/ZooKeeper under latency, or a degraded quorum, slowing every replica’s log processing) and the leader’s write rate (inserts arriving faster than any replica can replay). Check Keeper health and the leader’s insert rate before treating it as a per-node problem.
Does clock skew affect this number?
Yes. Because absolute_delay is computed relative to now() on the node reading it, a node whose clock drifts from its peers will report a distorted delay. Keep NTP tight across the cluster; a sudden lag jump on one node with no queue growth can be a clock problem rather than a replication problem.
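A simplified model of the distortion, assuming the entry timestamp is correct and only the reading node's clock is off. The function reported_delay and the clamp at zero are assumptions of this sketch, not ClickHouse behaviour taken from its source.

```python
def reported_delay(true_delay_seconds, clock_offset_seconds):
    """Delay a node would report if its clock runs ahead (+) or behind (-) by the offset.

    Simplified model: only the reading node's clock is wrong.
    Assumption: negative results are clamped to zero.
    """
    return max(0, true_delay_seconds + clock_offset_seconds)


in_sync = reported_delay(2, 0)
clock_ahead = reported_delay(2, 15)    # looks like a breach, yet the queue did not grow
clock_behind = reported_delay(2, -15)  # real lag is hidden
print(in_sync, clock_ahead, clock_behind)
```

In this model a clock running fifteen seconds fast turns a healthy two-second lag into an apparent breach, while a slow clock masks real lag entirely.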
On ClickHouse Cloud, is replication lag my responsibility?
On ClickHouse Cloud the service manages replication, so the Cloud monitoring view is the primary place to watch sync state. This card is most useful for self-managed ReplicatedMergeTree clusters where you own the replica hardware and the Keeper/ZooKeeper layer and need to diagnose and clear lag yourself.