At a glance
The average time a single refresh operation takes, in milliseconds, computed as indices.refresh.total_time_in_millis / indices.refresh.total. A refresh is what makes newly indexed documents searchable: it writes the in-memory indexing buffer into a new Lucene segment and opens that segment for search. A climbing average means refreshes are getting slower, which usually means segments are stacking up faster than merges can consolidate them, or disk I/O is struggling to keep pace. Slow refreshes delay how quickly new or updated documents appear in search results and are an early warning of indexing-side strain.
| API basis | Index stats, GET /_stats/refresh (or GET /_nodes/stats/indices/refresh). The two counters are refresh.total_time_in_millis (cumulative time spent refreshing) and refresh.total (count of refresh operations). |
| Metric basis | A ratio of cumulative time over cumulative count, computed as a delta over the window so it reflects recent refresh cost, not the all-time average since cluster start. |
| Aggregation window | 1h rolling. The card takes the change in both counters over the last hour and divides, giving the average refresh duration for that hour. |
| Alert threshold | > 1000ms. A refresh that averages over one second means segment creation has become expensive, typically from segment proliferation or disk pressure. |
| Default refresh interval | Elasticsearch refreshes every 1s by default per index (index.refresh_interval). This card measures how long each refresh takes, not how often it runs; the two interact (a longer interval means fewer, larger refreshes). |
| What counts | Time spent in the refresh operation itself: opening a new searchable segment from the indexing buffer. Aggregated across all indices unless scoped. |
| What does NOT count | Flush time (a Lucene commit that persists segments to disk and trims the translog, tracked separately), merge time (background segment consolidation), and the indexing operation itself. Refresh, flush, and merge are three distinct lifecycle stages. |
| Time window | 1h (rolling, delta-based) |
| Alert trigger | > 1000ms, refreshes averaging over a second signal segments stacking up. |
| Roles | platform, sre, dba |
Calculation
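The metric is a delta ratio over the one-hour window:
avg refresh time (ms) = Δ refresh.total_time_in_millis / Δ refresh.total
where Δ is the change in each counter between the start and the end of the window.
The > 1000ms alert is set where the delay starts to be felt in search freshness and where it reliably indicates the merge pipeline is falling behind.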
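The delta ratio can be sketched in a few lines of Python. This is a minimal illustration, not the card's actual implementation: the function name `avg_refresh_ms` and the snapshot values are hypothetical, and returning `None` for an empty window is an assumption made here to avoid dividing by zero.

```python
from typing import Optional


def avg_refresh_ms(start: dict, end: dict) -> Optional[float]:
    """Average refresh duration (ms) between two counter snapshots.

    Each snapshot holds the two cumulative counters from the refresh stats:
    {"total": <refresh count>, "total_time_in_millis": <cumulative ms>}.
    """
    delta_count = end["total"] - start["total"]
    delta_ms = end["total_time_in_millis"] - start["total_time_in_millis"]
    if delta_count <= 0:
        # Assumption: no refreshes in the window means there is no average.
        return None
    return delta_ms / delta_count


# Hypothetical snapshots taken one hour apart.
hour_ago = {"total": 10_000, "total_time_in_millis": 1_200_000}
now = {"total": 13_600, "total_time_in_millis": 6_024_000}

print(avg_refresh_ms(hour_ago, now))  # 1340.0
```

With these illustrative numbers the window contains 3,600 refreshes costing 4,824,000 ms in total, so the average sits above the 1000ms threshold.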
Worked example
A platform team runs an Elasticsearch cluster that ingests a product catalogue feed plus a high-volume clickstream into time-based indices. On 03 Jun 26 the Avg Index Refresh Time card has drifted from a baseline of ~120ms to 1,340ms over the past hour and trips the sensitivity alert. Pulling GET /_stats/refresh deltas for the busiest indices:
| index | delta refresh count (1h) | delta refresh time (ms) | avg per refresh |
|---|---|---|---|
| clickstream-2026.06.03 | 3,600 | 4,824,000 | 1,340ms |
| products | 60 | 4,800 | 80ms |
The products index is fine; the regression is entirely in clickstream-2026.06.03. The team checks segment counts with GET /_cat/segments/clickstream-2026.06.03?v and finds the shard holding 480 segments, far above the healthy double-digit range.
The team raises index.refresh_interval from 1s to 30s, cutting refresh frequency 30-fold and letting merges catch up. They also move the index’s data to gp3 volumes with higher provisioned IOPS so the merge pipeline is no longer I/O-starved. Within two hours the segment count falls to 60 and the average refresh time settles back to ~140ms.
- Refresh time is an indexing-health canary. It climbs before search latency does, because the segment proliferation that slows refreshes also eventually slows queries. Catching it here gives you a head start.
- The lever is usually refresh_interval, scoped per index. Not every index needs one-second freshness. Analytical and log indices tolerate 30s happily; only user-facing search indices need sub-second refresh. Tune per index, not globally.
- Disk I/O is the silent partner. Refresh and the merges behind it are I/O-bound. A refresh-time regression on slow disks is often really a disk problem wearing an indexing costume.
Sibling cards
| Card | Why pair it with Avg Index Refresh Time | What the combination tells you |
|---|---|---|
| Indexing Rate (docs/sec) | The ingestion volume driving segment creation. | A spike in indexing rate followed by rising refresh time is the classic “merges falling behind” pattern. |
| Bulk Rejections (24h) | The next failure stage if indexing back-pressure worsens. | Slow refreshes plus bulk rejections means the write pipeline is genuinely saturated. |
| Search Latency p95 (ms) | The downstream effect of segment proliferation. | Rising refresh time and rising p95 together confirm too many segments are hurting both write and read. |
| Replica Sync Lag | Replicas refresh too; lag and slow refresh share causes. | Slow refresh on replicas widens sync lag and delays consistency. |
| JVM Heap Used % | Merge pressure consumes heap; heap pressure slows merges. | High heap plus slow refresh is a self-reinforcing merge-pressure loop. |
| Storage Usage % | Many small segments also waste disk. | Slow refresh with climbing disk usage points at unmerged segment sprawl. |
| Elasticsearch Health Score | The composite that folds indexing health in. | A health dip with no search-side cause often traces back to refresh/merge strain. |
Reconciling against the source
Where to look in Elasticsearch itself:
- GET /_stats/refresh for cluster-wide refresh.total and refresh.total_time_in_millis; GET /<index>/_stats/refresh to scope to one index. The card computes the same ratio over a delta.
- GET /_cat/segments/<index>?v lists the segments in each shard; a high segment count is the usual cause of a rising number.
- GET /_cat/shards/<index>?v&h=index,shard,prirep,segments.count gives a quick per-shard view.
- GET /<index>/_settings?filter_path=**.refresh_interval confirms the configured refresh interval, and GET /_nodes/stats/indices/merge shows whether merges are keeping up.

Why our number may legitimately differ from a manual reading:
| Reason | Direction | Why |
|---|---|---|
| Delta vs cumulative | Card higher during a regression | We use the hourly delta; dividing the raw lifetime counters gives a smoothed all-time average that hides recent slowdowns. |
| Scope | Either | The card aggregates across all indices by default; a single-index _stats call will differ if one index dominates. |
| Window boundary | Marginal | Your manual now and the card’s last poll bracket slightly different hours. |
| Counter reset on restart | Card dips | A node restart zeroes the cumulative counters; the next delta is computed from the restart, not from before it. |
| Managed service sampling | Either | Elastic Cloud and AWS-managed consoles may surface refresh metrics at their own cadence and granularity. |
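
The "Delta vs cumulative" row is easiest to see with numbers. The sketch below uses made-up counter values (a node with a long healthy history followed by one bad hour) and a hypothetical helper named `ratio`; it only illustrates why dividing the raw lifetime counters hides a recent slowdown.

```python
def ratio(total_time_ms: int, count: int) -> float:
    """Plain time-over-count average in milliseconds."""
    return total_time_ms / count


# Hypothetical cumulative counters one hour apart.
before = {"total": 1_000_000, "total_time_in_millis": 120_000_000}
after = {"total": 1_003_600, "total_time_in_millis": 124_824_000}

# Manual reading: divide the lifetime counters as returned by the stats API.
lifetime_avg = ratio(after["total_time_in_millis"], after["total"])

# Card-style reading: divide the change in each counter over the window.
hourly_avg = ratio(
    after["total_time_in_millis"] - before["total_time_in_millis"],
    after["total"] - before["total"],
)

print(f"lifetime: {lifetime_avg:.0f} ms, last hour: {hourly_avg:.0f} ms")
# lifetime: 124 ms, last hour: 1340 ms
```

The lifetime figure barely moves from the historical ~120ms, while the hourly delta exposes the regression.
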
| Card | Expected relationship | What causes divergence |
|---|---|---|
| Indexing Rate (docs/sec) | Refresh time should rise with sustained high indexing. | Slow refresh while indexing is light points at disk I/O, not volume. |
| Search Latency p95 (ms) | Both rise together when segment count is the problem. | p95 rising while refresh is fine points at query complexity, not segments. |
Known limitations / FAQs
What is the difference between refresh, flush, and merge?
Three distinct lifecycle stages. A refresh opens the in-memory indexing buffer as a new searchable Lucene segment (default every 1s); this is what makes new docs searchable. A flush performs a Lucene commit that persists segments to disk for durability and then trims the translog. A merge is a background job that consolidates many small segments into fewer larger ones. This card measures only refresh time. Slow refreshes usually trace back to merges falling behind, but the counters are separate.
My refresh time climbed but indexing volume did not change. Why?
Look at disk I/O first. Refresh and the merges behind it are I/O-bound, so a degraded volume (noisy neighbour on shared storage, exhausted burst credits on gp2, a failing disk) slows refreshes even at constant load. Check GET /_nodes/stats/fs and the host’s disk-utilisation metrics. A second possibility is a mapping change that added expensive fields (high-cardinality keywords, many sub-fields), which makes each segment more costly to build.
Can I just raise refresh_interval to fix this?
Often yes, and it is the most effective lever, but it is a trade-off, not a free win. A longer interval means fewer, larger refreshes (less merge pressure, lower refresh time) at the cost of search freshness: new documents take up to the interval to become searchable. Raise it for analytical and log indices that do not need sub-second freshness; keep it low for user-facing search indices. Set it per index, never blindly cluster-wide.
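As a rough illustration of a per-index change, the sketch below builds the request for the update index settings API (PUT /<index>/_settings). The helper name `refresh_interval_request` is hypothetical, the index name and 30s value are taken from the worked example, and the snippet deliberately stops short of sending anything, since client choice and authentication depend on your environment.

```python
import json
from typing import Tuple


def refresh_interval_request(index: str, interval: str) -> Tuple[str, str, str]:
    """Build (method, path, body) for a per-index refresh_interval change."""
    body = json.dumps({"index": {"refresh_interval": interval}})
    return "PUT", f"/{index}/_settings", body


method, path, body = refresh_interval_request("clickstream-2026.06.03", "30s")
print(method, path)
print(body)
```

Targeting a single index in the path, rather than a wildcard or _all, keeps the change scoped to the index that actually tolerates staler search results.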
Does this card include the replicas?
By default the aggregate spans primaries and replicas, since replicas refresh independently to stay searchable. If a regression appears only on replica shards, suspect those nodes’ disks specifically. To isolate primary-vs-replica behaviour, call GET /<index>/_stats/refresh?level=shards for per-shard refresh counters and inspect per-shard segment counts.
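One way to separate primaries from replicas is to group shard-level refresh counters by role. The sketch below assumes the response shape of a shard-level index stats call (level=shards), where each shard ID maps to a list of shard copies carrying a routing.primary flag; the sample payload and the function name `refresh_avg_by_role` are hypothetical. It averages the raw counters from a single snapshot, so to match the card you would apply it to deltas between two snapshots.

```python
from collections import defaultdict


def refresh_avg_by_role(stats: dict, index: str) -> dict:
    """Average refresh time (ms) for primary and replica shard copies."""
    totals = defaultdict(lambda: {"count": 0, "ms": 0})
    for copies in stats["indices"][index]["shards"].values():
        for shard_copy in copies:
            role = "primary" if shard_copy["routing"]["primary"] else "replica"
            totals[role]["count"] += shard_copy["refresh"]["total"]
            totals[role]["ms"] += shard_copy["refresh"]["total_time_in_millis"]
    # Skip roles with no refreshes to avoid dividing by zero.
    return {r: t["ms"] / t["count"] for r, t in totals.items() if t["count"] > 0}


# Hypothetical, heavily trimmed response for one shard with one replica.
sample = {
    "indices": {
        "products": {
            "shards": {
                "0": [
                    {"routing": {"primary": True},
                     "refresh": {"total": 100, "total_time_in_millis": 12_000}},
                    {"routing": {"primary": False},
                     "refresh": {"total": 100, "total_time_in_millis": 90_000}},
                ]
            }
        }
    }
}

print(refresh_avg_by_role(sample, "products"))
```

In this made-up payload the replica copy averages far more per refresh than the primary, which is the pattern that would point at the replica's node.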
The number dropped to near zero suddenly. Is that good?
Check whether a node restarted. The underlying counters are cumulative since node start, so a restart resets them and the next hourly delta is computed from a near-empty base, which can read artificially low for the first hour. It can also mean indexing genuinely stopped (no new docs means few refreshes). Pair with Indexing Rate (docs/sec) to tell the two apart.
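The restart case can be handled when computing a counter delta. The sketch below is one possible approach under a stated assumption, not a description of the card's internals: if a cumulative counter goes backwards, it is treated as having been zeroed by a restart, so only the post-restart value is attributed to the window. The function name `window_delta` is hypothetical.

```python
def window_delta(start: int, end: int) -> int:
    """Change in a cumulative counter over a window, tolerating a reset."""
    if end < start:
        # Assumption: a decrease means the counter restarted from zero,
        # so everything counted since the restart belongs to this window.
        return end
    return end - start


# Normal hour: the counter simply grows.
print(window_delta(10_000, 13_600))  # 3600

# Hour containing a restart: the counter is smaller than it was before.
print(window_delta(13_600, 40))  # 40
```

A very small delta after a restart is the "near-empty base" described above, which is why the first post-restart hour deserves a second look before it is read as an improvement.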
How does refresh time relate to search latency?
They share a root cause: segment proliferation. Too many segments make every refresh more expensive and also force searches to consult more segments per query, raising latency. So a rising refresh time is often an early warning that search latency will follow if merges do not catch up. If you see refresh time climbing, check Search Latency p95 (ms) and the segment count before it becomes a read-side problem too.
Is a high refresh time ever expected and acceptable?
During a large bulk reindex with refresh_interval set to -1 (refresh disabled), you may see a single very expensive refresh when it is re-enabled, because all the accumulated buffer flushes at once. That is intentional and a known reindex pattern. Outside such deliberate bulk loads, a sustained average over 1,000ms warrants investigation.