At a glance
Query Latency p99 (ms) is the 99th-percentile read latency for the MongoDB deployment over a rolling 5-minute window: 99% of reads complete at or below this value, and the slowest 1% take longer. This is the near-worst-case line that platform teams care about for SLOs. The slowest 1% of reads are where timeouts, retries, and abandoned requests live, and a single user can hit several of them in one session. The card alerts at 500ms, because a p99 above half a second means real requests are stalling long enough to trip client timeouts and cascade retries upstream.
| Field | Detail |
|---|---|
| What it tracks | The 99th-percentile read operation latency for the deployment, the far tail of the read-latency distribution. |
| Data source | The opLatencies.reads operation-latency histogram from serverStatus (same source family as the p50 and p95 cards, read at the 99th-percentile bucket). |
| Time window | RT/5m (real-time, rolling 5-minute window). |
| Alert trigger | > 500ms sustained. A p99 above 500ms means the slowest 1% of reads are stalling long enough to risk client timeouts and retry storms. |
| Calculation basis | Read-side latency only; the extreme tail, not the typical case. |
| Units | Milliseconds (ms). |
| Roles | platform, engineering, sre |
Calculation
The p99 is sourced from the same serverStatus operation-latency surface as its p50 and p95 siblings, but it is read at the far end of the distribution. On deployments that expose the latency histogram (opLatencies with histogram buckets, surfaced natively on Atlas), the engine reads the 99th-percentile bucket directly for the 5-minute window. The card samples the cumulative opLatencies.reads counters at the window boundaries and works from the delta. This way the value reflects current behaviour rather than the average since process start.
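The following is a minimal Python sketch of the histogram path. It makes several assumptions:

- Each snapshot of the reads histogram is a list of {"micros": lower_bound, "count": n} entries.
- The counts are cumulative since process start.
- The helper names and sample counts are hypothetical.
- Reporting the lower edge of the matching bucket is an illustrative choice, not the card's confirmed implementation.

```python
import math

def histogram_delta(before, after):
    """Per-bucket read counts accumulated between two cumulative snapshots.

    Each snapshot is a list of {"micros": bucket_lower_bound, "count": n}.
    """
    prev = {b["micros"]: b["count"] for b in before}
    delta = {}
    for bucket in after:
        d = bucket["count"] - prev.get(bucket["micros"], 0)
        if d < 0:
            # Cumulative counters only grow, so a drop means the process restarted.
            raise ValueError("counter reset between snapshots")
        if d:
            delta[bucket["micros"]] = d
    return delta

def bucket_percentile_ms(delta, pct):
    """Lower edge (in ms) of the bucket that holds the pct-th percentile read."""
    total = sum(delta.values())
    if total == 0:
        return None  # no reads in the window
    target = math.ceil(pct * total / 100)  # 1-based rank of the percentile read
    running = 0
    for lower_micros in sorted(delta):
        running += delta[lower_micros]
        if running >= target:
            return lower_micros / 1000  # microseconds -> milliseconds
    return None

# Hypothetical snapshots taken 5 minutes apart.
before = [{"micros": 1024, "count": 900}, {"micros": 65536, "count": 50},
          {"micros": 524288, "count": 5}]
after = [{"micros": 1024, "count": 1880}, {"micros": 65536, "count": 90},
         {"micros": 524288, "count": 35}]

window = histogram_delta(before, after)
print(bucket_percentile_ms(window, 95))  # 65.536
print(bucket_percentile_ms(window, 99))  # 524.288
```

The window holds 1,050 reads, so the p99 is the read at rank 1,040, which falls in the highest bucket. Each bucket only records a lower bound, so the result is only as precise as the bucket width.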
Where the histogram is not available, MongoDB exposes only scalar latency and ops totals, from which a true 99th percentile cannot be recovered. In that mode the card reports the windowed mean read latency and labels the reading as an average rather than a percentile. This keeps you from assuming a 1%-tail figure is available when it is not. For typical latency distributions the mean sits well below the p99, so treat it as a rough floor rather than a substitute. The value is converted from microseconds to milliseconds for display. The 500ms alert is evaluated on the windowed reading, so a lone slow query does not trip it.
The p99 is, by construction, noisier than the p95: it summarises far fewer operations, so a small number of pathological reads can move it sharply. That sensitivity is the point. p99 is where you see the rare-but-real stalls that an aggregate average would bury.
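The Python sketch below shows why a handful of reads can swing the p99 while leaving the p95 alone. It uses the nearest-rank percentile definition on hypothetical in-memory samples. A real window holds far more reads, but the proportions behave the same way.

```python
import math

def nearest_rank(samples, pct):
    """Smallest sample such that at least pct% of samples are at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]

# Hypothetical baseline window: 1,000 reads in three latency groups (ms).
baseline = [4.0] * 940 + [60.0] * 50 + [120.0] * 10
# The same window plus 12 pathological reads (about 1.2% of the total).
with_stalls = baseline + [900.0] * 12

for name, window in [("baseline", baseline), ("with stalls", with_stalls)]:
    print(name, "p95 =", nearest_rank(window, 95), "p99 =", nearest_rank(window, 99))
# baseline p95 = 60.0 p99 = 60.0
# with stalls p95 = 60.0 p99 = 900.0
```

Twelve extra reads moved the p99 fifteen-fold and left the p95 unchanged, which is the sensitivity described above.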
Worked example
A platform team runs a sharded cluster backing the catalogue and cart services for a flash-sale event. Snapshot taken on 22 May 26 at 12:00 BST, ten minutes into a promotion that drove a sudden traffic spike.
| Window (5m) | Reads in window | p50 (ms) | p95 (ms) | p99 (ms) | Note |
|---|---|---|---|---|---|
| 11:45 | 380,000 | 4 | 52 | 118 | Pre-sale baseline |
| 11:50 | 620,000 | 5 | 71 | 190 | Sale opens, traffic spikes |
| 11:55 | 910,000 | 6 | 144 | 470 | Tail stretching |
| 12:00 | 940,000 | 8 | 211 | 612 | Alert fires (p99 > 500ms) |
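A few lines of Python can check the table. The figures are copied from the rows above. The 1%-tail count is simply 1% of the reads in each window, an approximation of how many reads sit beyond the p99. The 500ms threshold is the card's default.

```python
ALERT_MS = 500  # the card's default p99 alert threshold

# (window, reads in window, p50, p95, p99) copied from the worked example table
windows = [
    ("11:45", 380_000, 4, 52, 118),
    ("11:50", 620_000, 5, 71, 190),
    ("11:55", 910_000, 6, 144, 470),
    ("12:00", 940_000, 8, 211, 612),
]

def summarise(rows):
    """Tail size, p99-to-p95 ratio and alert state for each window."""
    out = []
    for label, reads, _p50, p95, p99 in rows:
        out.append({
            "window": label,
            "tail_reads": reads // 100,         # roughly the reads slower than p99
            "p99_to_p95": round(p99 / p95, 2),  # the gap diagnostic
            "alert": p99 > ALERT_MS,
        })
    return out

for row in summarise(windows):
    print(row)
```

Only the 12:00 window breaches, with roughly 9,400 reads in its slowest 1%. The p99-to-p95 ratio stays in a fairly steady band of about 2.3 to 3.3 while both values climb. That is closer to the "whole tail stretching" pattern than to a single outlier query.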
- p99 is an SLO line, not a vanity metric. If your service promises “99% of requests under X ms”, this card is the database-layer input to that promise. When it breaches, your SLO is at risk before any average shows it.
- The p99-to-p95 gap is diagnostic. A wide gap means a few very slow operations (chase those specific queries); a narrow gap that moves with p95 means the whole tail is stretching (chase capacity).
- p99 stalls cause retry storms. Reads slow enough to hit client timeouts get retried, which adds load, which slows more reads. A breaching p99 can be self-amplifying, so act on it before it compounds.
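The gap heuristic in the second bullet can be made explicit. This Python sketch is one possible encoding, not the card's logic. Two cut-offs in it are hypothetical and would need tuning per workload:

- the 3x "wide gap" ratio;
- the rule that p95 must grow by at least half as much as p99 to count as "moving with" it.

```python
WIDE_GAP_RATIO = 3.0  # hypothetical: p99 at least 3x p95 counts as a "wide" gap
MOVES_WITH = 0.5      # hypothetical: p95 grew by at least half as much as p99

def diagnose(prev, curr):
    """prev and curr are (p95_ms, p99_ms) for two consecutive windows."""
    prev_p95, prev_p99 = prev
    curr_p95, curr_p99 = curr
    if curr_p99 <= prev_p99:
        return "p99 not rising"
    p95_growth = curr_p95 / prev_p95 - 1
    p99_growth = curr_p99 / prev_p99 - 1
    tail_moves_together = p95_growth >= MOVES_WITH * p99_growth
    if curr_p99 / curr_p95 >= WIDE_GAP_RATIO and not tail_moves_together:
        return "few very slow operations: chase the specific queries"
    if tail_moves_together:
        return "whole tail stretching: chase capacity"
    return "inconclusive"

print(diagnose((71, 190), (144, 470)))  # 11:50 -> 11:55 from the worked example
print(diagnose((60, 120), (62, 650)))   # hypothetical single-query regression
```

The flash-sale windows classify as a capacity problem. A flat p95 under a jumping p99 points at individual queries instead.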
Sibling cards
| Card | Why pair it with Query Latency p99 | What the combination tells you |
|---|---|---|
| Query Latency p95 (ms) | The next tail line down. | A wide p99-to-p95 gap means a few very slow ops; a narrow gap means the whole tail is stretching. |
| Query Latency p50 (ms) | The median baseline. | p99 high with p50 flat confirms the problem is the far tail, not the typical query. |
| Top 10 Slow Operations | The named list of the worst offenders. | Identifies the exact operations sitting in the 1% tail. |
| Slow Ops (15m, >100ms) | The profiler count behind the tail. | A rising slow-op count timed to the p99 breach confirms which window to investigate. |
| Shard Balance Skew % | The hot-shard explanation on sharded clusters. | A skew spike timed to the p99 rise points at one overloaded shard. |
| WiredTiger Cache Hit Rate % | The cache-pressure explanation. | A falling hit rate that tracks the p99 rise confirms disk fetches are stretching the tail. |
| Query Error Rate % | The reliability peer. | p99 breach plus rising errors means the slow tail is now timing out and failing. |
| MongoDB Health Score | The composite roll-up. | Confirms whether the far-tail latency is pulling overall health below its threshold. |
Reconciling against the source
Where to look in MongoDB’s own tooling:
- Run db.serverStatus().opLatencies in mongosh. On builds with histograms the document includes the distribution buckets used to derive percentiles; older builds expose only the latency and ops scalars.
- Use the database profiler (db.setProfilingLevel(1, { slowms: 100 })) and query system.profile sorted by millis descending to read the actual slowest operations that occupy the 1% tail.
- On MongoDB Atlas, the Metrics tab plots “Read Latency” with native p99 lines per node; select the node serving the workload and align to the 5-minute window.
- mongostat gives a live, low-overhead view of operation rates, and mongotop shows where read time is being spent by collection.
Why our number may legitimately differ from MongoDB’s native view:
| Reason | Direction | Why |
|---|---|---|
| Tail sampling noise | Variable | p99 summarises far fewer ops than p95, so two tools sampling at slightly different instants can report meaningfully different far-tail values. |
| Cumulative vs windowed | Native scalar often lower | opLatencies is cumulative since process start; the card windows it to 5 minutes, sharpening recent spikes. |
| Percentile vs mean fallback | Native (Atlas) higher | Without a histogram the card reports a windowed mean as a rough floor; Atlas reports a true 99th percentile, which for latency data is typically well above the mean. |
| Per-node scope | Variable | We read the targeted node; on a sharded or secondary-reading workload the cluster-wide tail can be worse than any single node shows. |
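The “Cumulative vs windowed” row can be reproduced from the scalar latency (total microseconds) and ops fields of opLatencies.reads. This Python sketch uses hypothetical counter values for a long-running process that hits a five-minute spike. The helper name is illustrative.

```python
def mean_read_ms(latency_micros, ops):
    """Mean read latency in ms from opLatencies.reads-style totals."""
    if ops <= 0:
        return None
    return latency_micros / ops / 1000

# Hypothetical cumulative counters sampled at the two window boundaries.
start = {"latency": 30_000_000_000, "ops": 10_000_000}  # 3 ms mean so far
end = {"latency": 210_000_000_000, "ops": 10_900_000}   # after a 5-minute spike

cumulative = mean_read_ms(end["latency"], end["ops"])
windowed = mean_read_ms(end["latency"] - start["latency"], end["ops"] - start["ops"])
print(f"cumulative since start: {cumulative:.2f} ms")  # 19.27 ms
print(f"windowed over 5 minutes: {windowed:.2f} ms")   # 200.00 ms
```

The long quiet history dilutes the spike roughly tenfold in the cumulative figure. This is why the native scalar often reads lower than the card.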
Known limitations / FAQs
Why is the alert at 500ms here but 200ms on the p95 card?
The two cards measure different slices of the distribution. A p95 at 200ms says a noticeable fraction of reads is slow; a p99 at 500ms says the worst 1% has reached the point of tripping client timeouts and retries. The higher absolute threshold reflects that the far tail is naturally slower. For many workloads a 200ms p99 would be healthy, so alerting there would be noise.
My p99 is very spiky compared to p95. Is the card broken?
No, that volatility is inherent to the 99th percentile. It is computed from far fewer operations than p95, so a handful of pathological reads can move it sharply from window to window. Treat sustained breaches, not single-window spikes, as actionable. The alert evaluates the windowed value to filter out lone slow queries.
p99 alerted but p50 and p95 look fine. Where do I even start?
A clean p50 and p95 with a breaching p99 is the classic “a few very slow operations” pattern. Go straight to Top 10 Slow Operations and the profiler (system.profile) sorted by millis. The cause is usually one query:
- a missing index;
- a COLLSCAN on a large collection;
- a hot document under contention;
- a lookup against a hot shard.
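For the “a few very slow operations” case, a small Python helper can rank profiler entries once they are exported, for example from db.system.profile.find().sort({ millis: -1 }). It reads the standard profiler fields millis, ns, planSummary, docsExamined and nreturned. The 100:1 examined-to-returned cut-off and the sample documents are hypothetical.

```python
EXAMINED_PER_RETURNED = 100  # hypothetical cut-off for "scanned far more than it returned"

def triage(profile_docs, limit=10):
    """Slowest profiler entries first, each with hints about the likely cause."""
    worst = sorted(profile_docs, key=lambda d: d.get("millis", 0), reverse=True)
    report = []
    for doc in worst[:limit]:
        hints = []
        if str(doc.get("planSummary", "")).startswith("COLLSCAN"):
            hints.append("COLLSCAN: look for a missing index")
        examined, returned = doc.get("docsExamined", 0), doc.get("nreturned", 0)
        if returned and examined / returned > EXAMINED_PER_RETURNED:
            hints.append(f"examined {examined} docs to return {returned}")
        report.append((doc.get("ns"), doc.get("millis"), hints))
    return report

# Hypothetical profiler entries captured during a p99 breach.
docs = [
    {"op": "query", "ns": "shop.cart", "millis": 12,
     "planSummary": "IXSCAN { userId: 1 }", "docsExamined": 3, "nreturned": 3},
    {"op": "query", "ns": "shop.catalogue", "millis": 1840,
     "planSummary": "COLLSCAN", "docsExamined": 2_400_000, "nreturned": 40},
]
for ns, millis, hints in triage(docs):
    print(ns, millis, hints)
```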
The card reports an average, not a true p99. Why?
Deployments that do not expose the operation-latency histogram cannot produce a real 99th percentile from serverStatus alone. The card therefore falls back to a windowed mean and labels it as such; that mean typically understates the tail. To get a genuine p99, run a build with latency histograms enabled or read the percentile directly from the Atlas Metrics view.
Should I tune the 500ms threshold for my workload?
Yes, if your SLO demands it. A latency-critical service may want to alert at 300ms; a batch or analytics workload may tolerate seconds. The sensitivity threshold is configurable per profile in the Sensitivity tab. Set it to the latency budget your application actually promises rather than the generic default.
Does a p99 breach mean queries are failing?
Not by itself; it means they are slow. But slow reads that exceed the client’s timeout will be cancelled and retried, which can turn a latency problem into an error problem. Watch Query Error Rate % alongside this card: if errors rise in step with the p99, your timeouts are firing and a retry storm may be building.
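The cross-check can be automated, as sketched below in Python. The sketch flags a possible retry storm when all of the following hold:

- the latest p99 breaches 500ms;
- the error rate has risen over the same windows;
- the two series are strongly positively correlated.

The 0.8 correlation cut-off and both error-rate series are hypothetical. The p99 values come from the worked example.

```python
import math

P99_ALERT_MS = 500
MIN_CORRELATION = 0.8  # hypothetical cut-off for "rising in step"

def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is flat."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def retry_storm_suspected(p99_ms, error_rate_pct):
    """True if the newest p99 breaches and errors rose in step with it."""
    if len(p99_ms) != len(error_rate_pct) or len(p99_ms) < 3:
        raise ValueError("need two equal-length series of at least 3 windows")
    return (p99_ms[-1] > P99_ALERT_MS
            and error_rate_pct[-1] > error_rate_pct[0]
            and pearson(p99_ms, error_rate_pct) >= MIN_CORRELATION)

p99_series = [118, 190, 470, 612]     # from the worked example
errors_rising = [0.1, 0.1, 0.4, 1.2]  # hypothetical Query Error Rate %
errors_flat = [0.1, 0.1, 0.1, 0.1]    # hypothetical: slow but not failing
print(retry_storm_suspected(p99_series, errors_rising))  # True
print(retry_storm_suspected(p99_series, errors_flat))    # False
```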