At a glance
The share of query executions that are “slow”, defined as calls to statements whose mean execution time exceeds 100ms, expressed as a percentage of total calls. Where the percentile cards (p95, p99) tell you how slow the tail is, this card tells you how much of your workload is slow at all. A 2% slow-query rate on a million-call window means roughly 20,000 executions belonged to statements averaging over 100ms; a 12% rate means nearly one call in eight is dragging. It is the breadth metric: it catches the difference between “one report is slow” (narrow, low rate, possibly high p99) and “a large slice of normal traffic has degraded” (broad, high rate). For a DBA it is the early-warning gauge that an index regression, a stale plan, or a cache miss is spreading across the workload.
| Data source | pg_stat_statements. Vortex IQ samples the view each interval, computes the delta in calls and total_exec_time per normalised statement, and counts every statement whose windowed mean_exec_time exceeds 100ms as “slow”. The rate is slow-calls divided by total-calls over the window. |
| Metric basis | Call-weighted: a slow statement that ran 10,000 times contributes 10,000 slow calls, not one. The 100ms threshold is on the per-statement mean for the window, not on individual executions (the view does not store per-execution timing). |
| Aggregation window | 15m: the rate is computed over a rolling 15-minute window so a transient blip does not swing the gauge, while still surfacing a developing regression within minutes. |
| Unit | Percentage of total query calls (0 to 100). |
| What counts | Every statement captured by pg_stat_statements on the monitored database whose windowed mean exec time is above 100ms, weighted by its call count in the window. |
| What does NOT count | (1) Time waiting for a connection from the pool; (2) queries that errored before completing (those are on Query Error Rate %); (3) statements not tracked under the current pg_stat_statements.track setting, such as statements nested inside functions when track = top; (4) genuinely fast statements, which never cross the 100ms threshold however many times they run. |
| Time window | 15m (rate computed over a rolling 15-minute window) |
| Alert trigger | >5%. Sustained slow-query rate above 5% means a meaningful and growing share of the workload has crossed the slow line, usually an index, plan, or cache problem rather than a single rogue query. |
| Roles | owner, engineering, operations |
Calculation
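The card derives entirely from deltas on pg_stat_statements, so it reflects the live workload rather than lifetime cumulative history. Each interval the engine snapshots the view, and over the rolling 15-minute window it computes, per normalised statement, the delta in calls and total_exec_time and the resulting windowed mean. Statements whose windowed mean exceeds 100ms contribute all their calls as slow, and the rate is slow calls divided by total calls. Two properties follow: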
- Threshold is on the windowed mean, not per execution. Because pg_stat_statements stores only a per-statement mean (not a histogram), a statement is classified slow-or-not as a whole for the window. A statement that averages 90ms is fully excluded even if some of its individual runs exceeded 100ms; a statement that averages 110ms contributes all its calls as slow. This is a deliberate, conservative approximation: it tends to slightly under-count slow individual executions and is stable, which is what you want in an alerting gauge. Where a provider exposes a true per-execution histogram (Aurora Performance Insights, Cloud SQL Query Insights), the card prefers that finer source.
- Call-weighting makes the rate represent real traffic. A nightly maintenance statement that runs twice and takes 3 seconds each barely moves the rate, because it is two calls out of (say) two million. A web-tier lookup that regressed to 120ms and runs 400,000 times in the window dominates the rate. The gauge is therefore a faithful “what fraction of the work my database actually did was slow” measure, not a count of slow statement shapes.
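The calculation described above can be sketched in a few lines. This is an illustration under stated assumptions, not Vortex IQ's actual implementation: the `Counters` shape, the statement keys, and the sample numbers are hypothetical, and only the 100ms threshold and the delta-then-weight logic come from the text.

```python
from dataclasses import dataclass

SLOW_MS = 100.0  # default threshold taken from the text


@dataclass(frozen=True)
class Counters:
    """Cumulative counters for one normalised statement (illustrative shape)."""
    calls: int
    total_exec_time_ms: float


def slow_query_rate(start: dict[str, Counters], end: dict[str, Counters],
                    slow_ms: float = SLOW_MS) -> float:
    """Call-weighted slow-query rate (0 to 100) between two snapshots."""
    slow_calls = 0
    total_calls = 0
    for stmt, after in end.items():
        before = start.get(stmt, Counters(0, 0.0))
        d_calls = after.calls - before.calls
        d_time = after.total_exec_time_ms - before.total_exec_time_ms
        if d_calls <= 0:
            continue  # statement did not run in the window
        total_calls += d_calls
        # The threshold is applied to the windowed mean, so a slow
        # statement contributes every one of its calls in the window.
        if d_time / d_calls > slow_ms:
            slow_calls += d_calls
    return 100.0 * slow_calls / total_calls if total_calls else 0.0


# Hypothetical snapshots: a web-tier lookup, a nightly job, everything else.
start = {
    "lookup": Counters(10_000_000, 200_000_000.0),
    "nightly": Counters(60, 180_000.0),
    "other": Counters(50_000_000, 250_000_000.0),
}
# Window 1: lookup averages 20ms; only the nightly job (2 calls at 3s) is slow.
end_healthy = {
    "lookup": Counters(10_400_000, 208_000_000.0),
    "nightly": Counters(62, 186_000.0),
    "other": Counters(51_599_998, 257_999_990.0),
}
# Window 2: same traffic, but the lookup regressed to a 120ms mean.
end_regressed = dict(end_healthy, lookup=Counters(10_400_000, 248_000_000.0))

healthy_rate = slow_query_rate(start, end_healthy)
regressed_rate = slow_query_rate(start, end_regressed)
print(f"healthy: {healthy_rate:.4f}%  regressed: {regressed_rate:.4f}%")
```

With two million calls in the window, the two 3-second nightly calls leave the rate near zero, while the 400,000-call lookup at 120ms pushes it to about 20%, which is the call-weighting behaviour the second bullet describes.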
Worked example
A logistics platform runs PostgreSQL 14 behind a tracking API. The baseline slow-query rate sits around 1.2%. At 08:05 on 17 Apr 26, shortly after a routine deploy, the Nerve Centre gauge climbs to 9.4% and the >5% alert fires. p99 latency is up too, but only modestly; the headline story is the breadth, not the depth. The DBA pulls the window detail:
| Statement (normalised) | Windowed mean | Calls in window | Slow? | Share of slow calls |
|---|---|---|---|---|
| SELECT * FROM shipments WHERE tracking_no = $1 | 118ms | 612,000 | yes | 71% |
| SELECT ... FROM events WHERE shipment_id = $1 | 140ms | 188,000 | yes | 22% |
| everything else combined | < 100ms | 5.1M | no | 0% |
- Rate and percentile answer different questions. p99 told them “the tail is a bit worse”; the slow-query rate told them “a large, specific slice of normal traffic regressed”. A high rate concentrated in one statement is the classic signature of a single broken index or stale plan, which is faster to fix than a diffuse slowdown.
- Always decompose the rate by statement. A 9% rate spread evenly across hundreds of statements (cache cold after a restart, undersized instance) needs a capacity or configuration fix. A 9% rate concentrated in one statement needs a query or index fix. The gauge is the alarm; the per-statement breakdown is the diagnosis.
- The 100ms line is a convention, not physics. For this OLTP API, 100ms is a sensible “slow” cut-off. A reporting database where 100ms queries are normal would want the threshold higher, otherwise the gauge reads permanently red. Tune it to your workload in the Sensitivity tab.
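Decomposing the rate by statement, as the second takeaway recommends, amounts to ranking slow statements by their share of slow calls. The sketch below is one hedged way to do that; the function names, the 50% "concentrated" cut-off, and the sample workloads are assumptions made for illustration and are not part of the product.

```python
def slow_call_shares(window: dict[str, tuple[int, float]],
                     slow_ms: float = 100.0) -> list[tuple[str, float]]:
    """Rank slow statements by their share of slow calls.

    `window` maps a statement label to (calls in window, windowed mean in ms).
    """
    slow = {s: calls for s, (calls, mean_ms) in window.items() if mean_ms > slow_ms}
    total_slow = sum(slow.values())
    if total_slow == 0:
        return []
    shares = [(s, calls / total_slow) for s, calls in slow.items()]
    return sorted(shares, key=lambda item: item[1], reverse=True)


def looks_concentrated(shares: list[tuple[str, float]],
                       top_share: float = 0.5) -> bool:
    """True when one statement carries at least `top_share` of slow calls.

    The 0.5 default is an arbitrary illustrative cut-off.
    """
    return bool(shares) and shares[0][1] >= top_share


# Hypothetical window 1: one high-volume lookup dominates the slow calls.
concentrated = {
    "lookup_by_key": (90_000, 130.0),
    "rare_report": (10_000, 400.0),
    "fast_traffic": (900_000, 4.0),
}
# Hypothetical window 2: 200 statements each slightly over the line.
diffuse = {f"stmt_{i}": (500, 110.0) for i in range(200)}

concentrated_shares = slow_call_shares(concentrated)
diffuse_shares = slow_call_shares(diffuse)
print(concentrated_shares[0], looks_concentrated(concentrated_shares))
print(diffuse_shares[0], looks_concentrated(diffuse_shares))
```

A concentrated result points toward a query or index fix on the top statement; a diffuse one points toward capacity or configuration.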
Sibling cards to read alongside
| Card | Why pair it with Slow-Query Rate | What the combination tells you |
|---|---|---|
| Query Latency p99 (ms) | The depth of the tail vs the breadth of the slowness. | High rate plus high p99 equals a broad regression; low rate plus high p99 equals a narrow tail outlier. |
| Query Latency p95 (ms) | The “most users” tail. | If the slow-query rate is up and p95 moved, the slowness reaches a large share of traffic. |
| Query Latency p50 (ms) | The median. | Rate up but p50 flat means the slow statements are a distinct slice, not the whole workload. |
| Top 10 Slowest Queries | Names the statements behind the rate. | The first drill-down when the gauge spikes: which shapes are slow. |
| Buffer Cache Hit Rate % | Cache misses turn fast queries slow. | Rate up plus cache-hit down equals a cold cache or undersized shared_buffers, not a query bug. |
| Queries per Second (live) | The denominator context. | A spike in QPS that pushes the rate up may simply be a traffic surge overwhelming the instance. |
| Idle-in-Transaction Backends | Lock contention slows many statements at once. | Rate up plus idle-in-tx backends equals contention dragging unrelated queries. |
| PostgreSQL Health Score | The composite that weights latency health. | A sustained high slow-query rate pulls the composite down. |
Reconciling against the source
Where to look in PostgreSQL’s own tooling:
- pg_stat_statements is the source. To see which shapes are over the 100ms line right now, filter the view on mean_exec_time > 100 and order by mean_exec_time. Note this orders by lifetime cumulative means, not the windowed delta the card uses, so the raw list will differ from the gauge’s window.
- log_min_duration_statement is the other native angle: set it (for example to 100ms) and PostgreSQL logs every statement that crosses the line to the server log, giving you per-execution truth that pg_stat_statements cannot.
- auto_explain with auto_explain.log_min_duration captures the plan of slow statements automatically, which is the fastest way to confirm a plan regression.
On managed services:
- Amazon RDS / Aurora: Performance Insights surfaces top SQL by load and per-statement latency, and the slow-query log can be enabled via the parameter group. Aurora’s per-execution data is finer than pg_stat_statements’ means.
- Google Cloud SQL: Query Insights ranks queries by latency and shows plan samples.
- Azure Database for PostgreSQL: Query Store plus the log_min_duration_statement server parameter.
Why our number may legitimately differ from a raw pg_stat_statements read:
| Reason | Direction | Why |
|---|---|---|
| Windowed vs cumulative | Raw WHERE mean_exec_time > 100 differs both ways | The raw filter uses lifetime means; the card uses a rolling 15-minute delta, so a statement slow historically but fast now is excluded by the card and included by the raw query. |
| Mean-vs-execution threshold | Card may under-count slow runs | The 100ms test is on the windowed mean, not per execution; a statement averaging 95ms with some 200ms runs is not counted slow. A per-execution log (log_min_duration_statement) catches those. |
| Call-weighting | Neither is wrong | The card weights by calls; a simple count of slow statement shapes would look very different and is less operationally meaningful. |
| Reset timing | Transient | A pg_stat_statements_reset() mid-window is skipped by the card but zeroes the raw view. |
| track = top | Card may miss nested time | With track = top, time spent in functions and triggers is attributed to the top-level statement. |
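The first and fourth rows of the table above (windowed vs cumulative, and reset timing) can be illustrated with a small sketch. The tuple layout and sample counters are hypothetical, and treating backwards-moving counters as a reset is an assumption about how a skip might be detected, not a description of the card's internals.

```python
from typing import Optional

# (calls, total_exec_time in ms), both cumulative since the last reset
Counters = tuple[int, float]


def lifetime_mean(c: Counters) -> float:
    """Mean over the statement's whole history, as a raw view read reports."""
    calls, total_ms = c
    return total_ms / calls if calls else 0.0


def windowed_mean(before: Counters, after: Counters) -> Optional[float]:
    """Mean over the window only, or None when the window is unusable."""
    d_calls = after[0] - before[0]
    d_ms = after[1] - before[1]
    if d_calls < 0 or d_ms < 0:
        return None  # counters went backwards: assume a reset and skip
    if d_calls == 0:
        return None  # statement did not run in the window
    return d_ms / d_calls


# Hypothetical statement: slow historically (150ms mean), fast in the window.
before = (1_000_000, 150_000_000.0)
after = (1_050_000, 151_000_000.0)
# Hypothetical snapshot taken shortly after the counters were reset.
after_reset = (1_200, 30_000.0)

print(lifetime_mean(after))               # still above 100ms
print(windowed_mean(before, after))       # well below 100ms
print(windowed_mean(before, after_reset)) # None: window skipped
```

Here the raw lifetime mean keeps the statement above the 100ms line while its windowed mean is 20ms, so a raw filter would list it and the card would not.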
| Card | Expected relationship | What causes divergence |
|---|---|---|
| Slow Queries During Checkout Window (5m) | A high overall slow-query rate during a checkout window should co-occur. | If the rate is high but checkout is unaffected, the slow statements are on a non-customer path. |
| datadog.error-rate | A query that times out at the app tier may register slow here and as an error there. | App-tier timeouts cut the query off; PostgreSQL may still record the partial execution time. |
Known limitations / FAQs
Why is the 100ms threshold fixed? My reporting queries are always over 100ms.
The 100ms cut-off is a sensible default for OLTP workloads but it is not fixed: adjust it per profile in the Sensitivity tab. On a reporting or analytics database where multi-hundred-millisecond queries are normal and expected, leaving the threshold at 100ms makes the gauge read permanently red and useless. Raise it to a number that represents “slow for this workload”, for example 1,000ms. The point of the gauge is deviation from your normal, not a universal speed limit.
The slow-query rate is high but my p50 and p95 look fine. How can both be true?
Easily. The slow-query rate is call-weighted across all statement shapes, while p50 and p95 describe the latency distribution of executions. A modest number of high-volume statements just over 100ms can lift the rate while the bulk of executions (which set p50 and p95) stay fast. This pattern usually means one or two specific statement shapes regressed; decompose the rate by statement to find them.
Does this count queries that failed or timed out?
No. Failed queries are tracked on Query Error Rate %. However, a query that the application cancelled on its own timeout may still have accumulated execution time in pg_stat_statements before the cancel, so a wave of app-tier timeouts can show up here as elevated slowness as well as on the error card. When both move together, suspect statements slow enough to breach the client timeout.
pg_stat_statements is not installed. Can I still get this card?
Not directly: the windowed rate depends on the extension. Install it (shared_preload_libraries plus CREATE EXTENSION pg_stat_statements), or on a managed service enable it via the parameter group. As a fallback, Vortex IQ can approximate the rate from log_min_duration_statement server-log volume where the provider exposes logs, but the native extension gives the cleanest reading.
A single statement is driving the whole rate. Should the gauge really go red for one query?
Yes, if that one statement carries a large share of your traffic. The rate is call-weighted precisely so that a high-volume statement that regressed dominates the gauge, because that is what actually degrades user experience. Conversely a rarely-run statement, however slow, barely moves the rate. The gauge measures impact on the workload, not the number of distinct slow shapes.
How is this different from the latency percentile cards?
The percentile cards measure how slow: p50 is the median experience, p99 is the worst realistic experience. This card measures how much: what fraction of all calls crossed the slow line. You can have a high p99 with a low rate (one narrow tail) or a high rate with a contained p99 (lots of queries just over 100ms but nothing catastrophic). Read them together: the percentiles give depth, the rate gives breadth.
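The breadth-versus-depth distinction can be made concrete with two synthetic workloads. This is a simplified sketch: each group stands for a statement whose executions all take the same time, percentiles use a plain nearest-rank rule, and every number is invented for illustration. The actual percentile cards may be computed differently.

```python
def percentile(groups: list[tuple[int, float]], p: int) -> float:
    """Nearest-rank percentile over (calls, latency_ms) groups."""
    total = sum(calls for calls, _ in groups)
    if total == 0:
        raise ValueError("empty workload")
    rank = (p * total + 99) // 100  # integer ceil(p * total / 100)
    seen = 0
    for calls, latency in sorted(groups, key=lambda g: g[1]):
        seen += calls
        if seen >= rank:
            return latency
    return max(latency for _, latency in groups)


def slow_rate(groups: list[tuple[int, float]], slow_ms: float = 100.0) -> float:
    """Percentage of calls belonging to groups slower than `slow_ms`."""
    total = sum(calls for calls, _ in groups)
    slow = sum(calls for calls, latency in groups if latency > slow_ms)
    return 100.0 * slow / total if total else 0.0


# Narrow tail: a small slice of calls is very slow (depth, little breadth).
narrow = [(985_000, 5.0), (15_000, 2000.0)]
# Broad regression: many calls sit just over the line (breadth, little depth).
broad = [(850_000, 5.0), (150_000, 110.0)]

for name, workload in (("narrow", narrow), ("broad", broad)):
    print(name, slow_rate(workload), percentile(workload, 50), percentile(workload, 99))
```

The narrow workload shows a 1.5% rate with a 2,000ms p99; the broad one shows a 15% rate with a p99 of only 110ms. The median is 5ms in both, which is why the two families of cards need to be read together.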
The gauge swings up and down within a single shift. Is it broken?
Probably not. The 15-minute window smooths transients but a workload that genuinely oscillates (batch jobs starting and stopping, traffic bursts, cache warming after a restart) will move the rate legitimately. If the swings correlate with known events, that is the gauge working. If they are random and frequent, suspect an unstable plan that flips between a good and a bad execution path; auto_explain on the suspect statement will confirm.