At a glance
This gauge reports what fraction of PostgreSQL’s page reads were served from its own shared-buffers cache rather than read from disk. It is the share ofblks_hitin the total ofblks_hit + blks_read. A high number means your working set fits in memory and queries rarely wait on storage; a number sliding below 95% on a production OLTP database is the classic sign thatshared_buffersis undersized for your working set, or that a query pattern has started scanning more data than the cache can hold. For a DBA this is one of the cleanest early-warning gauges you have: it bends before latency does, so it gives you a head start on a memory-pressure problem before users feel it.
| What it tracks | The ratio blks_hit / (blks_hit + blks_read), expressed as a percentage: the proportion of buffer reads satisfied from shared memory rather than the operating system / disk layer. |
| Data source | blks_hit and blks_read from pg_stat_database (per-database cumulative counters). The card reads the delta between polls so it reflects the recent window, not the all-time average since the stats were last reset. |
| Time window | RT/1h (a live reading plus a one-hour rolling view to smooth short bursts). |
| Alert trigger | <95%. On a production OLTP workload, a sustained hit rate under 95% is the threshold where disk reads start to materially affect latency. Analytics-heavy workloads can run lower by design; tune the threshold per instance. |
| Roles | dba, platform, sre |
Calculation
The hit rate is a straightforward ratio of two cumulative counters:blks_hit counts buffer reads that found the page already in PostgreSQL’s shared-buffers cache. blks_read counts reads that did not, and had to go to the storage layer. Crucially, a blks_read “miss” in PostgreSQL terms does not always mean a physical disk seek: the page may still be sitting in the operating system’s page cache, so PostgreSQL records a miss while the OS serves it from RAM. This is why a slightly-below-95% PostgreSQL hit rate on a Linux host can still feel fast, the OS is acting as a second cache tier. The gauge measures PostgreSQL’s own cache only.
Two computation details matter for reading the number correctly:
-
Delta, not lifetime. The raw counters in
pg_stat_databaseaccumulate from the last stats reset, so the lifetime ratio is dominated by history and barely moves. The card computes the delta between polls, so a fresh problem (a new full-table-scan query, a cold cache after restart) shows up promptly instead of being buried under months of good history. -
Per-database scope.
pg_stat_databasereports per database. The card surfaces the active database(s); a rarely-touched database with a poor ratio over a tiny sample is not the same signal as your busy OLTP database dipping.
pg_statio_user_tables (heap_blks_hit / heap_blks_read per table), which tells you which table is causing a global dip, the natural drill-down when this gauge moves.
Worked example
A platform team runs PostgreSQL 15 backing a product-catalogue API on a host with 32 GB RAM andshared_buffers set to 8 GB. The hit rate normally sits at 99.4%. Snapshot taken on 27 Apr 26 at 13:20 BST after a release that added a new search feature.
| Window | blks_hit | blks_read | Hit rate | State |
|---|---|---|---|---|
| 12:00 to 13:00 (pre-release) | 1.81B | 10.9M | 99.40% | healthy |
| 13:05 to 13:20 (post-release) | 0.42B | 31.5M | 93.0% | BREACH |
- Disk reads jumped sharply.
blks_readper minute roughly tripled whileblks_hitper minute fell. Something is now reading pages that are not in the cache, and reading a lot of them. - Latency is starting to follow, but has not spiked yet. Cross-reference Query Latency p95 (ms); it has crept from 22ms to 41ms. The hit-rate gauge bent first, which is exactly the head start this card is meant to give.
- The timing points at the release. The dip starts within minutes of the deploy that added search. The prime suspect is a new query doing a sequential scan over a large table that does not fit comfortably alongside the existing working set in 8 GB of shared buffers.
shared_buffers at a missing index hides the symptom and wastes RAM.
Three takeaways:
- Hit rate bends before latency breaks. That is the value of this gauge. By the time p95 latency is obviously bad, customers are already affected; a hit-rate dip gives you minutes-to-hours of warning to find the cause.
- A sudden dip is usually a query, not a capacity wall. When the rate falls right after a change, suspect a new full-table scan thrashing the cache, not a working set that quietly outgrew memory. The table-level drill-down (
pg_statio_user_tables) finds it fast. - 95% is an OLTP threshold, not a universal law. A reporting or analytics database that deliberately scans large cold datasets will run lower and that is fine. Set the threshold to your workload; do not chase 99% on a database whose job is to read data that will never fit in cache.
Sibling cards
| Card | Why pair it with Buffer Cache Hit Rate | What the combination tells you |
|---|---|---|
| Query Latency p95 (ms) | The downstream symptom of cache misses. | Hit rate dips first; if latency follows, the misses are reaching real disk. |
| Query Latency p99 (ms) | The tail most sensitive to disk reads. | A small hit-rate dip can blow out p99 long before p95 moves. |
| Slow-Query Rate % | Catches the query causing the thrash. | A new slow query plus a hit-rate dip equals a missing index thrashing the cache. |
| Memory Usage % | Distinguishes “cache too small” from “host out of RAM”. | Low hit rate with memory headroom equals raise shared_buffers; no headroom equals scale the host. |
| Top Tables by Dead Tuples | Bloat wastes cache on dead pages. | High bloat alongside a dip means the cache is holding dead weight. |
| Queries per Second (live) | Separates a load surge from a query regression. | Hit rate down with flat QPS equals a query change, not more traffic. |
| PostgreSQL Health Score | The composite that reflects cache efficiency. | A sustained dip nudges the composite before any alert-class card fires. |
Reconciling against the source
Where to look in PostgreSQL’s own tooling:Compute it directly:Why our number may legitimately differ from PostgreSQL’s own view:SELECT datname, blks_hit, blks_read, round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) AS hit_pct FROM pg_stat_database WHERE datname = current_database();. Note this is the lifetime ratio since the last reset, not the recent delta the card shows. Drill to the table level withpg_statio_user_tables(heap_blks_hit/heap_blks_read) and to indexes withpg_statio_user_indexesto find the source of a dip. Inspect a specific query withEXPLAIN (ANALYZE, BUFFERS) <query>;; theBuffers: shared hit=... read=...line is the per-query version of this metric. On a managed service, the provider exposes a related metric: RDS / Aurora surfaceBufferCacheHitRatio, Cloud SQL reports cache-hit ratios in its insights, and Azure exposes a similar buffer-cache metric. These align with the gauge but may be sampled and scoped differently.
| Reason | Direction | Why |
|---|---|---|
| Delta vs lifetime | Card more responsive | The card uses the between-poll delta; a hand-run pg_stat_database query shows the lifetime average since reset, which moves slowly and looks healthier during a fresh dip. |
| OS page cache | Card may look worse than felt latency | A blks_read miss may still be served from the Linux page cache in RAM; PostgreSQL counts it as a miss but the read is fast. The gauge measures PostgreSQL’s cache only. |
| Stats reset | Card resets too | pg_stat_reset() zeroes the counters; the lifetime ratio restarts, and the card’s delta is unaffected but the source query looks different. |
| Per-database scope | Possible mismatch | The card focuses on the active/busy database; a SELECT across all rows of pg_stat_database blends quiet databases in. |
| Managed-service definition | Slight difference | Provider cache-hit metrics may include OS-level or instance-level caching, so a provider number can read higher than PostgreSQL’s own blks_hit ratio. |
Known limitations / FAQs
My hit rate is 94% but queries feel fast. Is the alert a false alarm? Possibly, and the usual reason is the operating-system page cache. A PostgreSQLblks_read miss can still be served from the OS page cache in RAM, so the read is fast even though PostgreSQL counts it as a miss. The gauge measures PostgreSQL’s own shared-buffers cache, not the OS tier. If your latency cards are healthy and 94% is your normal baseline, lower the threshold for this instance in the Sensitivity tab rather than chasing it.
Should I just keep increasing shared_buffers until the rate hits 99%?
No. The common guidance is around 25% of host RAM for shared_buffers, because PostgreSQL relies on the OS page cache for the rest and oversizing shared buffers can hurt by double-caching and starving the OS. If the dip is caused by a single bad query thrashing the cache, more shared_buffers hides the problem; fix the query (usually a missing index) first. Only grow shared_buffers after confirming the genuine working set exceeds the current cache.
The rate dropped right after a deploy. Where do I start?
At the table level. Run the pg_statio_user_tables drill-down (ordered by heap_blks_read) to find which table is now generating the disk reads, then EXPLAIN (ANALYZE, BUFFERS) the new query against it. A sudden dip after a change is almost always a new sequential scan over a large table for want of an index, not a capacity problem.
Why does the card show a different number from my SELECT on pg_stat_database?
Because the card uses the recent delta between polls and the raw pg_stat_database query shows the lifetime average since the last stats reset. The lifetime ratio is dominated by history and moves very slowly, so during a fresh dip it can still read 99% while the card correctly shows the current 93%. The delta is what makes the card a useful early warning.
Does autovacuum or bloat affect the hit rate?
Indirectly, yes. A bloated table holds many dead tuples spread across more pages, so scans touch more pages to find live rows, which wastes cache on dead weight and lowers the effective hit rate. If a dip coincides with high bloat, check Top Tables by Dead Tuples and Oldest Autovacuum Age (hours); cleaning up the table can recover both space and cache efficiency.
Is 95% the right threshold for an analytics or reporting database?
Often not. A database whose job is to scan large, cold datasets that will never fit in memory will run a lower hit rate by design, and that is correct behaviour, not a fault. The 95% default targets OLTP workloads where the working set should be hot. For analytics instances, set a lower, workload-appropriate threshold so you are alerted on genuine regressions rather than on the nature of the workload.
After a restart the rate is terrible. Is something wrong?
No, that is a cold cache. On restart shared_buffers is empty, so early reads are nearly all misses until the working set warms back up. The rate climbs as hot pages reload. This is expected; the delta-based reading recovers within minutes on a busy instance. If you restart often and the cold-start dip is noisy, consider pg_prewarm to reload known-hot relations on startup.