At a glance
A single 0 to 100 composite that answers “is my PostgreSQL instance healthy right now?” without making a platform lead read five dashboards. It folds the five things that actually take a production database down: query latency, query errors, replication health, autovacuum currency, and connection-pool headroom. Each factor scores 0 to 1; the score is their product, scaled to 100. Because it is multiplicative, any single factor collapsing drags the whole score down hard, which is the point: one broken pillar (a stalled standby, a wraparound-risk autovacuum) is an emergency even if everything else is green. Read it as a traffic light: green is steady-state, amber means one pillar is wobbling, red means act now.
| Data source | Composite: p95 latency healthy x error-free x replication-healthy x autovacuum-current x pool-headroom. Each factor is computed from its own underlying source (pg_stat_statements for latency, pg_stat_database for errors, pg_stat_replication for replication, pg_stat_user_tables / pg_stat_all_tables for autovacuum age, and pg_stat_activity against max_connections for pool headroom). |
| Metric basis | An unweighted multiplicative composite, NOT an average. The five factors are each normalised to a 0 to 1 health fraction, multiplied together as equal terms, and scaled to 100. Multiplicative scoring means one factor at 0.4 caps the whole score at 40 even if the other four are perfect. |
| The five factors | (1) p95 latency healthy, scored against the p95 latency threshold; (2) error-free, from Query Error Rate %; (3) replication-healthy, from Replication Lag (seconds) and standby reachability; (4) autovacuum-current, from Oldest Autovacuum Age (hours); (5) pool-headroom, from Connection Pool Saturation %. |
| Aggregation window | RT/7D, a real-time current score plus a 7-day trend line so a slow decline (creeping bloat, a standby falling steadily behind) is visible before it becomes an outage. |
| What does NOT count | Disk usage and backup currency are tracked as their own hero cards and are not folded into this composite by default; a database can score 95 here while being one day from a full disk, so always read this alongside Database Disk Usage % and Last Successful Backup (hours ago). |
| Time window | RT/7D (real-time score with a 7-day trend) |
| Alert trigger | <70, a composite below 70 raises the sensitivity alert and signals that at least one pillar is degraded enough to need attention. |
| Roles | owner, engineering, operations |
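
The traffic-light reading can be sketched as a simple banding function. This is a minimal sketch, not the product's implementation: the 70 cut-off is the alert trigger stated above, while the amber/green boundary of 90 and the name `traffic_light` are assumptions made for illustration.

```python
ALERT_THRESHOLD = 70  # from the card: a composite below 70 raises the alert
GREEN_FLOOR = 90      # hypothetical amber/green boundary, not a documented value


def traffic_light(score: float) -> str:
    """Map a 0 to 100 composite onto a green / amber / red reading."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be within 0..100, got {score}")
    if score < ALERT_THRESHOLD:
        return "red"
    if score < GREEN_FLOOR:
        return "amber"
    return "green"


print(traffic_light(99), traffic_light(75), traffic_light(32))
```

If the alert threshold is retuned per profile, `ALERT_THRESHOLD` would move with it; the amber band here is only a placeholder.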
Calculation
The score is the product of five independent health fractions, each in the range 0 to 1, scaled to 100:
- p95 latency factor reads p95 query execution time from `pg_stat_statements`. At or below the healthy threshold it scores 1.0; it decays towards 0 as latency climbs past the alert level (200ms by default).
- error-free factor reads the rolling query error rate from `pg_stat_database` (rollbacks and failed statements over total). Zero errors scores 1.0; it decays as the rate rises past the 1% alert level.
- replication factor reads `pg_stat_replication` on the primary. A reachable standby with sub-second lag scores 1.0; lag growth decays it, and an unreachable / BROKEN standby collapses it towards 0.
- autovacuum factor reads the maximum time since last (auto)vacuum across user tables. Recently vacuumed scores 1.0; it decays as the oldest table ages past 24 hours, reflecting accumulating bloat and transaction-ID wraparound risk.
- pool-headroom factor reads active plus idle-in-transaction backends against `max_connections`. Comfortable headroom scores 1.0; it decays towards 0 as saturation approaches 100%.
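
The calculation above can be sketched in a few lines. The product-and-scale step follows the description directly; the shape of the decay curve is not specified in this document, so `linear_decay` and its `zero_at` parameter are hypothetical stand-ins, and the 600ms zero point is an invented value used only to make the example run.

```python
import math
from typing import Mapping


def linear_decay(reading: float, healthy_until: float, zero_at: float) -> float:
    """Hypothetical curve: 1.0 up to healthy_until, falling linearly to 0.0 at zero_at."""
    if zero_at <= healthy_until:
        raise ValueError("zero_at must be greater than healthy_until")
    if reading <= healthy_until:
        return 1.0
    if reading >= zero_at:
        return 0.0
    return (zero_at - reading) / (zero_at - healthy_until)


def health_score(factors: Mapping[str, float]) -> float:
    """Multiply the health fractions together and scale the product to 0..100."""
    for name, value in factors.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be within 0..1, got {value}")
    return 100.0 * math.prod(factors.values())


# Illustrative readings only: 200ms is the default threshold, 600ms is assumed.
factors = {
    "p95_latency": linear_decay(300.0, healthy_until=200.0, zero_at=600.0),
    "error_free": 1.0,
    "replication": 1.0,
    "autovacuum": 1.0,
    "pool_headroom": 1.0,
}
print(round(health_score(factors)))  # prints 75
```

With four factors at 1.0, the composite equals the one degraded factor scaled to 100, which is the capping behaviour the multiplicative form is designed to give.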
Worked example
A platform team runs a primary PostgreSQL 16 instance with one streaming standby behind PgBouncer, serving an internal analytics and orders API. Two snapshots, taken on 22 Apr 26.

Snapshot A, 09:00 BST, steady state:

| Factor | Underlying reading | Factor value |
|---|---|---|
| p95 latency | 88ms (threshold 200ms) | 1.00 |
| error-free | 0.04% error rate | 0.99 |
| replication | standby reachable, lag 0.3s | 1.00 |
| autovacuum | oldest table vacuumed 2h ago | 1.00 |
| pool headroom | saturation 41% | 1.00 |
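
Composite score: 99 (green).

Snapshot B, the same day, after a large batch import started at 13:30:

| Factor | Underlying reading | Factor value |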
|---|---|---|
| p95 latency | 240ms (over the 200ms threshold) | 0.82 |
| error-free | 0.05% error rate | 0.99 |
| replication | standby reachable, lag 14s (over 10s alert) | 0.55 |
| autovacuum | oldest table vacuumed 27h ago | 0.78 |
| pool headroom | saturation 72% | 0.92 |

Composite score: 32 (red). The <70 alert has fired.
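
Both composites can be reproduced by multiplying the Factor value column of the two tables above. The snippet below is a minimal check of that arithmetic; the variable and function names are illustrative.

```python
import math

# Factor values copied from the two tables above.
steady_state = {"p95_latency": 1.00, "error_free": 0.99, "replication": 1.00,
                "autovacuum": 1.00, "pool_headroom": 1.00}
during_batch = {"p95_latency": 0.82, "error_free": 0.99, "replication": 0.55,
                "autovacuum": 0.78, "pool_headroom": 0.92}


def composite(factors):
    """Product of the factor values, scaled to 100 and rounded for display."""
    return round(100 * math.prod(factors.values()))


score_steady = composite(steady_state)
score_batch = composite(during_batch)
print(score_steady, score_batch)  # prints 99 32
```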
The story the score tells: a large batch import kicked off at 13:30. It generated a heavy volume of WAL, which the standby is struggling to apply (lag jumped to 14s, the single biggest drag on the score at 0.55). The same write load is generating dead tuples faster than autovacuum can keep up (oldest table now 27h, factor 0.78), and the extra query volume has pushed p95 latency just over threshold (0.82). No single factor is catastrophic on its own, but the multiplicative composite correctly compounds three simultaneous wobbles into a red 32.
The platform team’s read, in order of leverage:
- Replication is the dominant drag (0.55). A 14-second standby lag means a failover could lose up to 14 seconds of writes, and read replicas are serving stale data in the meantime. Open Replication Lag (seconds) and confirm whether the standby is apply-bound (catching up slowly) or network-bound.
- Autovacuum is falling behind (0.78). The batch created bloat faster than autovacuum reclaimed it. Check Top Tables by Dead Tuples and consider a manual `VACUUM` on the hot table or tuning `autovacuum_vacuum_cost_limit`.
- Latency over threshold is a symptom, not a cause (0.82). It will self-correct once the batch finishes and the standby catches up. Do not chase it independently.
- A low score says a pillar is broken; the drill-down tells you which one. Never act on the composite alone. Open the factor breakdown, find the factor nearest 0, and start there. The lowest factor is your highest-leverage fix.
- The score is multiplicative on purpose. One collapsed pillar should cap the whole score, because in production one collapsed pillar (a dead standby, a wraparound-risk table) is an emergency regardless of how healthy everything else is.
- Two cards live outside the composite. Disk usage and backup currency are deliberately excluded so they cannot be masked. A 99 health score with a disk at 94% full is still an imminent outage; always read this hero card next to Database Disk Usage % and the backup card.
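
The "find the factor nearest 0" step amounts to sorting the breakdown in ascending order. A minimal sketch using the degraded snapshot's factor values; `rank_by_leverage` is an illustrative name, not a product API.

```python
def rank_by_leverage(factors):
    """Return (name, value) pairs ordered from the lowest factor to the highest."""
    return sorted(factors.items(), key=lambda item: item[1])


breakdown = {"p95_latency": 0.82, "error_free": 0.99, "replication": 0.55,
             "autovacuum": 0.78, "pool_headroom": 0.92}

ranked = rank_by_leverage(breakdown)
worst_name, worst_value = ranked[0]
print(worst_name, worst_value)  # prints replication 0.55
```

The first three entries of `ranked` come out as replication, autovacuum, then latency, which is the same order as the platform team's read above.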
Sibling cards to reference together
| Card | Why pair it with the Health Score | What the combination tells you |
|---|---|---|
| Query Latency p95 (ms) | The latency factor in the composite. | When the score drops and this is the lowest factor, the database is slow but functioning. |
| Query Error Rate % | The error-free factor. | A score drop driven by this factor means statements are failing, not just slowing. |
| Replication Lag (seconds) | The replication factor. | The most common dominant drag; a stalled standby collapses this factor and the score. |
| Oldest Autovacuum Age (hours) | The autovacuum factor. | A slow score decline over the 7-day trend usually means creeping autovacuum starvation. |
| Connection Pool Saturation % | The pool-headroom factor. | A sharp score drop at peak traffic often traces to saturation eroding this factor. |
| Database Disk Usage % | Deliberately excluded from the composite. | A high score with a near-full disk is still an emergency; always read together. |
| Last Successful Backup (hours ago) | Also excluded from the composite. | Health and recoverability are separate concerns; a healthy unrecoverable database is a risk. |
| Deadlocks (last 5m) | An acute event that shows up in the error-free factor. | A deadlock burst dents the error-free factor and nudges the score down. |
Reconciling against the source
Where to look in PostgreSQL’s own tooling:

There is no single native command that produces this score; it is a Vortex IQ composite. To reconcile it, verify each factor against its own source:
- p95 latency: `SELECT query, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC;` (requires the `pg_stat_statements` extension). The view reports mean rather than percentile times, so treat this as an approximate cross-check.
- error rate: `SELECT datname, xact_commit, xact_rollback FROM pg_stat_database;` for the rollback ratio.
- replication: `SELECT client_addr, state, replay_lag FROM pg_stat_replication;` on the primary.
- autovacuum age: `SELECT relname, last_autovacuum, n_dead_tup FROM pg_stat_user_tables ORDER BY last_autovacuum NULLS FIRST;`
- pool headroom: `SELECT count(*) FROM pg_stat_activity;` against `SHOW max_connections;`

Managed-service console: on Amazon RDS / Aurora, Performance Insights plus the CloudWatch metrics for `ReadLatency`, `ReplicaLag`, `DatabaseConnections` and `MaximumUsedTransactionIDs`. On Cloud SQL, the system insights and the `replication/replica_lag`, `database/postgresql/num_backends`, and transaction-ID metrics. Each maps to one factor.

Why our number may legitimately differ from a manual factor-by-factor reconstruction:

| Reason | Direction | Why |
|---|---|---|
| Multiplicative vs intuitive | Score lower than expected | Hand-reconciling, people instinctively average the factors; the composite multiplies, so the score is always at or below the mean. This is by design. |
| Snapshot timing | Variable | The real-time score samples all five factors at the same instant; running the five queries by hand minutes apart can show a different blend. |
| Threshold tuning | Variable | Each factor’s decay curve uses the profile’s configured alert thresholds. If you tuned the p95 or saturation thresholds, the factor values shift accordingly. |
| Extension availability | Latency factor may default | If pg_stat_statements is not installed, the latency factor falls back to a coarser signal; install the extension for the precise factor. |
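
The "Multiplicative vs intuitive" row is easiest to see with numbers. A short sketch using the factor values from the worked example's degraded snapshot shows how far an instinctive average sits above the actual composite:

```python
import math
from statistics import mean

# Factor values from the worked example's degraded snapshot.
factors = [0.82, 0.99, 0.55, 0.78, 0.92]

averaged = round(100 * mean(factors))         # what a hand-reconciler tends to expect
multiplied = round(100 * math.prod(factors))  # what the composite reports
print(averaged, multiplied)  # prints 81 32
```

Because every factor is at most 1, the product can never exceed the lowest factor, and so never exceeds the mean.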

| Card | Expected relationship | What causes divergence |
|---|---|---|
| `pg_replication_lag_seconds` | A lag spike should pull the replication factor and the composite down together. | If the score dropped but lag is fine, a different factor is the cause; check the breakdown. |
| `pg_pool_saturation` | Saturation peaks at traffic peaks should dent pool-headroom and the score. | Score steady through a saturation peak = the peak stayed below the decay knee. |
| Application error rate (ecom / app connector) | A sustained low health score usually corresponds to elevated app-tier errors. | App errors with a green score = the fault is above the database layer. |
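
When reconciling by hand, the raw counters returned by the queries in this section still have to be turned into the percentages the factors are scored on. The helpers below are a sketch of that step under stated assumptions: the function names are illustrative, the sample counts are invented, and the product's exact formula (for example, how failed statements are counted) is not documented here.

```python
def rollback_ratio_pct(xact_commit: int, xact_rollback: int) -> float:
    """Rollbacks as a percentage of all finished transactions."""
    total = xact_commit + xact_rollback
    if total == 0:
        return 0.0  # no traffic in the interval, so nothing failed
    return 100.0 * xact_rollback / total


def pool_saturation_pct(backends: int, max_connections: int) -> float:
    """Connected backends as a percentage of max_connections."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return 100.0 * backends / max_connections


# Invented sample counts, chosen to land on the steady-state readings.
error_pct = rollback_ratio_pct(xact_commit=99_960, xact_rollback=40)
saturation_pct = pool_saturation_pct(backends=82, max_connections=200)
print(error_pct, saturation_pct)
```

`xact_commit` and `xact_rollback` are cumulative since the last statistics reset, so a rolling rate needs the difference between two samples rather than the raw totals.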
Known limitations / FAQs
Why is my score 55 when four of the five factors are perfectly healthy?
Because the score multiplies the factors rather than averaging them. One factor at 0.55 (a 14-second standby lag, say) caps the whole composite at that level no matter how green the other four are: 1.0 x 1.0 x 1.0 x 1.0 x 0.55 is still only 0.55. This is intentional. In production, one broken pillar is an emergency, and an averaging score would let four healthy pillars hide it. Open the factor breakdown, find the factor nearest 0, and fix that.
Does the score include disk usage and backups?
No, deliberately. Disk usage and backup currency are tracked as their own hero cards and kept out of the composite so they can never be masked by a healthy score. A database can read 98 here while sitting at 95% disk or with a 4-day-old backup. Always read this card alongside Database Disk Usage % and Last Successful Backup (hours ago).
My score is amber (around 75) but everything feels fine. Should I worry?
Amber usually means one factor is sitting in its decay zone without having fully collapsed: a standby a few seconds behind, autovacuum a little late, latency just over threshold. It is not an emergency, but it is the moment to look, because the multiplicative form means a second wobble while you are already amber will push you red fast. Treat amber as “investigate before the next traffic peak”, not “ignore”.
The score has been slowly declining over the 7-day trend with no obvious incident. What causes that?
A slow decline almost always means creeping autovacuum starvation or a standby gradually falling behind under steady write growth. These are exactly the slow-bleed problems the 7-day window is there to catch. Open Oldest Autovacuum Age (hours) and Replication Lag (seconds) and look for a matching upward creep.
I do not run a standby. Does the replication factor break the score?
No. On a single-instance deployment with no configured standby, the replication factor is set to neutral (1.0) so it neither helps nor harms the composite, since there is nothing to be unhealthy. The factor only becomes active once `pg_stat_replication` reports a configured standby.
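The neutral-factor rule can be sketched as follows. Only the two end states come from this document (no standby scores 1.0, sub-second lag scores 1.0); the linear decay, the 30-second zero point, scoring an unreachable standby as exactly 0.0, and the name `replication_factor` are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def replication_factor(standby_lags_s: Sequence[Optional[float]],
                       healthy_until_s: float = 1.0,
                       zero_at_s: float = 30.0) -> float:
    """One entry per standby: lag in seconds, or None if it is unreachable."""
    if not standby_lags_s:
        return 1.0  # neutral: no standby configured, nothing to be unhealthy
    worst = 1.0
    for lag in standby_lags_s:
        if lag is None:
            return 0.0  # assumption: an unreachable standby collapses the factor
        if lag <= healthy_until_s:
            factor = 1.0
        elif lag >= zero_at_s:
            factor = 0.0
        else:
            # Hypothetical linear decay between the two bounds.
            factor = (zero_at_s - lag) / (zero_at_s - healthy_until_s)
        worst = min(worst, factor)
    return worst


print(replication_factor([]), replication_factor([0.3]), replication_factor([None]))
```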
Does pg_stat_statements need to be installed for the score to work?
The latency factor is most accurate with pg_stat_statements enabled, since that is where per-statement execution times live. Without it, the factor falls back to a coarser latency signal and the score is slightly less precise. The other four factors do not depend on it. Installing the extension (it ships with PostgreSQL and is enabled via shared_preload_libraries) is strongly recommended for an accurate score.
Can I change the weighting or the alert threshold?
The <70 alert threshold is configurable per profile in the Sensitivity tab. The per-factor decay curves follow the same thresholds you set on the underlying cards (p95, saturation, lag, error rate, autovacuum age), so tuning those cards retunes the composite consistently. There is no separate per-factor weight to set: the factors are equal multiplicands by design, which is what makes a single collapse honest.