What this audit checks
Authentication & access
- Connection string resolves and the monitoring role can read pg_catalog (USAGE granted, pg_read_all_stats present)
- pg_stat_statements extension is installed and present in shared_preload_libraries (slow-query KPIs depend on it)
- sslmode=require is set on managed endpoints (RDS / Aurora / Crunchy / Heroku); plaintext flagged
- Monitoring role is read-only - no write or superuser privileges beyond stats access
Connection & availability
- Instance is reachable and pg_postmaster_start_time() returns a recent, stable uptime (no unplanned restart)
- Connection error count over 24h below the alert band (transient auth / network failures bounded)
- PostgreSQL major version is supported and not past end-of-life from pg_settings server_version
- max_connections from pg_settings sized with headroom against observed peak backend count
Query performance
- Query latency p95 within band - mean_exec_time aggregate from pg_stat_statements under threshold
- Query latency p99 tail not diverging from p95 (no long-tail lock contention)
- Slow-query rate (mean_exec_time > 100ms as share of total calls) under threshold
- Buffer cache hit rate (blks_hit / blks_hit + blks_read from pg_stat_database) above 95% on OLTP
Replication & lag
- Streaming replication lag in seconds (replay_lag from pg_stat_replication) under threshold
- WAL bytes pending (pg_wal_lsn_diff primary vs replay_lsn) not growing sustained - standby keeping up
- Every configured standby has state=streaming - none in catchup, broken, or stopped
- At least one healthy standby with sub-second lag exists for a safe promotion path
Storage & capacity
- Database disk usage percent under threshold with growth-rate runway before exhaustion
- Connection-pool saturation (active + idle-in-tx / max_connections) under threshold
- Idle-in-transaction backend count bounded - stuck transactions hold locks and bloat the pool
- Autovacuum current on hot tables - oldest n_dead_tup table not unvacuumed past threshold (wraparound risk)
Backups & durability
- Last successful base backup / WAL archive age within the recovery-point objective
- WAL archiving is active and not failing (archive_command succeeding, no pile-up)
- Point-in-time recovery window covers the required retention horizon
Cross-channel: revenue protection
- QPS spike with no concurrent ecom order spike (sibling = shopify / bigcommerce / adobe order rate flat) flagged as bot / scraper load
- Pool saturation > 90% during a commerce-sibling traffic burst (pool exhaustion = lost checkouts)
- Slow-query spike co-occurring with a sibling checkout-completion drop in the same 5-min window
- Inventory-table row drift vs sibling product_inventory count (oversell risk when DB is source of truth)
Severity thresholds
| Signal | Warn | Critical |
|---|---|---|
connection_error_rate | 1 | 5 |
query_p95_ms | 200 | 500 |
replication_lag_sec | 10 | 30 |
disk_usage_pct | 80 | 90 |
slow_query_count | 5 | 15 |
Data sources
GET postgresql://{host}:{port}/{database}#pg_settings- Instance config - version, max_connections, shared_buffers, server_versionGET postgresql://{host}:{port}/{database}#pg_stat_activity- Live backends, states, idle-in-transaction, application_name for pool saturationGET postgresql://{host}:{port}/{database}#pg_stat_statements- Normalised query stats - mean_exec_time, calls, total_exec_time for slow-query + p95/p99GET postgresql://{host}:{port}/{database}#pg_stat_replication- Standby state, replay_lag, WAL pending bytes for replication healthGET postgresql://{host}:{port}/{database}#pg_stat_database- Cache hit rate, deadlocks, connection errors per databaseGET postgresql://{host}:{port}/{database}#pg_stat_user_tables- Dead tuples + last autovacuum per table for bloat / starvation checks