Skip to main content
Nerve Centre KPIs · Audit Profile · Sentiment Settings PostgreSQL-specific health audit for production database instances. Answers six questions: (1) is stats access correctly provisioned (read-only role, pg_stat_statements installed, SSL on managed endpoints); (2) is the instance up and reachable with connection errors bounded; (3) is query latency within the p95 / p99 bands and is the slow-query population trending up; (4) are streaming standbys keeping up and is WAL shipping faster than it accumulates; (5) is the instance ahead on disk, memory, and connection-pool capacity; (6) are backups current and durable enough to meet the recovery objective. Cross-channel area joins query and pool signals to commerce-sibling checkout funnel to size live revenue at risk.

What this audit checks

Authentication & access

  • Connection string resolves and the monitoring role can read pg_catalog (USAGE granted, pg_read_all_stats present)
  • pg_stat_statements extension is installed and present in shared_preload_libraries (slow-query KPIs depend on it)
  • sslmode=require is set on managed endpoints (RDS / Aurora / Crunchy / Heroku); plaintext flagged
  • Monitoring role is read-only - no write or superuser privileges beyond stats access

Connection & availability

  • Instance is reachable and pg_postmaster_start_time() returns a recent, stable uptime (no unplanned restart)
  • Connection error count over 24h below the alert band (transient auth / network failures bounded)
  • PostgreSQL major version is supported and not past end-of-life from pg_settings server_version
  • max_connections from pg_settings sized with headroom against observed peak backend count

Query performance

  • Query latency p95 within band - mean_exec_time aggregate from pg_stat_statements under threshold
  • Query latency p99 tail not diverging from p95 (no long-tail lock contention)
  • Slow-query rate (mean_exec_time > 100ms as share of total calls) under threshold
  • Buffer cache hit rate (blks_hit / blks_hit + blks_read from pg_stat_database) above 95% on OLTP

Replication & lag

  • Streaming replication lag in seconds (replay_lag from pg_stat_replication) under threshold
  • WAL bytes pending (pg_wal_lsn_diff primary vs replay_lsn) not growing sustained - standby keeping up
  • Every configured standby has state=streaming - none in catchup, broken, or stopped
  • At least one healthy standby with sub-second lag exists for a safe promotion path

Storage & capacity

  • Database disk usage percent under threshold with growth-rate runway before exhaustion
  • Connection-pool saturation (active + idle-in-tx / max_connections) under threshold
  • Idle-in-transaction backend count bounded - stuck transactions hold locks and bloat the pool
  • Autovacuum current on hot tables - oldest n_dead_tup table not unvacuumed past threshold (wraparound risk)

Backups & durability

  • Last successful base backup / WAL archive age within the recovery-point objective
  • WAL archiving is active and not failing (archive_command succeeding, no pile-up)
  • Point-in-time recovery window covers the required retention horizon

Cross-channel: revenue protection

  • QPS spike with no concurrent ecom order spike (sibling = shopify / bigcommerce / adobe order rate flat) flagged as bot / scraper load
  • Pool saturation > 90% during a commerce-sibling traffic burst (pool exhaustion = lost checkouts)
  • Slow-query spike co-occurring with a sibling checkout-completion drop in the same 5-min window
  • Inventory-table row drift vs sibling product_inventory count (oversell risk when DB is source of truth)

Severity thresholds

SignalWarnCritical
connection_error_rate15
query_p95_ms200500
replication_lag_sec1030
disk_usage_pct8090
slow_query_count515

Data sources

  • GET postgresql://{host}:{port}/{database}#pg_settings - Instance config - version, max_connections, shared_buffers, server_version
  • GET postgresql://{host}:{port}/{database}#pg_stat_activity - Live backends, states, idle-in-transaction, application_name for pool saturation
  • GET postgresql://{host}:{port}/{database}#pg_stat_statements - Normalised query stats - mean_exec_time, calls, total_exec_time for slow-query + p95/p99
  • GET postgresql://{host}:{port}/{database}#pg_stat_replication - Standby state, replay_lag, WAL pending bytes for replication health
  • GET postgresql://{host}:{port}/{database}#pg_stat_database - Cache hit rate, deadlocks, connection errors per database
  • GET postgresql://{host}:{port}/{database}#pg_stat_user_tables - Dead tuples + last autovacuum per table for bloat / starvation checks