A status page that lies to customers is worse than no status page. This audit answers four questions: (1) does the public status page match reality right now (components down with no published incident, incidents open on healthy components), (2) are the incidents we publish handled with discipline (acknowledged/resolved within target, not left stale), (3) are we keeping the uptime SLA the page advertises, and (4) when a component IS down, how much money is on fire per minute while a commerce sibling is live.

What this audit checks

Authentication & access

  • API key valid (auth on GET /v1/pages - returns the account’s pages)
  • page_id resolves to a page the key can read
  • Key has read scope on components, incidents, and metrics
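The three access checks can be sketched as one probe routine. This is a minimal sketch, not the connector's implementation. It assumes the key is sent in the OAuth-style `Authorization` header that Statuspage documents, and that "read scope" can be approximated by whether a GET on each resource returns without an HTTP error. `check_access`, `build_request` and `PROBE_PATHS` are hypothetical names, and `opener` is injectable so the routine can run against a fake in tests.

```python
import json
import urllib.error
import urllib.request

API_BASE = "https://api.statuspage.io/v1"
# Resources the audit reads; probing each one stands in for a real scope check (assumption).
PROBE_PATHS = ("components", "incidents", "metrics")


def build_request(path: str, api_key: str) -> urllib.request.Request:
    # Assumption: the key travels as "Authorization: OAuth <key>".
    return urllib.request.Request(
        API_BASE + path, headers={"Authorization": f"OAuth {api_key}"}
    )


def check_access(api_key: str, page_id: str, opener=urllib.request.urlopen) -> dict:
    """Run the auth, page-resolution and read-access checks in order."""
    result = {"key_valid": False, "page_resolves": False, "missing_read": []}
    try:
        with opener(build_request("/pages", api_key)) as resp:
            pages = json.load(resp)
    except urllib.error.HTTPError:
        return result  # /pages rejected the key outright
    result["key_valid"] = True
    result["page_resolves"] = any(p.get("id") == page_id for p in pages)
    if not result["page_resolves"]:
        return result  # no point probing resources on a page we cannot see
    for resource in PROBE_PATHS:
        try:
            with opener(build_request(f"/pages/{page_id}/{resource}", api_key)):
                pass
        except urllib.error.HTTPError:
            result["missing_read"].append(resource)
    return result
```

Injecting `opener` keeps the decision logic separate from the network, so the same function serves both the live audit and offline tests.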

Status-page truthfulness (the customer-trust test)

  • Components in partial_outage / major_outage with NO open incident published (page is lying)
  • Open incident referencing components that are all operational (stale incident, page over-reporting)
  • Components stuck in under_maintenance > 24h (forgotten maintenance window)
  • Components flagged only_show_if_degraded that are degraded but still hidden from the public page
  • System metrics with no data in > 60 min (stale performance display)
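The first three truthfulness checks reduce to set logic over the components list and the unresolved incidents. A minimal sketch, assuming component objects carry `id`, `status` and `updated_at`, and that each incident lists its affected components under `components`. Using `updated_at` as "time the component entered maintenance" is an approximation, since any edit to the component also moves it. `truthfulness_findings` and `parse_ts` are hypothetical helpers.

```python
from datetime import datetime, timedelta

DOWN = {"partial_outage", "major_outage"}


def parse_ts(value: str) -> datetime:
    # ISO 8601 timestamps; rewrite a trailing "Z" so older Pythons can parse it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def truthfulness_findings(components, unresolved_incidents, now):
    """Return ids failing each of the first three truthfulness checks."""
    # Every component id referenced by at least one open incident.
    covered = {c["id"] for inc in unresolved_incidents for c in inc.get("components", [])}
    current = {c["id"]: c["status"] for c in components}

    # Down on the page, but nothing published explains it.
    lying = [c["id"] for c in components if c["status"] in DOWN and c["id"] not in covered]
    # Open incident whose components are all operational *now* (not at incident creation).
    stale_incidents = [
        inc["id"]
        for inc in unresolved_incidents
        if inc.get("components")
        and all(current.get(c["id"]) == "operational" for c in inc["components"])
    ]
    # Maintenance that has not changed for more than 24h (updated_at used as a proxy).
    stuck_maintenance = [
        c["id"]
        for c in components
        if c["status"] == "under_maintenance"
        and now - parse_ts(c["updated_at"]) > timedelta(hours=24)
    ]
    return {"lying": lying, "stale_incidents": stale_incidents, "stuck_maintenance": stuck_maintenance}
```

Stale incidents are judged against the live component status rather than the snapshot embedded in the incident, because the question is whether the page is over-reporting right now.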

Incident hygiene

  • Incidents in investigating state > 30 min without an identified/monitoring update
  • Mean time to acknowledge > 15 min (slow first update)
  • Mean time to resolve > 60 min (slow recovery)
  • Repeated incidents on the same component in 24h (noisy or unstable)
  • A single top-alerting component accounting for > 50% of all incidents
  • Major/critical-impact incidents resolved without a postmortem
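A minimal sketch of the MTTA / MTTR, stuck-investigating and concentration checks, in milliseconds to match the `mtta_ms` / `mttr_ms` thresholds. The definitions are assumptions, not documented behaviour: acknowledgement is taken as the first incident update after `started_at`, resolution as `resolved_at - started_at`, and field names (`started_at`, `created_at`, `resolved_at`, `status`, `incident_updates`, `components`) are assumed to match the incidents response. `hygiene` is a hypothetical helper.

```python
from collections import Counter
from datetime import datetime, timedelta
from statistics import mean


def parse_ts(value: str) -> datetime:
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def hygiene(incidents, now):
    """Compute mean acknowledge/resolve times and flag stuck or concentrated incidents."""
    ack_ms, resolve_ms, stuck = [], [], []
    for inc in incidents:
        start = parse_ts(inc["started_at"])
        updates = inc.get("incident_updates", [])
        if updates:  # assumed definition: first published update = acknowledgement
            first = min(parse_ts(u["created_at"]) for u in updates)
            ack_ms.append((first - start).total_seconds() * 1000)
        if inc.get("resolved_at"):
            resolve_ms.append((parse_ts(inc["resolved_at"]) - start).total_seconds() * 1000)
        progressed = any(u["status"] in ("identified", "monitoring") for u in updates)
        if (inc["status"] == "investigating" and not progressed
                and now - parse_ts(inc["created_at"]) > timedelta(minutes=30)):
            stuck.append(inc["id"])
    # How many incidents touch the single most-affected component.
    per_component = Counter(c["id"] for inc in incidents for c in inc.get("components", []))
    top_id, top_n = per_component.most_common(1)[0] if per_component else (None, 0)
    return {
        "mtta_ms": mean(ack_ms) if ack_ms else None,
        "mttr_ms": mean(resolve_ms) if resolve_ms else None,
        "stuck_investigating": stuck,
        "top_component": top_id,
        "top_share": top_n / len(incidents) if incidents else 0.0,
    }
```

A `top_share` above 0.5 corresponds to the concentration check; `mtta_ms` and `mttr_ms` feed straight into the severity table.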

Reliability & SLA health

  • Component-group availability below SLA target (< 99.5% rolling 30D)
  • More than one component in major_outage concurrently (correlated outage)
  • Components in degraded_performance sustained > 15 min
  • Apdex below 0.85 / p95 above 1500ms on published system metrics
  • Error rate above 2% on published system metrics
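Rolling-window availability is downtime over window length, once overlapping outages are merged so the same minutes are not counted twice. A minimal sketch, assuming outage intervals `(start, end)` for a component or group are already available (for example, reconstructed from incident history; how they are derived is not specified here). `availability_pct` and `correlated_major_outages` are hypothetical helpers.

```python
from datetime import datetime, timedelta


def availability_pct(outages, window_end, window=timedelta(days=30)):
    """Percent of the rolling window not covered by any (start, end) outage interval."""
    window_start = window_end - window
    # Clip each interval to the window, dropping ones that fall entirely outside it.
    clipped = sorted(
        (max(start, window_start), min(end, window_end))
        for start, end in outages
        if end > window_start and start < window_end
    )
    downtime = timedelta(0)
    run_start = run_end = None
    for start, end in clipped:
        if run_end is not None and start <= run_end:
            run_end = max(run_end, end)  # overlaps the current run: extend it
            continue
        if run_end is not None:
            downtime += run_end - run_start  # close the previous run
        run_start, run_end = start, end
    if run_end is not None:
        downtime += run_end - run_start
    return 100.0 * (1 - downtime / window)


def correlated_major_outages(components):
    """Ids in major_outage, returned only when more than one is down at once."""
    down = [c["id"] for c in components if c["status"] == "major_outage"]
    return down if len(down) > 1 else []
```

With two overlapping outages totalling 3 hours inside a 30-day (720 h) window, availability is 100 × (1 − 3/720) ≈ 99.583%, which clears the 99.5% SLA floor but still sits below the 99.9% warn line.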

Cross-channel: revenue-at-risk (the killer area)

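  • Component in major_outage while a sibling commerce connector is live: compute revenue at risk = commerce.revenue_per_min × outage_minutes × estimated_traffic_loss_pct / 100 (the $/min burn rate is commerce.revenue_per_min × estimated_traffic_loss_pct / 100)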
  • Checkout / cart component degraded or down during peak commerce hours
  • Outage window overlapping a campaign push (sibling = google_ads / klaviyo) - paying for traffic that lands on a broken page
  • Conversion drop during published-incident windows (vs 90D baseline)
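The revenue-at-risk arithmetic and the window-overlap test behind the campaign and peak-hours checks fit in two small functions. A minimal sketch, assuming `estimated_traffic_loss_pct` is on the same 0-100 scale as the other `_pct` signals in this profile; `revenue_at_risk` and `overlap_minutes` are hypothetical names.

```python
from datetime import datetime


def revenue_at_risk(revenue_per_min, outage_minutes, estimated_traffic_loss_pct):
    """Return ($/min burn rate, total $ at risk); the pct argument is on a 0-100 scale."""
    burn_per_min = revenue_per_min * estimated_traffic_loss_pct / 100
    return burn_per_min, burn_per_min * outage_minutes


def overlap_minutes(a_start, a_end, b_start, b_end):
    """Minutes during which two [start, end) windows overlap (0.0 if they do not)."""
    start, end = max(a_start, b_start), min(a_end, b_end)
    return max((end - start).total_seconds() / 60, 0.0)
```

For example, $120/min of commerce revenue, a 45-minute major outage and an estimated 40% traffic loss give a burn rate of $48/min and $2,160 at risk. Passing the outage window and a campaign push window to `overlap_minutes` gives the minutes of paid traffic landing on a broken page; passing a peak-hours window instead covers the checkout/cart check.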

Severity thresholds

Signal                          Warn              Critical
availability_pct                99.9              99.5
incidents_open_count            1                 3
mtta_ms                         300000 (5 min)    900000 (15 min)
mttr_ms                         1800000 (30 min)  3600000 (60 min)
components_major_outage_count   1                 1
components_degraded_count       1                 2
untruthful_components_count     0                 1
error_rate_pct                  1                 2
p95_latency_ms                  800               1500
apdex                           0.9               0.85
metric_staleness_min            30                60
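The table does not state whether thresholds are inclusive or which direction is "worse". This sketch assumes strict comparisons, matching the prose checks above ("> 15 min", "< 99.5%", "above 1500ms"), and treats availability_pct and apdex as lower-is-worse and every other signal as higher-is-worse. `THRESHOLDS` and `severity` are hypothetical names; the numbers are copied from the table.

```python
# signal: (warn, critical, higher_is_worse), values copied from the table above.
THRESHOLDS = {
    "availability_pct": (99.9, 99.5, False),
    "incidents_open_count": (1, 3, True),
    "mtta_ms": (300_000, 900_000, True),
    "mttr_ms": (1_800_000, 3_600_000, True),
    "components_major_outage_count": (1, 1, True),
    "components_degraded_count": (1, 2, True),
    "untruthful_components_count": (0, 1, True),
    "error_rate_pct": (1, 2, True),
    "p95_latency_ms": (800, 1500, True),
    "apdex": (0.9, 0.85, False),
    "metric_staleness_min": (30, 60, True),
}


def severity(signal: str, value: float) -> str:
    """Return 'ok', 'warn' or 'critical' using strict comparisons (assumption)."""
    warn, crit, higher_is_worse = THRESHOLDS[signal]

    def worse_than(threshold):
        return value > threshold if higher_is_worse else value < threshold

    if worse_than(crit):
        return "critical"
    if worse_than(warn):
        return "warn"
    return "ok"
```

Under strict comparisons, one untruthful component is a warning and two are critical, and components_major_outage_count only goes critical when more than one component is down at once, which lines up with the correlated-outage check.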

Data sources

  • GET https://api.statuspage.io/v1/pages - Auth + page inventory
  • GET https://api.statuspage.io/v1/pages/{page_id}/components - Component status inventory (healthy/degraded/down truthfulness)
  • GET https://api.statuspage.io/v1/pages/{page_id}/component-groups - Group availability rollups for SLA
  • GET https://api.statuspage.io/v1/pages/{page_id}/incidents - Incident inventory + MTTA / MTTR + revenue-at-risk join
  • GET https://api.statuspage.io/v1/pages/{page_id}/incidents/unresolved - Currently-open incidents for truthfulness cross-check
  • GET https://api.statuspage.io/v1/pages/{page_id}/metrics - Published system metrics (apdex / latency / error-rate / throughput)
  • GET https://api.statuspage.io/v1/pages/{page_id}/metrics/{metric_id}/data - Metric data series for freshness + threshold checks
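The data sources fall into two phases: six endpoints need only the page id, while the per-metric data URLs depend on the ids returned by `/metrics`. A minimal sketch of that plan, assuming each entry in the `/metrics` response carries an `id` field. `audit_urls` and `metric_data_urls` are hypothetical helpers that only build URLs and make no requests.

```python
API_BASE = "https://api.statuspage.io/v1"


def audit_urls(page_id):
    """Phase 1: endpoints that need only the page id."""
    page = f"{API_BASE}/pages/{page_id}"
    return {
        "pages": f"{API_BASE}/pages",
        "components": f"{page}/components",
        "component_groups": f"{page}/component-groups",
        "incidents": f"{page}/incidents",
        "unresolved": f"{page}/incidents/unresolved",
        "metrics": f"{page}/metrics",
    }


def metric_data_urls(page_id, metrics_response):
    """Phase 2: one data-series URL per metric id found in the /metrics response."""
    page = f"{API_BASE}/pages/{page_id}"
    return {m["id"]: f"{page}/metrics/{m['id']}/data" for m in metrics_response}
```

Keeping the two phases separate makes the dependency explicit: the freshness and threshold checks cannot be scheduled until the metrics inventory has been fetched.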