Opsgenie alert and incident state means little to a merchant unless it is joined to the revenue those services protect. This audit answers four questions: (1) is the API key still valid, and are alerts, incidents, and services readable; (2) does the on-call process actually cover the alerts that fire (un-acknowledged alerts, routing gaps, noisy services); (3) are alerts acknowledged and incidents resolved fast enough to hold SLA (MTTA / MTTR / SLA compliance); and (4) when a service that fronts a commerce-critical path is on fire, how much money is burning per minute?

What this audit checks

Authentication & access

  • API key valid (auth on /v2/account) and not revoked
  • Region host correct (US = api.opsgenie.com / EU = api.eu.opsgenie.com)
  • Key has read scope on Alerts, Incidents, and Services
  • Request-quota headroom > 15% (429 / Retry-After avoidance)
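
The access checks above can be sketched as a few small helpers. This is a minimal illustration, not the audit's actual implementation: the function names are hypothetical, and the `used` / `limit` inputs to the headroom check are assumed to come from whatever quota information the caller has available. The `GenieKey` authorization scheme and the two hosts are the ones Opsgenie documents.

```python
# Region hosts as listed in the checklist above.
REGION_HOSTS = {"us": "api.opsgenie.com", "eu": "api.eu.opsgenie.com"}


def account_url(region: str) -> str:
    """Build the /v2/account URL used for the key sanity check."""
    host = REGION_HOSTS[region.lower()]  # KeyError on an unknown region
    return f"https://{host}/v2/account"


def auth_headers(api_key: str) -> dict:
    """Opsgenie's REST API expects the 'GenieKey' authorization scheme."""
    return {"Authorization": f"GenieKey {api_key}"}


def quota_headroom_ok(used: int, limit: int, min_headroom: float = 0.15) -> bool:
    """True when the unused share of the request quota exceeds min_headroom.

    `used` and `limit` are hypothetical inputs supplied by the caller.
    """
    if limit <= 0:
        return False
    return (limit - used) / limit > min_headroom
```

A 401 or 403 from the account URL would point at a revoked key or a wrong region host; the sketch leaves the HTTP call itself to the caller.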

Alert coverage (the blind-spot test)

  • Open alerts un-acknowledged > 30 min (no on-call pickup)
  • Alerts with no responder / routing rule match (fires into the void)
  • Any P1 / P2 alert left un-acknowledged (highest-severity coverage gap)
  • Services with sustained alert volume but no declared incident (noise drowning signal)
  • Alerts auto-closed without acknowledgement (silent dismissals)
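
As a rough sketch of the blind-spot test, the function below buckets alert records into four of the gaps listed above. It assumes each alert is a dict with `id`, `status`, `acknowledged`, `priority`, `responders`, and an ISO-8601 `createdAt`, which mirrors the shape of Opsgenie's list-alerts payload but should be verified against real responses. The function and bucket names are hypothetical, and "closed without acknowledgement" is only an approximation of "auto-closed".

```python
from datetime import datetime, timedelta, timezone


def parse_ts(ts: str) -> datetime:
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def coverage_gaps(alerts, now=None, max_unacked=timedelta(minutes=30)):
    """Return alert ids grouped by coverage gap (field names are assumed)."""
    now = now or datetime.now(timezone.utc)
    gaps = {
        "stale_unacked": [],       # open, un-acknowledged for > 30 min
        "no_responder": [],        # open with no responder attached
        "p1_p2_unacked": [],       # open P1 / P2 not yet acknowledged
        "closed_without_ack": [],  # closed but never acknowledged
    }
    for a in alerts:
        is_open = a["status"] == "open"
        acked = a.get("acknowledged", False)
        if is_open and not acked and now - parse_ts(a["createdAt"]) > max_unacked:
            gaps["stale_unacked"].append(a["id"])
        if is_open and not a.get("responders"):
            gaps["no_responder"].append(a["id"])
        if is_open and not acked and a.get("priority") in ("P1", "P2"):
            gaps["p1_p2_unacked"].append(a["id"])
        if a["status"] == "closed" and not acked:
            gaps["closed_without_ack"].append(a["id"])
    return gaps
```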

Response speed & SLA health

  • MTTA above 5 min sustained (acknowledgement lag = routing / coverage problem)
  • MTTR above 60 min sustained (resolution lag = capacity problem)
  • SLA compliance below 99.5% (reliability commitment slipping)
  • Any open incident with no update in the last 30 min (stalled response)
  • Apdex below 0.85 / error rate > 2% / p95 > 1500ms on a tracked service
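
One way to derive the timing checks above is to average elapsed time over (start, end) pairs: created-to-acknowledged for MTTA, created-to-resolved for MTTR. The sketch below assumes the caller has already extracted those timestamps as `datetime` objects. The helper names and the `status` / `updated_at` incident keys are hypothetical, and what counts as "sustained" is left to the caller.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_seconds(pairs):
    """Mean elapsed seconds across (start, end) datetime pairs; None if empty."""
    deltas = [(end - start).total_seconds() for start, end in pairs]
    return mean(deltas) if deltas else None


def stalled_incidents(incidents, now, max_quiet=timedelta(minutes=30)):
    """Ids of open incidents whose last update is older than max_quiet."""
    return [
        i["id"]
        for i in incidents
        if i["status"] == "open" and now - i["updated_at"] > max_quiet
    ]
```

Feeding created/acknowledged pairs into `mean_seconds` gives MTTA in seconds, which can be compared with the 5 min (300 s) line; created/resolved pairs give MTTR for the 60 min (3600 s) line.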

Alert economics & noise

  • Top alerting service alert volume > 2σ vs its 30-day baseline (noise spike)
  • Recurring error-type cluster trending up (fix-at-source candidate)
  • Flapping alerts (more than 3 open -> close -> open cycles in 24h on the same alias)
  • Throughput on a tracked service dropped > 30% WoW (capacity / outage signal)
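
The noise-spike and flapping checks above reduce to two small calculations. The sketch below is illustrative only: it assumes the 30-day baseline is a list of daily alert counts, uses the sample standard deviation, and counts a "flap" as a closed-to-open transition in a chronological status history for one alias. Both function names are hypothetical.

```python
from statistics import mean, stdev


def volume_sigma(today, baseline):
    """Standard deviations by which today's volume exceeds the baseline mean.

    Returns None when the baseline is too short or has zero variance.
    """
    if len(baseline) < 2:
        return None
    sd = stdev(baseline)
    if sd == 0:
        return None
    return (today - mean(baseline)) / sd


def is_flapping(statuses, max_reopens=3):
    """statuses: chronological 'open' / 'closed' values for one alias in 24h."""
    reopens = sum(
        1
        for prev, cur in zip(statuses, statuses[1:])
        if prev == "closed" and cur == "open"
    )
    return reopens > max_reopens
```

A result above 2 from `volume_sigma` corresponds to the 2σ noise-spike line in the checklist.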

Cross-channel: revenue-at-risk (the killer area)

  • Open incident whose impactedServices intersect a commerce sibling’s checkout / payment service = compute $/min lost (commerce.revenue_per_min × incident_minutes × estimated_traffic_loss_pct)
  • Alert storm (> 10 alerts/h) on a service that fronts checkout / payments / search during peak hours
  • Alert spike on a commerce-critical service during a campaign push (sibling = google_ads / amazon_ads / klaviyo) - paying for traffic that can’t convert
  • MTTR degradation on commerce-critical services correlated with a sibling commerce conversion / abandonment regression
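
The revenue-at-risk formula in the first bullet above can be worked through directly. This sketch assumes `estimated_traffic_loss_pct` is expressed on a 0 to 100 scale (the source does not say) and that service identifiers can be compared as plain strings. The function names and the example figures are illustrative, not real data.

```python
def touches_commerce(impacted_services, commerce_critical):
    """True when an incident's impactedServices overlap the commerce-critical set."""
    return bool(set(impacted_services) & set(commerce_critical))


def revenue_at_risk(revenue_per_min, incident_minutes, estimated_traffic_loss_pct):
    """revenue_per_min x incident_minutes x estimated_traffic_loss_pct.

    Assumes the loss percentage is on a 0-100 scale.
    """
    return revenue_per_min * incident_minutes * estimated_traffic_loss_pct / 100.0


# Illustrative figures: $200/min of checkout revenue, a 30-minute incident,
# and an estimated 50% of traffic unable to convert.
if touches_commerce(["svc-checkout"], {"svc-checkout", "svc-payments"}):
    loss = revenue_at_risk(200, 30, 50)  # 200 * 30 * 0.5 = 3000.0
```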

Severity thresholds

Signal                               Warn    Critical
alerts_unacknowledged_30min_count    1       5
p1_p2_unacknowledged_count           0       1
mtta_seconds                         300     600
mttr_seconds                         3600    7200
sla_compliance_pct                   99.9    99.5
incidents_open_count                 1       3
services_degraded_count              1       2
services_down_count                  0       1
top_service_alert_volume_sigma       2       3
throughput_change_pct_wow            -15     -30
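
A minimal sketch of how the thresholds above might be applied to a measured value. Two points here are assumptions rather than anything the source states: comparisons are strict (a value must pass the threshold, not merely equal it), which matches the "above" / "below" wording in the checklist and keeps the zero Warn thresholds meaningful; and a signal is treated as "lower is worse" whenever its Critical value is smaller than its Warn value.

```python
# (warn, critical) pairs from the severity table.
THRESHOLDS = {
    "alerts_unacknowledged_30min_count": (1, 5),
    "p1_p2_unacknowledged_count": (0, 1),
    "mtta_seconds": (300, 600),
    "mttr_seconds": (3600, 7200),
    "sla_compliance_pct": (99.9, 99.5),
    "incidents_open_count": (1, 3),
    "services_degraded_count": (1, 2),
    "services_down_count": (0, 1),
    "top_service_alert_volume_sigma": (2, 3),
    "throughput_change_pct_wow": (-15, -30),
}


def severity(signal, value):
    """Classify a value as 'ok', 'warn', or 'critical' (strict comparison assumed)."""
    warn, crit = THRESHOLDS[signal]
    if crit < warn:  # lower is worse, e.g. SLA compliance or throughput change
        if value < crit:
            return "critical"
        if value < warn:
            return "warn"
    else:            # higher is worse, e.g. counts and durations
        if value > crit:
            return "critical"
        if value > warn:
            return "warn"
    return "ok"
```

If the audit actually treats thresholds as inclusive, swapping the strict operators for `<=` and `>=` is the only change needed.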

Data sources

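In the URLs below, {region} is empty for US accounts and "eu." (with the trailing dot) for EU accounts, matching the region hosts listed under Authentication & access.

  • GET https://api.{region}opsgenie.com/v2/account - Auth + key sanity + region check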
  • GET https://api.{region}opsgenie.com/v2/alerts - Alert inventory + acknowledgement + routing coverage + MTTA
  • GET https://api.{region}opsgenie.com/v2/alerts/count - Alert-volume counts for top-N + noise / baseline checks
  • GET https://api.{region}opsgenie.com/v1/incidents - Incident inventory + impactedServices + MTTR (revenue-at-risk join)
  • GET https://api.{region}opsgenie.com/v1/services - Service inventory + health state + open alert/incident counts