What this audit checks
Authentication & access
- API key valid (auth on /v2/account) and not revoked
- Region host correct (US = api.opsgenie.com / EU = api.eu.opsgenie.com)
- Key has read scope on Alerts, Incidents, and Services
- Request-quota headroom > 15% (429 / Retry-After avoidance)
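A minimal sketch of the region-host and key check, assuming the two hosts listed above and Opsgenie's `GenieKey` authorization scheme. The names `REGION_HOSTS` and `account_request` are illustrative only; the function just builds the request and leaves the HTTP call to whichever client the audit uses.

```python
# Hosts come from the checklist above; all names here are illustrative.
REGION_HOSTS = {
    "us": "api.opsgenie.com",
    "eu": "api.eu.opsgenie.com",
}

def account_request(region: str, api_key: str) -> tuple[str, dict[str, str]]:
    """Return the URL and headers for the /v2/account sanity check."""
    host = REGION_HOSTS[region.lower()]  # raises KeyError on an unknown region
    url = f"https://{host}/v2/account"
    headers = {"Authorization": f"GenieKey {api_key}"}
    return url, headers

url, headers = account_request("EU", "example-key")
print(url)
```

A 401 or 403 on this request usually points at a revoked key or the wrong region host, which is why the check runs before anything else.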
Alert coverage (the blind-spot test)
- Open alerts un-acknowledged > 30 min (no on-call pickup)
- Alerts with no responder / routing rule match (fires into the void)
- Any P1 / P2 alert left un-acknowledged (highest-severity coverage gap)
- Services with sustained alert volume but no declared incident (noise drowning signal)
- Alerts auto-closed without acknowledgement (silent dismissals)
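Three of the coverage checks above can be sketched as a single pass over the alert list. This assumes each alert is a dict carrying `id`, `status`, `acknowledged`, `createdAt`, `priority`, and `responders`, as in the Alert API list response; the function name and the sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=30)  # threshold from the checklist

def parse_ts(value: str) -> datetime:
    # Normalise a trailing "Z" so this also works before Python 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def coverage_gaps(alerts: list[dict], now: datetime) -> dict[str, list[str]]:
    """Bucket open alerts into the blind-spot categories."""
    gaps = {"stale": [], "unrouted": [], "p1_p2_unacked": []}
    for alert in alerts:
        if alert.get("status") != "open":
            continue
        if not alert.get("responders"):
            gaps["unrouted"].append(alert["id"])  # fires into the void
        if alert.get("acknowledged"):
            continue
        if now - parse_ts(alert["createdAt"]) > STALE_AFTER:
            gaps["stale"].append(alert["id"])  # no on-call pickup
        if alert.get("priority") in ("P1", "P2"):
            gaps["p1_p2_unacked"].append(alert["id"])
    return gaps

# Hypothetical sample data, not real API output.
now = datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc)
alerts = [
    {"id": "a1", "status": "open", "acknowledged": False, "priority": "P3",
     "createdAt": "2024-01-01T10:00:00Z", "responders": [{"type": "team", "id": "t1"}]},
    {"id": "a2", "status": "open", "acknowledged": False, "priority": "P1",
     "createdAt": "2024-01-01T10:50:00Z", "responders": []},
    {"id": "a3", "status": "open", "acknowledged": True, "priority": "P2",
     "createdAt": "2024-01-01T09:00:00Z", "responders": [{"type": "team", "id": "t1"}]},
    {"id": "a4", "status": "closed", "acknowledged": False, "priority": "P1",
     "createdAt": "2024-01-01T08:00:00Z", "responders": []},
]
print(coverage_gaps(alerts, now))
```

The volume-without-incident and auto-close checks need incident and close-event data as well, so they are left out of this sketch.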
Response speed & SLA health
- MTTA above 5 min sustained (acknowledgement lag = routing / coverage problem)
- MTTR above 60 min sustained (resolution lag = capacity problem)
- SLA compliance below 99.5% (reliability commitment slipping)
- Open incidents with no update in the last 30 min (stalled response)
- Apdex below 0.85 / error rate > 2% / p95 > 1500ms on a tracked service
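The checklist does not say how Apdex is derived. A minimal sketch using the standard definition (satisfied requests plus half of the tolerating ones, divided by all requests) might look like this; the target threshold `threshold_s` and the sample latencies are assumptions, not values from the audit.

```python
APDEX_FLOOR = 0.85  # floor from the checklist

def apdex(response_times_s: list[float], threshold_s: float) -> float:
    """Standard Apdex: (satisfied + tolerating / 2) / total samples."""
    if not response_times_s:
        raise ValueError("need at least one sample")
    satisfied = sum(1 for t in response_times_s if t <= threshold_s)
    tolerating = sum(
        1 for t in response_times_s if threshold_s < t <= 4 * threshold_s
    )
    return (satisfied + tolerating / 2) / len(response_times_s)

# Hypothetical latencies in seconds, with an assumed target of 0.5 s.
samples = [0.2, 0.4, 0.6, 1.9, 2.5]
score = apdex(samples, threshold_s=0.5)
print(score, "flagged" if score < APDEX_FLOOR else "ok")
```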
Alert economics & noise
- Top alerting service alert volume > 2σ vs its 30-day baseline (noise spike)
- Recurring error-type cluster trending up (fix-at-source candidate)
- Flapping alerts (open -> close -> open > 3 times in 24h on same alias)
- Throughput on a tracked service dropped > 30% WoW (capacity / outage signal)
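The noise-spike and throughput checks above reduce to two small calculations. The sketch below assumes daily alert counts for the baseline window and weekly throughput totals are already available; the function names and sample numbers are illustrative, and the short baseline stands in for the 30-day window.

```python
import statistics

def volume_sigma(baseline_counts: list[int], today: int) -> float:
    """How many sample standard deviations today's count sits above the baseline mean."""
    mean = statistics.mean(baseline_counts)
    stdev = statistics.stdev(baseline_counts)  # needs at least two data points
    if stdev == 0:
        return 0.0 if today == mean else float("inf")
    return (today - mean) / stdev

def wow_change_pct(previous_week: float, current_week: float) -> float:
    """Week-over-week change in percent; negative means a drop."""
    return (current_week - previous_week) * 100 / previous_week

# Hypothetical numbers for illustration only.
baseline = [40, 42, 38, 41, 39, 43, 40, 37, 44, 41]
print(volume_sigma(baseline, today=60) > 2)   # noise-spike check
print(wow_change_pct(1000, 650) < -30)        # throughput-drop check
```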
Cross-channel: revenue-at-risk (the killer area)
- Open incident whose impactedServices intersect a commerce sibling’s checkout / payment service = compute $ lost (commerce.revenue_per_min × incident_minutes × estimated_traffic_loss_pct)
- Alert storm (> 10 alerts/h) on a service that fronts checkout / payments / search during peak hours
- Alert spike on a commerce-critical service during a campaign push (sibling = google_ads / amazon_ads / klaviyo) - paying for traffic that can’t convert
- MTTR degradation on commerce-critical services correlated with a sibling commerce conversion / abandonment regression
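The revenue-at-risk join in the first bullet can be sketched as a set intersection followed by the stated product. This assumes `impactedServices` is a list of service IDs on the incident and that `estimated_traffic_loss_pct` is expressed on a 0 to 100 scale; the service IDs and figures below are made up.

```python
def revenue_at_risk(
    impacted_services: list[str],
    commerce_critical: set[str],
    revenue_per_min: float,
    incident_minutes: float,
    estimated_traffic_loss_pct: float,
) -> float:
    """Estimated revenue lost, or 0.0 if no commerce-critical service is hit."""
    if not commerce_critical.intersection(impacted_services):
        return 0.0
    # revenue_per_min x incident_minutes x loss fraction (pct assumed 0-100)
    return revenue_per_min * incident_minutes * estimated_traffic_loss_pct / 100

# Hypothetical incident: checkout impacted for 45 min at 40% traffic loss.
loss = revenue_at_risk(
    impacted_services=["svc-checkout", "svc-search-index"],
    commerce_critical={"svc-checkout", "svc-payments"},
    revenue_per_min=120.0,
    incident_minutes=45,
    estimated_traffic_loss_pct=40.0,
)
print(f"${loss:,.2f} at risk")
```

Dropping the `incident_minutes` factor gives the per-minute burn rate, which is the more useful number while the incident is still open.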
Severity thresholds
| Signal | Warn | Critical |
|---|---|---|
| alerts_unacknowledged_30min_count | 1 | 5 |
| p1_p2_unacknowledged_count | 0 | 1 |
| mtta_seconds | 300 | 600 |
| mttr_seconds | 3600 | 7200 |
| sla_compliance_pct | 99.9 | 99.5 |
| incidents_open_count | 1 | 3 |
| services_degraded_count | 1 | 2 |
| services_down_count | 0 | 1 |
| top_service_alert_volume_sigma | 2 | 3 |
| throughput_change_pct_wow | -15 | -30 |
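One way to apply the table is a lookup that also records which direction is bad for each signal. The table does not say whether the boundaries are inclusive; this sketch assumes strict comparison (a value must exceed the threshold to trip it), which matches the "above" / "below" wording in the checks and lets a warn value of 0 mean "any occurrence". The `classify` helper is hypothetical.

```python
# signal: (warn, critical, higher_is_worse); values copied from the table.
THRESHOLDS = {
    "alerts_unacknowledged_30min_count": (1, 5, True),
    "p1_p2_unacknowledged_count": (0, 1, True),
    "mtta_seconds": (300, 600, True),
    "mttr_seconds": (3600, 7200, True),
    "sla_compliance_pct": (99.9, 99.5, False),
    "incidents_open_count": (1, 3, True),
    "services_degraded_count": (1, 2, True),
    "services_down_count": (0, 1, True),
    "top_service_alert_volume_sigma": (2, 3, True),
    "throughput_change_pct_wow": (-15, -30, False),
}

def classify(signal: str, value: float) -> str:
    """Return 'ok', 'warn', or 'critical' (strict comparison is an assumption)."""
    warn, critical, higher_is_worse = THRESHOLDS[signal]
    if higher_is_worse:
        if value > critical:
            return "critical"
        if value > warn:
            return "warn"
    else:
        if value < critical:
            return "critical"
        if value < warn:
            return "warn"
    return "ok"

print(classify("mtta_seconds", 450))
print(classify("sla_compliance_pct", 99.4))
```

If the thresholds are meant to be inclusive, swapping `>` for `>=` and `<` for `<=` is the only change needed, though a warn value of 0 would then fire on every run.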
Data sources
- GET https://api.{region}opsgenie.com/v2/account - Auth + key sanity + region check
- GET https://api.{region}opsgenie.com/v2/alerts - Alert inventory + acknowledgement + routing coverage + MTTA
- GET https://api.{region}opsgenie.com/v2/alerts/count - Alert-volume counts for top-N + noise / baseline checks
- GET https://api.{region}opsgenie.com/v1/incidents - Incident inventory + impactedServices + MTTR (revenue-at-risk join)
- GET https://api.{region}opsgenie.com/v2/services - Service inventory + health state + open alert/incident counts