What this audit checks
Authentication & access
- REST API token valid + has read scope (probe on /abilities)
- Region host correct (us → api.pagerduty.com, eu → api.eu.pagerduty.com)
- Every configured routing key resolves to an existing, active service
- Backup channel (Slack / Teams) configured so failures never silently un-page
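A minimal sketch of the token probe, assuming the standard PagerDuty REST headers (`Authorization: Token token=...` and the versioned `Accept` header). The `REGION_HOSTS` dict and both function names are illustrative, not part of the audit itself. The request is built separately from the network call so the region-host choice can be checked offline.

```python
import urllib.error
import urllib.request

# Region hosts as listed in the audit; the dict name is illustrative.
REGION_HOSTS = {"us": "api.pagerduty.com", "eu": "api.eu.pagerduty.com"}


def build_abilities_probe(region: str, api_token: str) -> urllib.request.Request:
    """Build (but do not send) a GET /abilities request for the given region."""
    host = REGION_HOSTS[region]  # an unknown region raises KeyError on purpose
    return urllib.request.Request(
        f"https://{host}/abilities",
        headers={
            "Authorization": f"Token token={api_token}",
            "Accept": "application/vnd.pagerduty+json;version=2",
        },
        method="GET",
    )


def token_is_valid(region: str, api_token: str, timeout: float = 10.0) -> bool:
    """Send the probe; treat 200 as valid and 401/403 as invalid or under-scoped."""
    request = build_abilities_probe(region, api_token)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code in (401, 403):
            return False
        raise  # other HTTP errors are not an auth verdict; let the caller decide
```

Pointing a token at the wrong region would typically surface here as an auth failure, which is why the host lookup and the probe sit together.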
Routing & escalation coverage (the blind-spot test)
- Services with a revoked or rotated Events API v2 routing key (pages go nowhere)
- Vortex IQ severity tier (sev1 / sev2 / sev3) with no PagerDuty service mapped
- Sev1-mapped escalation policy without a 24/7 always-on rota
- Escalation policy with zero escalation steps (no fallback responder)
- On-call schedule with an uncovered gap in the next 24h
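The schedule-gap check can be sketched as an interval sweep. This assumes the on-call entries have already been extracted from the schedule response as `(start, end)` datetime pairs; the function name `find_gaps` and that input shape are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_gaps(entries, window_start, window_end):
    """Return (start, end) spans inside the window that no on-call entry covers."""
    gaps = []
    cursor = window_start  # everything before the cursor is known to be covered
    for start, end in sorted(entries):
        if end <= cursor:
            continue  # entry lies entirely in already-covered time
        if start >= window_end:
            break  # entry begins after the window closes
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
        if cursor >= window_end:
            break
    if cursor < window_end:
        gaps.append((cursor, window_end))
    return gaps


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
entries = [
    (now, now + timedelta(hours=8)),
    (now + timedelta(hours=10), now + timedelta(hours=30)),
]
gaps = find_gaps(entries, now, now + timedelta(hours=24))
```

In this example the only uncovered span is the two hours between hour 8 and hour 10, so the audit would report one gap.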
Submission pipeline reliability
- Event submission success rate below 99% (events being dropped)
- Median submission latency above 5000ms (page delayed)
- Retried submissions (429 / 5xx) more than 2σ above the 30-day baseline
- Fail-open audit-logged events > 0 in last 24h (events the API could not accept)
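One way to read the retry check is as a mean-plus-two-standard-deviations threshold over daily retry counts. The sketch below assumes the baseline is a list of daily counts and uses the sample standard deviation; both choices, and the function name, are assumptions rather than the audit's documented method.

```python
from statistics import mean, stdev


def retries_are_anomalous(baseline_daily_retries, todays_retries, sigmas=2.0):
    """Flag today's retry count if it exceeds baseline mean + sigmas * stdev."""
    if len(baseline_daily_retries) < 2:
        raise ValueError("need at least two baseline days to estimate spread")
    threshold = mean(baseline_daily_retries) + sigmas * stdev(baseline_daily_retries)
    return todays_retries > threshold


# Illustrative 30-day baseline of daily retry counts (made-up numbers).
baseline = [10, 12, 11, 9, 13] * 6
flagged = retries_are_anomalous(baseline, 14)
```

A flat baseline (zero spread) makes any increase count as anomalous, so a minimum absolute floor may be worth adding in practice.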
Response performance
- MTTA above 15 min (sev1 above 5 min)
- MTTR above 4h
- Escalation rate above 30% (first-line overloaded / under-staffed)
- Incident volume on any service more than 2× the average across services (noisy surface)
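The noisy-service check could be sketched as follows, assuming per-service incident counts are already tallied into a dict and that the average is the plain mean over all services, including the noisy one. The function name is hypothetical.

```python
from statistics import mean


def noisy_services(incident_counts, factor=2.0):
    """Return service names whose incident count exceeds factor x the mean count."""
    if not incident_counts:
        return []
    threshold = factor * mean(incident_counts.values())
    return sorted(name for name, count in incident_counts.items() if count > threshold)


# Made-up counts for illustration: the mean is 20, so the threshold is 40.
counts = {"checkout": 50, "search": 10, "auth": 12, "cart": 8}
noisy = noisy_services(counts)
```

Because the noisy service inflates the mean it is compared against, a median-based threshold may be a more robust alternative.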
Back-sync integration health
- Webhook back-sync lag above 300s (stale Vortex IQ incident timeline)
- Webhook subscription with a failed last delivery
- Webhook subscription toggled inactive (state changes stop flowing back silently)
Cross-channel: revenue-at-risk paging (the killer area)
- Open sev1 incident with a sibling commerce connector live = compute cumulative $ lost (commerce.revenue_per_min × incident_minutes × estimated_traffic_loss_pct)
- Sev1 page un-acknowledged > 5 min during the peak trading window (worst-case missed page)
- Incident on a service whose name matches the commerce checkout / payment path (revenue-critical)
- Routing key revoked on the sev1 commerce-paging service (a silent un-page on the highest-value path)
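The revenue formula in the first bullet of this list can be worked through directly. The sketch assumes `estimated_traffic_loss_pct` is expressed on a 0 to 100 scale and therefore divides by 100; if the connector reports a fraction instead, drop that division.

```python
def revenue_at_risk(revenue_per_min, incident_minutes, estimated_traffic_loss_pct):
    """Cumulative revenue lost over the incident so far.

    Assumption: estimated_traffic_loss_pct is on a 0-100 scale.
    """
    return revenue_per_min * incident_minutes * (estimated_traffic_loss_pct / 100.0)


# Illustrative inputs: $200/min of revenue, 12 minutes open, 25% of traffic lost.
lost = revenue_at_risk(200.0, 12, 25)  # 600.0
```

Note that multiplying by `incident_minutes` yields a running total rather than a per-minute rate; the per-minute figure is the same product without that term.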
Severity thresholds
| Signal | Warn | Critical |
|---|---|---|
| event_success_rate_pct | 99.5 | 99 |
| submission_latency_ms | 2000 | 5000 |
| mtta_sec | 600 | 900 |
| mttr_sec | 7200 | 14400 |
| escalation_rate_pct | 20 | 30 |
| routing_key_health_pct | 99 | 95 |
| revoked_routing_key_count | 1 | 1 |
| unmapped_service_count | 1 | 1 |
| schedule_gap_count | 1 | 1 |
| webhook_backsync_lag_sec | 120 | 300 |
| webhook_failure_count | 1 | 1 |
| fail_open_event_count | 1 | 1 |
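A sketch of how the table might be applied. The comparison direction per signal is an assumption inferred from the checks above: rates where lower is worse use "below", latencies and timings use strictly "above", and counts use "at or above" so that a single occurrence already trips the threshold. Where warn and critical are equal, critical wins.

```python
# (warn, critical, mode) per signal; values from the table, modes are assumed.
THRESHOLDS = {
    "event_success_rate_pct": (99.5, 99.0, "lt"),
    "submission_latency_ms": (2000, 5000, "gt"),
    "mtta_sec": (600, 900, "gt"),
    "mttr_sec": (7200, 14400, "gt"),
    "escalation_rate_pct": (20, 30, "gt"),
    "routing_key_health_pct": (99, 95, "lt"),
    "revoked_routing_key_count": (1, 1, "ge"),
    "unmapped_service_count": (1, 1, "ge"),
    "schedule_gap_count": (1, 1, "ge"),
    "webhook_backsync_lag_sec": (120, 300, "gt"),
    "webhook_failure_count": (1, 1, "ge"),
    "fail_open_event_count": (1, 1, "ge"),
}

# How each mode decides whether a value breaches a limit.
BREACHES = {
    "lt": lambda value, limit: value < limit,
    "gt": lambda value, limit: value > limit,
    "ge": lambda value, limit: value >= limit,
}


def classify(signal, value):
    """Return 'critical', 'warn', or 'ok' for a signal reading."""
    warn, critical, mode = THRESHOLDS[signal]
    breached = BREACHES[mode]
    if breached(value, critical):  # checked first so critical wins ties
        return "critical"
    if breached(value, warn):
        return "warn"
    return "ok"
```

For example, `classify("event_success_rate_pct", 99.2)` falls between the two limits and returns `"warn"`, while `classify("schedule_gap_count", 1)` returns `"critical"` because warn and critical share the same value.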
Data sources
- GET https://api.{region_host}/abilities: Auth + scope sanity
- GET https://api.{region_host}/incidents: Incident inventory + ack/resolve timing + escalation count
- GET https://api.{region_host}/services: Service inventory + routing-key status + severity-tier mapping
- GET https://api.{region_host}/escalation_policies: Escalation policy depth + always-on rota presence
- GET https://api.{region_host}/schedules: On-call schedule coverage gaps
- GET https://api.{region_host}/webhook_subscriptions: Webhook delivery status + back-sync lag
- POST https://api.{region_host}/analytics/metrics/incidents/all: Aggregated MTTA / MTTR / escalation-rate