A paging surface is only worth anything if the page actually lands and someone moves. This audit answers four questions: (1) are the REST token and every per-service routing key valid and mapped, (2) does every Vortex IQ severity tier route to a service with a real escalation policy and an always-on rota, (3) are we acknowledging and resolving fast enough (MTTA / MTTR / escalation rate, submission success, back-sync lag), and (4) when a sev1 incident is open, how much commerce revenue is on fire for every minute it stays unresolved.

What this audit checks

Authentication & access

  • REST API token valid + has read scope (probe on /abilities)
  • Region host correct (us → api.pagerduty.com, eu → api.eu.pagerduty.com)
  • Every configured routing key resolves to an existing, active service
  • Backup channel (Slack / Teams) configured so failures never silently un-page
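
As an illustration of the token probe above, here is a minimal sketch that builds the GET /abilities request for a region and treats a 401 or 403 as an invalid token. The function names and the `REGION_HOSTS` mapping are hypothetical; the hosts come from the list above, and the `Authorization: Token token=...` header is the standard PagerDuty REST API form. The sketch only checks that the token is accepted, not which scopes it carries.

```python
import urllib.error
import urllib.request

# Hosts taken from the region list above; the dict itself is illustrative.
REGION_HOSTS = {"us": "api.pagerduty.com", "eu": "api.eu.pagerduty.com"}


def build_abilities_probe(region: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET /abilities request for the given region."""
    host = REGION_HOSTS[region]  # an unknown region raises KeyError on purpose
    return urllib.request.Request(
        f"https://{host}/abilities",
        headers={
            "Authorization": f"Token token={token}",
            "Accept": "application/vnd.pagerduty+json;version=2",
        },
        method="GET",
    )


def token_is_accepted(request: urllib.request.Request) -> bool:
    """Send the probe; 200 means accepted, 401/403 means rejected."""
    try:
        with urllib.request.urlopen(request, timeout=10) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code in (401, 403):
            return False
        raise  # anything else (429, 5xx) is not evidence about the token
```

Keeping request construction separate from sending makes the region and header logic checkable without a network call.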

Routing & escalation coverage (the blind-spot test)

  • Services with a revoked or rotated Events API v2 routing key (pages go nowhere)
  • Vortex IQ severity tier (sev1 / sev2 / sev3) with no PagerDuty service mapped
  • Sev1-mapped escalation policy without a 24/7 always-on rota
  • Escalation policy with zero escalation steps (no fallback responder)
  • On-call schedule with an uncovered gap in the next 24h
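
The schedule-gap check in the last bullet can be sketched as an interval sweep. This is an assumption about how such a check might work, not the audit's actual implementation: `find_coverage_gaps` is a hypothetical helper, and it expects on-call shifts already flattened into `(start, end)` datetime pairs.

```python
from datetime import datetime, timedelta, timezone


def find_coverage_gaps(shifts, window_start, window_end):
    """Return (start, end) spans inside the window that no shift covers."""
    gaps = []
    cursor = window_start  # everything before the cursor is known to be covered
    for start, end in sorted(shifts):
        if end <= cursor:
            continue  # shift ends before the point we still need to cover
        if start >= window_end:
            break  # shift starts after the window; later ones do too
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
        if cursor >= window_end:
            break
    if cursor < window_end:
        gaps.append((cursor, window_end))
    return gaps


# Example: a 24h window with nobody on call between 08:00 and 10:00.
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
shifts = [
    (t0, t0 + timedelta(hours=8)),
    (t0 + timedelta(hours=10), t0 + timedelta(hours=24)),
]
gaps = find_coverage_gaps(shifts, t0, t0 + timedelta(hours=24))
```

Any non-empty result would count toward schedule_gap_count.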

Submission pipeline reliability

  • Event submission success rate below 99% (events being dropped)
  • Median submission latency above 5000ms (page delayed)
  • Retried submissions (429 / 5xx) above 2σ vs 30D baseline
  • Fail-open audit-logged events > 0 in last 24h (events the API could not accept)
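
The "above 2σ vs 30D baseline" test for retried submissions can be read as a simple mean-plus-two-standard-deviations threshold. The sketch below assumes the baseline is a list of daily retry counts and uses the sample standard deviation; both choices, and the function name, are assumptions rather than documented behaviour.

```python
from statistics import mean, stdev


def retries_above_baseline(today_retries, baseline_daily_retries, sigmas=2.0):
    """Return (is_anomalous, threshold) for today's retry count.

    baseline_daily_retries: one retry count per day, e.g. the last 30 days.
    """
    if len(baseline_daily_retries) < 2:
        raise ValueError("need at least two baseline days to estimate spread")
    threshold = mean(baseline_daily_retries) + sigmas * stdev(baseline_daily_retries)
    return today_retries > threshold, threshold


# Example: a quiet 30-day baseline hovering around 11 retries per day.
baseline = [10, 12, 11, 9, 13] * 6
flagged, threshold = retries_above_baseline(30, baseline)
```

A perfectly flat baseline gives a spread of zero, so any increase would flag; a production check would likely want a minimum absolute floor as well.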

Response performance

  • MTTA above 15 min (sev1 above 5 min)
  • MTTR above 4h
  • Escalation rate above 30% (first-line overloaded / under-staffed)
  • Incident volume on any service above 2× the service average (noisy surface)
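
The noisy-surface check can be sketched as follows. It reads "the service average" as the mean incident count across all services in the audit window, which is an assumption; the function name and input shape are hypothetical.

```python
def noisy_services(incident_counts: dict[str, int], factor: float = 2.0) -> list[str]:
    """Return services whose incident count exceeds factor x the cross-service mean."""
    if not incident_counts:
        return []
    average = sum(incident_counts.values()) / len(incident_counts)
    return sorted(
        name for name, count in incident_counts.items() if count > factor * average
    )


# Example: the mean is 20 incidents, so only counts above 40 are flagged.
flagged = noisy_services({"checkout": 50, "search": 5, "email": 5})
```

Because one very noisy service pulls the mean up, a median-based variant may be more robust when only a handful of services exist.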

Back-sync integration health

  • Webhook back-sync lag above 300s (stale Vortex IQ incident timeline)
  • Webhook subscription with a failed last delivery
  • Webhook subscription toggled inactive (state changes stop flowing back silently)

Cross-channel: revenue-at-risk paging (the killer area)

  • Open sev1 incident with a sibling commerce connector live = compute revenue lost so far (commerce.revenue_per_min × incident_minutes × estimated_traffic_loss_pct); the per-minute burn rate is commerce.revenue_per_min × estimated_traffic_loss_pct
  • Sev1 page un-acknowledged > 5 min during the peak trading window (worst-case missed page)
  • Incident on a service whose name matches the commerce checkout / payment path (revenue-critical)
  • Routing key revoked on the sev1 commerce-paging service (a silent un-page on the highest-value path)
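
As a worked example of the revenue-at-risk product in the first bullet, the sketch below multiplies the three inputs. It assumes estimated_traffic_loss_pct is expressed on a 0 to 100 scale and that revenue_per_min comes from the sibling commerce connector; the function name and the sample figures are illustrative only.

```python
def revenue_lost(revenue_per_min: float, incident_minutes: float,
                 estimated_traffic_loss_pct: float) -> float:
    """Revenue lost so far for one open incident.

    estimated_traffic_loss_pct is assumed to be a percentage (0-100).
    """
    if not 0 <= estimated_traffic_loss_pct <= 100:
        raise ValueError("estimated_traffic_loss_pct must be between 0 and 100")
    return revenue_per_min * incident_minutes * (estimated_traffic_loss_pct / 100)


# Example: 120 per minute of revenue, open for 30 minutes, 40% of traffic lost.
# 120 * 30 * 0.40 = 1440 lost so far, burning at 120 * 0.40 = 48 per minute.
lost = revenue_lost(120.0, 30.0, 40.0)
```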

Severity thresholds

Signal                       Warn     Critical
event_success_rate_pct       99.5     99
submission_latency_ms        2000     5000
mtta_sec                     600      900
mttr_sec                     7200     14400
escalation_rate_pct          20       30
routing_key_health_pct       99       95
revoked_routing_key_count    1        1
unmapped_service_count       1        1
schedule_gap_count           1        1
webhook_backsync_lag_sec     120      300
webhook_failure_count        1        1
fail_open_event_count        1        1
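
One way to apply these thresholds is sketched below. Two things are assumptions rather than documented behaviour: the direction of each signal is inferred from whether the critical value sits below the warn value (so the two percentage-health signals are "lower is worse"), and the comparisons are inclusive, so a count of 1 is already critical. The `THRESHOLDS` dict and `classify` function are illustrative names.

```python
# (warn, critical) pairs copied from the thresholds table.
THRESHOLDS = {
    "event_success_rate_pct": (99.5, 99),
    "submission_latency_ms": (2000, 5000),
    "mtta_sec": (600, 900),
    "mttr_sec": (7200, 14400),
    "escalation_rate_pct": (20, 30),
    "routing_key_health_pct": (99, 95),
    "revoked_routing_key_count": (1, 1),
    "unmapped_service_count": (1, 1),
    "schedule_gap_count": (1, 1),
    "webhook_backsync_lag_sec": (120, 300),
    "webhook_failure_count": (1, 1),
    "fail_open_event_count": (1, 1),
}


def classify(signal: str, value: float) -> str:
    """Return 'ok', 'warn' or 'critical' for a signal reading."""
    warn, critical = THRESHOLDS[signal]
    if critical < warn:  # assumed: lower readings are worse (health percentages)
        if value <= critical:
            return "critical"
        if value <= warn:
            return "warn"
    else:  # assumed: higher readings are worse (latency, counts, durations)
        if value >= critical:
            return "critical"
        if value >= warn:
            return "warn"
    return "ok"
```

With equal warn and critical values, as on the count signals, the first occurrence goes straight to critical and there is no warn band.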

Data sources

  • GET https://api.{region_host}/abilities - Auth + scope sanity
  • GET https://api.{region_host}/incidents - Incident inventory + ack/resolve timing + escalation count
  • GET https://api.{region_host}/services - Service inventory + routing-key status + severity-tier mapping
  • GET https://api.{region_host}/escalation_policies - Escalation policy depth + always-on rota presence
  • GET https://api.{region_host}/schedules - On-call schedule coverage gaps
  • GET https://api.{region_host}/webhook_subscriptions - Webhook delivery status + back-sync lag
  • POST https://api.{region_host}/analytics/metrics/incidents/all - Aggregated MTTA / MTTR / escalation-rate