Splunk state means nothing to a merchant unless it’s joined to revenue. This audit answers four questions: (1) is the merchant’s stack healthy right now (services down or degraded, alerts firing, latency and error rate in band); (2) is the on-call rotation responding fast enough (MTTA / MTTR, open-incident backlog); (3) are we hitting SLA; and (4) when a commerce-path service is down or alerting, how much money is on fire per minute?

What this audit checks

Authentication & access

  • Observability API token valid (auth on /v2/organization)
  • Realm host correct for region (us0 / us1 / us2 / eu0 / eu1 / jp0 / au0)
  • Splunk On-Call API ID + key present when incidents & alerts cards are enabled
  • Token scope covers APM services, detectors, and SLO read
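
The first two checks can be sketched as below. This is a minimal illustration, not the audit's actual implementation: it assumes the token travels in the `X-SF-Token` header, the helper name `build_org_check` is hypothetical, and the request is only constructed, not sent.

```python
import urllib.request

# Realms listed in the check above.
KNOWN_REALMS = {"us0", "us1", "us2", "eu0", "eu1", "jp0", "au0"}


def build_org_check(realm: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the GET /v2/organization auth probe."""
    if realm not in KNOWN_REALMS:
        raise ValueError(f"unknown realm: {realm!r}")
    if not token:
        raise ValueError("missing Observability API token")
    url = f"https://api.{realm}.signalfx.com/v2/organization"
    # Assumption: the API token is passed in the X-SF-Token header.
    return urllib.request.Request(url, headers={"X-SF-Token": token}, method="GET")


req = build_org_check("eu0", "example-token")
print(req.full_url)
```

Passing the returned request to `urllib.request.urlopen` would perform the probe; an authentication error on this call would suggest the token or the realm is wrong.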

Performance & reliability

  • Apdex below 0.85 sustained
  • p95 latency above 1500ms sustained
  • p99 latency above 3000ms sustained
  • Error rate above 2% sustained
  • Throughput dropped > 30% WoW (capacity / outage signal)
  • SLA compliance below 99.5% over the reporting window

Service health & coverage (the blind-spot test)

  • Any service in DOWN state (active outage)
  • More than 2 services in DEGRADED state
  • Commerce-path services (checkout / cart / catalogue / search) without an active detector
  • Detectors in paused/draft state on commerce-path services (failures pass silently)
  • Alerting concentration on a single service (top-N alerting service == commerce path)
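
The blind-spot test amounts to a set difference between the commerce-path services and the services that have at least one active detector. A rough sketch, assuming a simplified detector record with hypothetical `service` and `state` keys (the real detector payload is shaped differently):

```python
COMMERCE_PATH = {"checkout", "cart", "catalogue", "search"}


def uncovered_commerce_services(detectors):
    """Return commerce-path services that have no active detector.

    `detectors` is a list of dicts with hypothetical keys `service`
    and `state` ("active", "paused", or "draft").
    """
    covered = {d["service"] for d in detectors if d["state"] == "active"}
    return sorted(COMMERCE_PATH - covered)


# Illustrative data only.
detectors = [
    {"service": "checkout", "state": "active"},
    {"service": "cart", "state": "paused"},
    {"service": "search", "state": "draft"},
]
print(uncovered_commerce_services(detectors))  # ['cart', 'catalogue', 'search']
```

Paused and draft detectors count as uncovered here, so a service guarded only by a paused detector is reported alongside one with no detector at all.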

Incident response (on-call economics)

  • Open incident backlog above 3 (rotation falling behind)
  • MTTA above 30 minutes (slow first response)
  • MTTR above 60 minutes (slow resolution)
  • Incidents acknowledged but unresolved > 24h (stuck triage)
  • Repeat incidents on the same service within 7 days (unfixed root cause)
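
As a sketch of how the MTTA and MTTR figures could be derived, the example below averages acknowledgement and resolution delays over a list of incidents. The record keys (`opened`, `acked`, `resolved`) are hypothetical, the sample timestamps are made up, and measuring MTTR from open to resolve is an assumption.

```python
from datetime import datetime
from statistics import mean


def minutes_between(start: str, end: str) -> float:
    """Elapsed minutes between two ISO-8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60


def mtta_mttr(incidents):
    """Return (MTTA, MTTR) in minutes; None where no incident qualifies."""
    acks = [minutes_between(i["opened"], i["acked"]) for i in incidents if i.get("acked")]
    fixes = [minutes_between(i["opened"], i["resolved"]) for i in incidents if i.get("resolved")]
    return (mean(acks) if acks else None, mean(fixes) if fixes else None)


# Illustrative data only; the third incident is still open.
incidents = [
    {"opened": "2024-05-01T10:00:00", "acked": "2024-05-01T10:10:00", "resolved": "2024-05-01T10:40:00"},
    {"opened": "2024-05-01T12:00:00", "acked": "2024-05-01T12:30:00", "resolved": "2024-05-01T13:20:00"},
    {"opened": "2024-05-01T15:00:00", "acked": None, "resolved": None},
]
mtta, mttr = mtta_mttr(incidents)
print(mtta, mttr)  # 20.0 60.0
```

Unacknowledged and unresolved incidents are excluded from the averages; they would instead count toward the open-incident backlog.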

Cross-channel: revenue-at-risk (the killer area)

  • Commerce-path service DOWN/DEGRADED with sibling commerce connector live = compute revenue lost (commerce.revenue_per_min × down_minutes × estimated_traffic_loss_pct; the $/min burn rate is the same product without down_minutes)
  • Alerts firing on the checkout service during peak commerce hours
  • Top alerting service maps to a commerce-path service (cart / checkout / catalogue / search)
  • Error-rate spike on a commerce service during a campaign push (sibling = google_ads / amazon_ads / klaviyo) - paying for traffic that can’t convert
  • SLA breach window overlapping a commerce traffic peak (lost-order estimate)
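
The revenue-at-risk product from the first bullet can be worked through as follows. This is a sketch under one stated assumption: `estimated_traffic_loss_pct` is taken to be a percentage between 0 and 100. The function name and the sample figures are illustrative.

```python
def revenue_at_risk(revenue_per_min: float, down_minutes: float,
                    traffic_loss_pct: float) -> float:
    """revenue_per_min x down_minutes x estimated traffic loss.

    Assumption: traffic_loss_pct is a percentage in the range 0-100.
    """
    if not 0 <= traffic_loss_pct <= 100:
        raise ValueError("traffic_loss_pct must be between 0 and 100")
    if down_minutes < 0:
        raise ValueError("down_minutes must not be negative")
    return revenue_per_min * down_minutes * (traffic_loss_pct / 100)


# Illustrative figures: $120/min of revenue, 15 minutes down, half the traffic lost.
print(revenue_at_risk(120.0, 15, 50))  # 900.0
```

With `down_minutes` set to 1, the same function gives the per-minute burn rate, here $60 per minute.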

Severity thresholds

Signal | Warn | Critical
apdex | 0.9 | 0.85
error_rate_pct | 1 | 2
p95_latency_ms | 1000 | 1500
p99_latency_ms | 2000 | 3000
avg_response_ms | 500 | 1000
throughput_change_pct_wow | -15 | -30
sla_compliance_pct | 99.9 | 99.5
services_degraded_count | 1 | 2
services_down_count | 0 | 1
incidents_open_count | 2 | 3
mtta_minutes | 15 | 30
mttr_minutes | 30 | 60
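
One way to apply these thresholds is sketched below for a subset of the metric signals. Two things are assumptions rather than documented behaviour: the direction of each signal is inferred from whether the critical value sits below or above the warn value, and comparisons are strict, following the "below" / "above" wording of the checks. The count signals are left out because the table does not say whether their boundaries are inclusive.

```python
# (warn, critical) pairs copied from the severity table; subset only.
THRESHOLDS = {
    "apdex": (0.9, 0.85),
    "error_rate_pct": (1, 2),
    "p95_latency_ms": (1000, 1500),
    "throughput_change_pct_wow": (-15, -30),
    "sla_compliance_pct": (99.9, 99.5),
}


def severity(signal: str, value: float) -> str:
    """Classify a reading as 'ok', 'warn', or 'critical'."""
    warn, critical = THRESHOLDS[signal]
    if critical < warn:
        # Lower is worse (apdex, SLA compliance, throughput change).
        if value < critical:
            return "critical"
        return "warn" if value < warn else "ok"
    # Higher is worse (error rate, latency).
    if value > critical:
        return "critical"
    return "warn" if value > warn else "ok"


print(severity("apdex", 0.88))           # warn
print(severity("error_rate_pct", 2.5))   # critical
print(severity("p95_latency_ms", 800))   # ok
```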

Data sources

  • GET https://api.{realm}.signalfx.com/v2/organization - Auth + token sanity
  • GET https://api.{realm}.signalfx.com/v2/apm/services - APM service inventory + health states
  • POST https://api.{realm}.signalfx.com/v2/signalflow - Run metric programs for latency / error-rate / throughput / apdex checks
  • GET https://api.{realm}.signalfx.com/v2/detector - Detector inventory + firing/active state + notification coverage
  • GET https://api.{realm}.signalfx.com/v2/slo - SLO state + SLA compliance
  • GET https://api.{realm}.signalfx.com/api-public/v1/incidents - On-Call incident inventory (MTTA / MTTR / backlog + revenue-at-risk join)
  • GET https://api.{realm}.signalfx.com/api-public/v1/reporting/v2/metrics - Incident reporting metrics for MTTA / MTTR aggregates