What this audit checks
Authentication & access
- Observability API token valid (auth on /v2/organization)
- Realm host correct for region (us0 / us1 / us2 / eu0 / eu1 / jp0 / au0)
- Splunk On-Call API ID + key present when incidents & alerts cards are enabled
- Token scope covers APM services, detectors, and SLO read
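The auth and realm checks can be sketched as below. This is a minimal illustration, not the audit's actual implementation: it assumes the token travels in the `X-SF-Token` header and that an HTTP 200 from `/v2/organization` means the token is valid. The names `KNOWN_REALMS`, `build_org_request`, and `token_is_valid` are hypothetical.

```python
import urllib.error
import urllib.request

# Realms named in the checklist; extend this set if your org uses another one.
KNOWN_REALMS = {"us0", "us1", "us2", "eu0", "eu1", "jp0", "au0"}


def build_org_request(realm: str, token: str) -> urllib.request.Request:
    """Build the GET /v2/organization request used as the auth probe."""
    if realm not in KNOWN_REALMS:
        raise ValueError(f"unknown realm: {realm!r}")
    url = f"https://api.{realm}.signalfx.com/v2/organization"
    return urllib.request.Request(url, headers={"X-SF-Token": token})


def token_is_valid(realm: str, token: str, timeout: float = 10.0) -> bool:
    """Return True on HTTP 200, False on 401/403; re-raise any other HTTP error."""
    request = build_org_request(realm, token)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code in (401, 403):
            return False  # assumed to mean a bad token or the wrong realm
        raise
```

Keeping request construction separate from the network call lets the realm check run without a live token.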
Performance & reliability
- Apdex below 0.85 sustained
- p95 latency above 1500ms sustained
- p99 latency above 3000ms sustained
- Error rate above 2% sustained
- Throughput dropped > 30% WoW (capacity / outage signal)
- SLA compliance below 99.5% over the reporting window
Service health & coverage (the blind-spot test)
- Any service in DOWN state (active outage)
- More than 2 services in DEGRADED state
- Commerce-path services (checkout / cart / catalogue / search) without an active detector
- Detectors in paused/draft state on commerce-path services (fails silently: no alert is sent)
- Alerting concentration on a single service (top-N alerting service == commerce path)
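The detector-coverage checks above reduce to set arithmetic over the detector inventory. The sketch below assumes each detector has already been flattened to a dict with a `service` name and a `state` of `active`, `paused`, or `draft`. That shape is hypothetical, not the raw `/v2/detector` response.

```python
# Commerce-path service names as listed in the checklist.
COMMERCE_PATH = {"checkout", "cart", "catalogue", "search"}


def coverage_gaps(detectors: list[dict]) -> dict[str, set[str]]:
    """Report commerce-path services with detector coverage problems.

    Each detector dict is assumed to look like
    {"service": "checkout", "state": "active"}.
    """
    active = {d["service"] for d in detectors if d["state"] == "active"}
    inactive = {d["service"] for d in detectors if d["state"] in ("paused", "draft")}
    return {
        # Commerce-path services with no active detector at all (the blind spot).
        "no_active_detector": COMMERCE_PATH - active,
        # Commerce-path services that have at least one paused or draft detector.
        "has_inactive_detector": COMMERCE_PATH & inactive,
    }
```

A service can appear in both sets, for example when its only detector is paused.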
Incident response (on-call economics)
- Open incident backlog above 3 (rotation falling behind)
- MTTA above 30 minutes (slow first response)
- MTTR above 60 minutes (slow resolution)
- Incidents acknowledged but unresolved > 24h (stuck triage)
- Repeat incidents on the same service within 7 days (unfixed root cause)
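The incident-response signals can be derived from incident timestamps roughly as follows. This is a sketch under stated assumptions: incidents are dicts with a `service`, an `opened` datetime, and optional `acked` and `resolved` datetimes (a hypothetical shape). MTTA is measured from open to acknowledgement, MTTR from open to resolution, and "repeat" means two incidents on one service opened within 7 days of each other.

```python
from datetime import datetime, timedelta
from statistics import mean


def minutes_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60.0


def incident_metrics(incidents: list[dict], now: datetime) -> dict:
    """Compute MTTA, MTTR, open backlog, stuck triage, and repeat services."""
    acked = [i for i in incidents if i.get("acked")]
    resolved = [i for i in incidents if i.get("resolved")]
    still_open = [i for i in incidents if not i.get("resolved")]
    # Acknowledged more than 24h ago but never resolved.
    stuck = [
        i for i in still_open
        if i.get("acked") and now - i["acked"] > timedelta(hours=24)
    ]
    # Group open times per service, in chronological order.
    opened_by_service: dict[str, list[datetime]] = {}
    for inc in sorted(incidents, key=lambda i: i["opened"]):
        opened_by_service.setdefault(inc["service"], []).append(inc["opened"])
    repeats = {
        service
        for service, times in opened_by_service.items()
        if any(b - a <= timedelta(days=7) for a, b in zip(times, times[1:]))
    }
    return {
        "mtta_minutes": mean([minutes_between(i["opened"], i["acked"]) for i in acked]) if acked else None,
        "mttr_minutes": mean([minutes_between(i["opened"], i["resolved"]) for i in resolved]) if resolved else None,
        "incidents_open_count": len(still_open),
        "stuck_count": len(stuck),
        "repeat_services": repeats,
    }
```

The resulting `mtta_minutes`, `mttr_minutes`, and `incidents_open_count` use the same names as the threshold table so they can be compared directly.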
Cross-channel: revenue-at-risk (the killer area)
- Commerce-path service DOWN/DEGRADED while a sibling commerce connector is live: compute revenue lost as commerce.revenue_per_min × down_minutes × estimated_traffic_loss_pct
- Alerts firing on the checkout service during peak commerce hours
- Top alerting service maps to a commerce-path service (cart / checkout / catalogue / search)
- Error-rate spike on a commerce service during a campaign push (sibling = google_ads / amazon_ads / klaviyo) - paying for traffic that can’t convert
- SLA breach window overlapping a commerce traffic peak (lost-order estimate)
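The revenue-at-risk product from the first bullet can be worked through as below. The sketch assumes `estimated_traffic_loss_pct` is expressed on a 0 to 100 scale, which the source does not specify; if the connector reports a fraction instead, drop the division by 100. The function name is hypothetical.

```python
def revenue_at_risk(
    revenue_per_min: float,
    down_minutes: float,
    estimated_traffic_loss_pct: float,
) -> float:
    """Estimated revenue lost during an outage or degradation window.

    Assumes estimated_traffic_loss_pct is a percentage in [0, 100].
    """
    if not 0 <= estimated_traffic_loss_pct <= 100:
        raise ValueError("estimated_traffic_loss_pct must be between 0 and 100")
    if down_minutes < 0 or revenue_per_min < 0:
        raise ValueError("revenue_per_min and down_minutes must be non-negative")
    return revenue_per_min * down_minutes * (estimated_traffic_loss_pct / 100.0)


# Example: $120/min of revenue, 30 minutes degraded, half of traffic lost.
loss = revenue_at_risk(120, 30, 50)  # 1800.0
```

A DOWN service would typically use a loss percentage near 100, while a DEGRADED one uses a partial estimate.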
Severity thresholds
| Signal | Warn | Critical |
|---|---|---|
| apdex | 0.9 | 0.85 |
| error_rate_pct | 1 | 2 |
| p95_latency_ms | 1000 | 1500 |
| p99_latency_ms | 2000 | 3000 |
| avg_response_ms | 500 | 1000 |
| throughput_change_pct_wow | -15 | -30 |
| sla_compliance_pct | 99.9 | 99.5 |
| services_degraded_count | 1 | 2 |
| services_down_count | 0 | 1 |
| incidents_open_count | 2 | 3 |
| mtta_minutes | 15 | 30 |
| mttr_minutes | 30 | 60 |
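One way to apply the table is sketched below. It relies on two assumptions the source does not state: the direction of each signal is inferred from the order of its two thresholds (when Critical is lower than Warn, lower values are worse), and comparisons are strict, matching the "below" / "above" wording in the checklist. `THRESHOLDS` and `classify` are hypothetical names.

```python
# (warn, critical) pairs copied from the severity table.
THRESHOLDS = {
    "apdex": (0.9, 0.85),
    "error_rate_pct": (1, 2),
    "p95_latency_ms": (1000, 1500),
    "p99_latency_ms": (2000, 3000),
    "avg_response_ms": (500, 1000),
    "throughput_change_pct_wow": (-15, -30),
    "sla_compliance_pct": (99.9, 99.5),
    "services_degraded_count": (1, 2),
    "services_down_count": (0, 1),
    "incidents_open_count": (2, 3),
    "mtta_minutes": (15, 30),
    "mttr_minutes": (30, 60),
}


def classify(signal: str, value: float) -> str:
    """Return 'critical', 'warn', or 'ok' for one signal value."""
    warn, critical = THRESHOLDS[signal]
    if critical < warn:
        # Lower is worse (apdex, SLA compliance, throughput change).
        if value < critical:
            return "critical"
        return "warn" if value < warn else "ok"
    # Higher is worse (latency, error rate, counts, MTTA/MTTR).
    if value > critical:
        return "critical"
    return "warn" if value > warn else "ok"
```

Under strict comparison a value exactly on a threshold does not trip it, so a single DOWN service classifies as `warn`; switch to inclusive comparisons if the audit should treat boundary values as breaches.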
Data sources
- GET https://api.{realm}.signalfx.com/v2/organization - Auth + token sanity
- GET https://api.{realm}.signalfx.com/v2/apm/services - APM service inventory + health states
- POST https://api.{realm}.signalfx.com/v2/signalflow - Run metric programs for latency / error-rate / throughput / apdex checks
- GET https://api.{realm}.signalfx.com/v2/detector - Detector inventory + firing/active state + notification coverage
- GET https://api.{realm}.signalfx.com/v2/slo - SLO state + SLA compliance
- GET https://api.{realm}.signalfx.com/api-public/v1/incidents - On-Call incident inventory (MTTA / MTTR / backlog + revenue-at-risk join)
- GET https://api.{realm}.signalfx.com/api-public/v1/reporting/v2/metrics - Incident reporting metrics for MTTA / MTTR aggregates