Splunk audit profile, Vortex IQ - Vortex IQ Help Centre

Splunk state means nothing to a merchant unless it’s joined to revenue. This audit answers four questions: (1) is the merchant’s stack healthy right now (services down or degraded, alerts firing, latency and error rate in band); (2) is the on-call rotation responding fast enough (MTTA / MTTR, open-incident backlog); (3) are we hitting SLA; and (4) when a commerce-path service is down or alerting, how much money is on fire per minute?

What this audit checks

Authentication & access

Observability API token valid (auth on /v2/organization)
Realm host correct for region (us0 / us1 / us2 / eu0 / eu1 / jp0 / au0)
Splunk On-Call API ID + key present when incidents & alerts cards are enabled
Token scope covers APM services, detectors, and SLO read
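The realm and token checks above can be sketched roughly as follows. This is a minimal illustration, not the audit's actual implementation: the function name `build_org_check` is hypothetical, and it assumes the token is passed in the `X-SF-TOKEN` header. The request is only built, not sent.

```python
import urllib.request

# Realms listed in the check above.
KNOWN_REALMS = {"us0", "us1", "us2", "eu0", "eu1", "jp0", "au0"}


def build_org_check(realm: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the auth sanity request for a realm."""
    if realm not in KNOWN_REALMS:
        raise ValueError(f"unknown realm: {realm!r}")
    if not token:
        raise ValueError("missing Observability API token")
    url = f"https://api.{realm}.signalfx.com/v2/organization"
    # Assumption: the access token travels in the X-SF-TOKEN header.
    return urllib.request.Request(url, headers={"X-SF-TOKEN": token}, method="GET")
```

Passing the result to `urllib.request.urlopen` would perform the call; a non-2xx response would then point to a bad token or the wrong realm host.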

Performance & reliability

Apdex below 0.85 sustained
p95 latency above 1500ms sustained
p99 latency above 3000ms sustained
Error rate above 2% sustained
Throughput down more than 30% week over week (capacity / outage signal)
SLA compliance below 99.5% over the reporting window

Service health & coverage (the blind-spot test)

Any service in DOWN state (active outage)
More than 2 services in DEGRADED state
Commerce-path services (checkout / cart / catalogue / search) without an active detector
Detectors in paused/draft state on commerce-path services (fail silently)
Alerting concentration on a single service (top-N alerting service == commerce path)
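The detector-coverage check for commerce-path services amounts to a set difference. The sketch below assumes a simplified detector record with hypothetical `service` and `state` keys; the real detector payload will look different and would need mapping first.

```python
# Commerce-path services named in the checks above.
COMMERCE_PATH = {"checkout", "cart", "catalogue", "search"}


def uncovered_services(detectors):
    """Return commerce-path services that have no active detector.

    `detectors` is an iterable of dicts with hypothetical keys
    'service' and 'state' (e.g. 'active', 'paused', 'draft').
    """
    covered = {d["service"] for d in detectors if d.get("state") == "active"}
    return sorted(COMMERCE_PATH - covered)
```

A paused or draft detector does not count as coverage here, so a service whose only detector is paused is reported alongside services with no detector at all.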

Incident response (on-call economics)

Open incident backlog above 3 (rotation falling behind)
MTTA above 30 minutes (slow first response)
MTTR above 60 minutes (slow resolution)
Incidents acknowledged but unresolved > 24h (stuck triage)
Repeat incidents on the same service within 7 days (unfixed root cause)
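As a rough illustration of how the on-call numbers above could be derived, the sketch below computes MTTA, MTTR, and the open backlog from a list of incidents. The field names (`triggered_at`, `acked_at`, `resolved_at`) are hypothetical, and it assumes both MTTA and MTTR are measured from the moment the incident was triggered.

```python
from datetime import datetime
from statistics import mean


def _minutes(start: str, end: str) -> float:
    """Minutes elapsed between two ISO-8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60


def response_metrics(incidents):
    """Summarise incidents; ack/resolve timestamps may be None."""
    acked = [_minutes(i["triggered_at"], i["acked_at"])
             for i in incidents if i.get("acked_at")]
    resolved = [_minutes(i["triggered_at"], i["resolved_at"])
                for i in incidents if i.get("resolved_at")]
    return {
        "mtta_minutes": mean(acked) if acked else None,
        "mttr_minutes": mean(resolved) if resolved else None,
        # Anything without a resolve time counts towards the backlog.
        "incidents_open_count": sum(1 for i in incidents if not i.get("resolved_at")),
    }
```

The returned keys match the signal names in the severity thresholds, so the results can be compared directly against the Warn and Critical values.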

Cross-channel: revenue-at-risk (the killer area)

Commerce-path service DOWN/DEGRADED while a sibling commerce connector is live: estimate revenue lost as commerce.revenue_per_min × down_minutes × estimated_traffic_loss_pct
Alerts firing on the checkout service during peak commerce hours
Top alerting service maps to a commerce-path service (cart / checkout / catalogue / search)
Error-rate spike on a commerce service during a campaign push (sibling = google_ads / amazon_ads / klaviyo) - paying for traffic that can’t convert
SLA breach window overlapping a commerce traffic peak (lost-order estimate)
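The revenue-at-risk product in the first check above can be worked through with a small function. This is a sketch under one stated assumption: `estimated_traffic_loss_pct` is a percentage from 0 to 100, which the source does not specify.

```python
def revenue_lost(revenue_per_min: float,
                 down_minutes: float,
                 estimated_traffic_loss_pct: float) -> float:
    """Estimate revenue lost while a commerce-path service is impaired.

    Assumption: estimated_traffic_loss_pct is on a 0-100 scale.
    """
    if not 0 <= estimated_traffic_loss_pct <= 100:
        raise ValueError("estimated_traffic_loss_pct must be between 0 and 100")
    return revenue_per_min * down_minutes * estimated_traffic_loss_pct / 100
```

For example, a store earning 120 per minute whose checkout is degraded for 15 minutes with an estimated 40% traffic loss gives `revenue_lost(120, 15, 40)`, that is 120 × 15 × 0.40 = 720.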

Severity thresholds

Signal	Warn	Critical
`apdex`	0.9	0.85
`error_rate_pct`	1	2
`p95_latency_ms`	1000	1500
`p99_latency_ms`	2000	3000
`avg_response_ms`	500	1000
`throughput_change_pct_wow`	-15	-30
`sla_compliance_pct`	99.9	99.5
`services_degraded_count`	1	2
`services_down_count`	0	1
`incidents_open_count`	2	3
`mtta_minutes`	15	30
`mttr_minutes`	30	60
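One way to apply the table is sketched below. It rests on two assumptions not stated in the source: the direction of each signal is inferred from whether Warn is greater or smaller than Critical, and comparisons are strict, following the "above" / "below" wording of the checks.

```python
# (warn, critical) pairs copied from the table above.
THRESHOLDS = {
    "apdex": (0.9, 0.85),
    "error_rate_pct": (1, 2),
    "p95_latency_ms": (1000, 1500),
    "p99_latency_ms": (2000, 3000),
    "avg_response_ms": (500, 1000),
    "throughput_change_pct_wow": (-15, -30),
    "sla_compliance_pct": (99.9, 99.5),
    "services_degraded_count": (1, 2),
    "services_down_count": (0, 1),
    "incidents_open_count": (2, 3),
    "mtta_minutes": (15, 30),
    "mttr_minutes": (30, 60),
}


def classify(signal: str, value: float) -> str:
    """Return 'ok', 'warn', or 'critical' for a signal reading."""
    warn, critical = THRESHOLDS[signal]
    if warn > critical:
        # Lower is worse (apdex, SLA compliance, throughput change).
        if value < critical:
            return "critical"
        if value < warn:
            return "warn"
    else:
        # Higher is worse (latency, error rate, counts, MTTA/MTTR).
        if value > critical:
            return "critical"
        if value > warn:
            return "warn"
    return "ok"
```

Under these assumptions `classify("apdex", 0.88)` returns `"warn"` and `classify("p95_latency_ms", 1600)` returns `"critical"`. Whether the bounds are meant to be inclusive is not stated in the table, so the comparison operators may need adjusting.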

Data sources

GET https://api.{realm}.signalfx.com/v2/organization - Auth + token sanity
GET https://api.{realm}.signalfx.com/v2/apm/services - APM service inventory + health states
POST https://api.{realm}.signalfx.com/v2/signalflow - Run metric programs for latency / error-rate / throughput / apdex checks
GET https://api.{realm}.signalfx.com/v2/detector - Detector inventory + firing/active state + notification coverage
GET https://api.{realm}.signalfx.com/v2/slo - SLO state + SLA compliance
GET https://api.{realm}.signalfx.com/api-public/v1/incidents - On-Call incident inventory (MTTA / MTTR / backlog + revenue-at-risk join)
GET https://api.{realm}.signalfx.com/api-public/v1/reporting/v2/metrics - Incident reporting metrics for MTTA / MTTR aggregates
