Grafana audit profile, Vortex IQ - Vortex IQ Help Centre

Grafana shows the merchant pretty dashboards; this audit turns them into decisions. It answers four questions: (1) is the stack healthy right now (apdex / p95 / error rate / services up), (2) are the alert rules that should be watching it actually wired up (silenced rules, rules with no contact point, NoData drifters), (3) are we keeping our SLA and resolving incidents fast enough (MTTA / MTTR), and (4) when a service is down, how much commerce revenue is on fire per minute?

What this audit checks

Authentication & access

Service-account token valid (auth on /api/user)
base_url reachable and is a Grafana instance
Token has read scope on alert rules + data sources
Org ID correct for multi-org instances
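As a rough illustration of the first and last checks, the sketch below builds the request for the token sanity call. It assumes the service-account token is sent as a Bearer token and that the org is selected with Grafana's `X-Grafana-Org-Id` header; the function name `build_user_request` and the example host are hypothetical.

```python
import urllib.request


def build_user_request(base_url, token, org_id=None):
    """Build (but do not send) the GET {base_url}/api/user request."""
    url = base_url.rstrip("/") + "/api/user"
    req = urllib.request.Request(url, method="GET")
    # Assumption: service-account tokens are passed as Bearer tokens.
    req.add_header("Authorization", "Bearer " + token)
    req.add_header("Accept", "application/json")
    if org_id is not None:
        # Assumption: multi-org instances pick the org via this header.
        req.add_header("X-Grafana-Org-Id", str(org_id))
    return req


req = build_user_request("https://grafana.example.com/", "abc", org_id=2)
print(req.full_url)
```

Sending the request with `urllib.request.urlopen(req, timeout=10)` would then separate an unreachable base_url (a `URLError`) from a rejected token (an `HTTPError`).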

Alert-rule coverage (the blind-spot test)

Alert rules with no contact point / notification policy (fires silently)
Alert rules left silenced past their silence window
Alert rules stuck in NoData state >24h (lost telemetry / broken query)
Alert rules in Error state (bad PromQL / datasource down)
Services with no latency or error-rate alert rule wired up
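A minimal sketch of how these coverage checks might be applied to a rule inventory. The input shape is hypothetical: each rule is assumed to be already normalised into a dict with `title`, `state`, `contact_point` and `hours_in_state` keys, which is not the raw Grafana API response.

```python
def blind_spots(rules, nodata_max_hours=24):
    """Return (rule title, finding) pairs for rules that would fail the audit."""
    findings = []
    for rule in rules:
        # A rule with no contact point fires silently.
        if not rule.get("contact_point"):
            findings.append((rule["title"], "no_contact_point"))
        # NoData for longer than the window suggests lost telemetry.
        if rule.get("state") == "NoData" and rule.get("hours_in_state", 0) > nodata_max_hours:
            findings.append((rule["title"], "stuck_nodata"))
        # Error state usually means a bad query or a data source that is down.
        if rule.get("state") == "Error":
            findings.append((rule["title"], "error_state"))
    return findings


rules = [
    {"title": "checkout p95", "state": "Normal", "contact_point": "pagerduty", "hours_in_state": 2},
    {"title": "cart errors", "state": "NoData", "contact_point": None, "hours_in_state": 30},
    {"title": "search apdex", "state": "Error", "contact_point": "slack", "hours_in_state": 1},
]
for title, finding in blind_spots(rules):
    print(title, finding)
```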

Reliability & performance

Apdex below 0.85 sustained
p95 latency above 1500ms sustained
Error rate above 2% sustained
Throughput dropped >30% WoW (capacity / outage signal)
Services in degraded or down state
SLA compliance below 99.5%
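The page does not say how long "sustained" is. One possible reading, sketched below, treats a breach as sustained once it holds for a chosen number of consecutive samples; the function name and the `min_consecutive` parameter are assumptions for illustration only.

```python
def sustained_breach(samples, limit, min_consecutive, below=False):
    """True if `samples` breach `limit` for at least `min_consecutive` points in a row.

    below=False flags values above the limit (latency, error rate);
    below=True flags values under the limit (apdex, SLA compliance).
    """
    run = 0
    for value in samples:
        breached = value < limit if below else value > limit
        run = run + 1 if breached else 0
        if run >= min_consecutive:
            return True
    return False


p95_ms = [1600, 1700, 1400, 1600, 1650, 1800]
print(sustained_breach(p95_ms, limit=1500, min_consecutive=3))
```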

Incident response

MTTA above 30 min (slow acknowledgement)
MTTR above 60 min (slow resolution)
Incidents open longer than 24h
Repeat incidents on the same service within 7d (unfixed root cause)
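For reference, MTTA and MTTR can be derived from incident timestamps roughly as follows. This is a sketch under stated assumptions: the field names `opened_at`, `acknowledged_at` and `resolved_at` are hypothetical, and MTTR is measured here from open to resolution, which the page does not specify.

```python
from datetime import datetime
from statistics import mean


def _seconds(start, end):
    """Elapsed seconds between two ISO 8601 timestamps."""
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds()


def mtta_mttr(incidents):
    """Return (mean time to acknowledge, mean time to resolve) in seconds."""
    mtta = mean(_seconds(i["opened_at"], i["acknowledged_at"]) for i in incidents)
    mttr = mean(_seconds(i["opened_at"], i["resolved_at"]) for i in incidents)
    return mtta, mttr


incidents = [
    {"opened_at": "2024-05-01T10:00:00", "acknowledged_at": "2024-05-01T10:10:00",
     "resolved_at": "2024-05-01T11:00:00"},
    {"opened_at": "2024-05-02T12:00:00", "acknowledged_at": "2024-05-02T12:30:00",
     "resolved_at": "2024-05-02T14:00:00"},
]
mtta, mttr = mtta_mttr(incidents)
print(mtta, mttr)
```

The results are in seconds, so they can be compared directly with the `mtta_sec` and `mttr_sec` thresholds.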

Cross-channel: revenue-at-risk (the killer area)

Service down while a sibling commerce connector is live: compute revenue lost (commerce.revenue_per_min × down_minutes × estimated_traffic_loss_pct)
Checkout-service degradation (p95 > 3s) during peak hours
Alert storm on a service during a campaign push (sibling = google_ads / amazon_ads / klaviyo) - paying for traffic that can’t convert
Conversion drop during incident windows (vs 90D baseline)
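The revenue-at-risk product in the first check above can be sketched as a one-line calculation. The example assumes `estimated_traffic_loss_pct` is expressed as a percentage (0 to 100), so it is divided by 100 before multiplying; the function name is hypothetical.

```python
def revenue_at_risk(revenue_per_min, down_minutes, estimated_traffic_loss_pct):
    """Estimated revenue lost during an outage.

    Assumption: estimated_traffic_loss_pct is a percentage (e.g. 50 for 50%).
    """
    return revenue_per_min * down_minutes * (estimated_traffic_loss_pct / 100.0)


# A store earning 40 per minute, down for 15 minutes, losing half its traffic.
print(revenue_at_risk(40.0, 15, 50))
```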

Severity thresholds

Signal	Warn	Critical
`apdex`	0.9	0.85
`error_rate_pct`	1	2
`p95_latency_ms`	1000	1500
`p99_latency_ms`	1500	3000
`throughput_change_pct_wow`	-15	-30
`sla_compliance_pct`	99.9	99.5
`alerts_firing_count`	1	5
`services_degraded_count`	1	2
`services_down_count`	0	1
`alert_rules_no_contact_count`	0	1
`alert_rules_nodata_count`	1	3
`mtta_sec`	900	1800
`mttr_sec`	1800	3600
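One way to read the table is sketched below for a handful of signals. The warn and critical values are copied from the table; the direction of each comparison ("below" for apdex, SLA and throughput change, "above" for the rest) and the use of strict comparisons are assumptions inferred from the checklist wording ("below 0.85", "above 1500ms").

```python
# (warn, critical, direction): values from the table, direction is inferred.
THRESHOLDS = {
    "apdex": (0.9, 0.85, "below"),
    "error_rate_pct": (1, 2, "above"),
    "p95_latency_ms": (1000, 1500, "above"),
    "throughput_change_pct_wow": (-15, -30, "below"),
    "sla_compliance_pct": (99.9, 99.5, "below"),
}


def classify(signal, value):
    """Return 'critical', 'warn' or 'ok' for one signal reading."""
    warn, critical, direction = THRESHOLDS[signal]
    if direction == "below":
        breached = lambda limit: value < limit
    else:
        breached = lambda limit: value > limit
    # Check the stricter threshold first so critical wins over warn.
    if breached(critical):
        return "critical"
    if breached(warn):
        return "warn"
    return "ok"


print(classify("apdex", 0.87))
print(classify("p95_latency_ms", 1600))
```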

Data sources

GET {base_url}/api/user - Auth + token sanity
GET {base_url}/api/alertmanager/grafana/api/v2/alerts - Firing / acknowledged alert instances + per-service counts
GET {base_url}/api/v1/provisioning/alert-rules - Alert-rule inventory + contact points + silence state
GET {base_url}/api/prometheus/grafana/api/v1/rules - Rule evaluation state (Alerting / NoData / Error)
POST {base_url}/api/ds/query - PromQL / LogQL for apdex, latency, error rate, throughput, top error types
GET {base_url}/api/datasources - Data-source inventory + health
GET {base_url}/api/v1/incidents - Incident inventory + MTTA/MTTR timings (Incident plan)
