Grafana shows the merchant pretty dashboards; this audit turns them into decisions. It answers four questions: (1) Is the stack healthy right now (Apdex, p95 latency, error rate, services up)? (2) Are the alert rules that should be watching it actually wired up (silenced rules, rules with no contact point, rules drifting into NoData)? (3) Are we meeting our SLA and acknowledging and resolving incidents fast enough (MTTA / MTTR)? (4) When a service is down, how much commerce revenue is on fire per minute?

What this audit checks

Authentication & access

  • Service-account token valid (auth on /api/user)
  • base_url reachable and is a Grafana instance
  • Token has read scope on alert rules + data sources
  • Org ID correct for multi-org instances
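The token and org checks above can be sketched as a small request builder. This is a minimal sketch, not the audit's actual implementation: the helper names (auth_headers, user_url, check_auth) are hypothetical, and it assumes the token is sent as a Bearer token with the X-Grafana-Org-Id header selecting the organisation.

```python
import json
import urllib.request
from typing import Optional


def auth_headers(token: str, org_id: Optional[int] = None) -> dict:
    """Build request headers for a Grafana service-account token (hypothetical helper)."""
    headers = {"Authorization": f"Bearer {token}", "Accept": "application/json"}
    if org_id is not None:
        # Assumption: X-Grafana-Org-Id picks the organisation on multi-org instances.
        headers["X-Grafana-Org-Id"] = str(org_id)
    return headers


def user_url(base_url: str) -> str:
    """Join base_url and /api/user without doubling the slash."""
    return base_url.rstrip("/") + "/api/user"


def check_auth(base_url: str, token: str, org_id: Optional[int] = None) -> dict:
    """Call /api/user and return the parsed JSON body.

    urllib raises HTTPError on 401/403 and URLError when base_url is unreachable.
    """
    req = urllib.request.Request(user_url(base_url), headers=auth_headers(token, org_id))
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
```

A caller would treat an HTTPError with status 401 or 403 as a failed token check and a URLError as an unreachable base_url.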

Alert-rule coverage (the blind-spot test)

  • Alert rules with no contact point / notification policy (fires silently)
  • Alert rules left silenced past their silence window
  • Alert rules stuck in NoData state >24h (lost telemetry / broken query)
  • Alert rules in Error state (bad PromQL / datasource down)
  • Services with no latency or error-rate alert rule wired up
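As a rough illustration of the NoData / Error checks, the sketch below tallies rules by health from a rules response. The response shape (data.groups[].rules[].health) and the health values used in the sample are assumptions modelled on the Prometheus-style rules format; rule_health_counts is a hypothetical helper. It does not cover the ">24h" duration test, which would need evaluation timestamps.

```python
from collections import Counter


def rule_health_counts(rules_response: dict) -> Counter:
    """Count alert rules per health value (e.g. ok / nodata / error).

    Assumes a Prometheus-style payload: {"data": {"groups": [{"rules": [...]}]}}.
    Rules without a health field are counted as "unknown".
    """
    counts = Counter()
    for group in rules_response.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            counts[str(rule.get("health", "unknown")).lower()] += 1
    return counts


# Illustrative payload only; rule names and values are made up.
sample = {
    "data": {
        "groups": [
            {"name": "checkout", "rules": [
                {"name": "checkout p95 latency", "health": "ok"},
                {"name": "checkout error rate", "health": "nodata"},
            ]},
            {"name": "search", "rules": [
                {"name": "search error rate", "health": "error"},
            ]},
        ]
    }
}

counts = rule_health_counts(sample)
print(dict(counts))
```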

Reliability & performance

  • Apdex below 0.85 sustained
  • p95 latency above 1500ms sustained
  • Error rate above 2% sustained
  • Throughput dropped >30% WoW (capacity / outage signal)
  • Services in degraded or down state
  • SLA compliance below 99.5%
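For reference, Apdex is conventionally computed as (satisfied + tolerating / 2) / total, where "satisfied" requests finish within a target time T and "tolerating" requests finish within 4T. The sketch below applies that standard formula; the request counts are invented for illustration, and how this audit derives the counts from Grafana is not stated in the source.

```python
def apdex(satisfied: int, tolerating: int, total: int) -> float:
    """Standard Apdex score: satisfied requests count fully, tolerating ones count half."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (satisfied + tolerating / 2) / total


# 800 satisfied + 100 tolerating out of 1000 requests sits exactly on the 0.85 line.
score = apdex(satisfied=800, tolerating=100, total=1000)
print(score)  # 0.85
```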

Incident response

  • MTTA above 30 min (slow acknowledgement)
  • MTTR above 60 min (slow resolution)
  • Incidents open longer than 24h
  • Repeat incidents on the same service within 7d (unfixed root cause)
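MTTA and MTTR are means over incident timings, which can be sketched as below. The field names (opened_at, acknowledged_at, resolved_at) are hypothetical and not taken from the incidents API; incidents that have not yet reached the end state are skipped rather than counted as zero.

```python
from datetime import datetime
from statistics import mean
from typing import Optional


def mean_seconds(incidents: list, start_key: str, end_key: str) -> Optional[float]:
    """Mean elapsed seconds between two timestamps, ignoring incidents missing the end one."""
    deltas = [
        (inc[end_key] - inc[start_key]).total_seconds()
        for inc in incidents
        if inc.get(end_key) is not None
    ]
    return mean(deltas) if deltas else None


# Illustrative incidents only.
incidents = [
    {"opened_at": datetime(2024, 1, 1, 10, 0),
     "acknowledged_at": datetime(2024, 1, 1, 10, 10),
     "resolved_at": datetime(2024, 1, 1, 11, 0)},
    {"opened_at": datetime(2024, 1, 2, 12, 0),
     "acknowledged_at": datetime(2024, 1, 2, 12, 50),
     "resolved_at": datetime(2024, 1, 2, 13, 0)},
]

mtta_sec = mean_seconds(incidents, "opened_at", "acknowledged_at")
mttr_sec = mean_seconds(incidents, "opened_at", "resolved_at")
print(mtta_sec, mttr_sec)  # 1800.0 3600.0
```

In this sample the MTTA of 1800 s sits on the 30-minute line and the MTTR of 3600 s on the 60-minute line.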

Cross-channel: revenue-at-risk (the killer area)

  • Service down with sibling commerce connector live = compute $/min lost (commerce.revenue_per_min × down_minutes × estimated_traffic_loss_pct)
  • Checkout-service degradation (p95 > 3s) during peak hours
  • Alert storm on a service during a campaign push (sibling = google_ads / amazon_ads / klaviyo) - paying for traffic that can’t convert
  • Conversion drop during incident windows (vs 90D baseline)
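The $/min formula in the first bullet can be worked through as follows. This sketch assumes estimated_traffic_loss_pct is expressed on a 0-100 scale and so divides it by 100; the source does not say which scale is used. The input numbers are invented.

```python
def revenue_at_risk(revenue_per_min: float, down_minutes: float,
                    estimated_traffic_loss_pct: float) -> float:
    """revenue_per_min x down_minutes x traffic-loss fraction.

    Assumption: estimated_traffic_loss_pct is a percentage (0-100), not a fraction.
    """
    if not 0 <= estimated_traffic_loss_pct <= 100:
        raise ValueError("estimated_traffic_loss_pct must be between 0 and 100")
    return revenue_per_min * down_minutes * (estimated_traffic_loss_pct / 100)


# $120/min of commerce revenue, 15 minutes down, half the traffic lost.
lost = revenue_at_risk(revenue_per_min=120, down_minutes=15, estimated_traffic_loss_pct=50)
print(lost)  # 900.0
```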

Severity thresholds

Signal                          Warn    Critical
apdex                           0.9     0.85
error_rate_pct                  1       2
p95_latency_ms                  1000    1500
p99_latency_ms                  1500    3000
throughput_change_pct_wow       -15     -30
sla_compliance_pct              99.9    99.5
alerts_firing_count             1       5
services_degraded_count         1       2
services_down_count             0       1
alert_rules_no_contact_count    0       1
alert_rules_nodata_count        1       3
mtta_sec                        900     1800
mttr_sec                        1800    3600
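One way to apply these thresholds is a small classifier that infers direction from the pair itself: when the critical value is lower than the warn value (apdex, sla_compliance_pct, throughput_change_pct_wow), lower readings are worse. The strict comparisons (a reading must go past the threshold, matching the "below" / "above" wording in the checks) are an assumption, and classify is a hypothetical helper.

```python
# signal: (warn, critical), copied from the thresholds table.
THRESHOLDS = {
    "apdex": (0.9, 0.85),
    "error_rate_pct": (1, 2),
    "p95_latency_ms": (1000, 1500),
    "p99_latency_ms": (1500, 3000),
    "throughput_change_pct_wow": (-15, -30),
    "sla_compliance_pct": (99.9, 99.5),
    "alerts_firing_count": (1, 5),
    "services_degraded_count": (1, 2),
    "services_down_count": (0, 1),
    "alert_rules_no_contact_count": (0, 1),
    "alert_rules_nodata_count": (1, 3),
    "mtta_sec": (900, 1800),
    "mttr_sec": (1800, 3600),
}


def classify(signal: str, value: float) -> str:
    """Return "ok", "warn", or "critical" for one reading of a signal."""
    warn, critical = THRESHOLDS[signal]
    if critical < warn:
        # Lower is worse, e.g. apdex or SLA compliance.
        if value < critical:
            return "critical"
        return "warn" if value < warn else "ok"
    # Higher is worse, e.g. latency or error rate.
    if value > critical:
        return "critical"
    return "warn" if value > warn else "ok"


print(classify("apdex", 0.87))            # warn
print(classify("p95_latency_ms", 1600))   # critical
```

With strict comparisons, a count signal whose warn value is 0 stays "ok" at zero and turns "warn" at the first occurrence.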

Data sources

  • GET {base_url}/api/user - Auth + token sanity
  • GET {base_url}/api/alertmanager/grafana/api/v2/alerts - Firing / acknowledged alert instances + per-service counts
  • GET {base_url}/api/v1/provisioning/alert-rules - Alert-rule inventory + contact points + silence state
  • GET {base_url}/api/prometheus/grafana/api/v1/rules - Rule evaluation state (Alerting / NoData / Error)
  • POST {base_url}/api/ds/query - PromQL / LogQL for apdex, latency, error rate, throughput, top error types
  • GET {base_url}/api/datasources - Data-source inventory + health
  • GET {base_url}/api/v1/incidents - Incident inventory + MTTA/MTTR timings (Incident plan)
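To illustrate the POST {base_url}/api/ds/query call, the sketch below builds a request body for a single PromQL query. The body layout (a queries list with refId, datasource uid, and expr, plus from / to) reflects a common shape for this endpoint but should be checked against the Grafana version in use. The metric name http_request_duration_seconds_bucket, the datasource UID, and the build_ds_query helper are all hypothetical.

```python
import json


def build_ds_query(datasource_uid: str, expr: str,
                   time_from: str = "now-1h", time_to: str = "now") -> dict:
    """Build a JSON-serialisable body for one PromQL query (assumed body shape)."""
    return {
        "queries": [
            {
                "refId": "A",
                "datasource": {"uid": datasource_uid},
                "expr": expr,
                "intervalMs": 60000,
                "maxDataPoints": 500,
            }
        ],
        "from": time_from,
        "to": time_to,
    }


# Hypothetical histogram metric; substitute whatever the stack actually exports.
p95_expr = (
    "histogram_quantile(0.95, "
    "sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
)
body = build_ds_query("prometheus-uid", p95_expr)
payload = json.dumps(body).encode("utf-8")  # bytes ready to send as the POST body
```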