What this audit checks
Authentication & access
- Service-account token valid (auth on /api/user)
- base_url reachable and is a Grafana instance
- Token has read scope on alert rules + data sources
- Org ID correct for multi-org instances
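The token check above can be sketched as follows. This is a minimal illustration, not the audit's actual implementation: the helper names `build_user_check` and `classify_auth_status` are hypothetical, and the mapping from HTTP status to audit outcome is an assumption.

```python
import urllib.request


def build_user_check(base_url: str, token: str) -> urllib.request.Request:
    """Build the GET {base_url}/api/user request used as the token sanity check."""
    url = base_url.rstrip("/") + "/api/user"
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/json",
        },
        method="GET",
    )


def classify_auth_status(status_code: int) -> str:
    """Map an HTTP status code to an audit outcome (this mapping is an assumption)."""
    if status_code == 200:
        return "ok"
    if status_code == 401:
        return "invalid_token"
    if status_code == 403:
        return "insufficient_scope"
    return "unreachable_or_unexpected"
```

The request object can be passed to `urllib.request.urlopen`; the status code of the response (or of the raised `HTTPError`) then feeds `classify_auth_status`.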
Alert-rule coverage (the blind-spot test)
- Alert rules with no contact point / notification policy (fires silently)
- Alert rules left silenced past their silence window
- Alert rules stuck in NoData state >24h (lost telemetry / broken query)
- Alert rules in Error state (bad PromQL / datasource down)
- Services with no latency or error-rate alert rule wired up
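As a rough sketch of the NoData / Error checks, the snippet below groups rule names by evaluation health. It assumes the rules endpoint returns a Prometheus-style payload (`data.groups[].rules[]`) in which each rule carries a `name` and a `health` value such as `ok`, `nodata` or `error`; that shape is an assumption here and should be verified against the Grafana version in use. The function name `rules_by_health` is hypothetical.

```python
from collections import defaultdict


def rules_by_health(payload: dict) -> dict:
    """Group alert-rule names by their reported health value.

    Assumes a Prometheus-style payload: data.groups[].rules[] with
    "name" and "health" keys (assumed shape, not confirmed by this document).
    """
    buckets = defaultdict(list)
    for group in payload.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            health = str(rule.get("health", "unknown")).lower()
            buckets[health].append(rule.get("name", "<unnamed>"))
    return dict(buckets)
```

Rules landing in the `nodata` or `error` buckets are the candidates for the two checks above; the >24h condition needs an additional timestamp comparison that this sketch leaves out.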
Reliability & performance
- Apdex below 0.85 sustained
- p95 latency above 1500ms sustained
- Error rate above 2% sustained
- Throughput dropped >30% WoW (capacity / outage signal)
- Services in degraded or down state
- SLA compliance below 99.5%
Incident response
- MTTA above 30 min (slow acknowledgement)
- MTTR above 60 min (slow resolution)
- Incidents open longer than 24h
- Repeat incidents on the same service within 7d (unfixed root cause)
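The incident checks reduce to a few time-difference calculations. The sketch below assumes each incident is a dict with hypothetical keys `service`, `opened_at`, `acknowledged_at` and `resolved_at` (datetimes, or `None` when the step has not happened yet); the real incident payload may use different field names.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_seconds(incidents, start_key, end_key):
    """Mean elapsed seconds between two timestamps, skipping incidents missing the end."""
    spans = [
        (inc[end_key] - inc[start_key]).total_seconds()
        for inc in incidents
        if inc.get(end_key) is not None
    ]
    return mean(spans) if spans else None


def repeat_services(incidents, window=timedelta(days=7)):
    """Services with two consecutive incidents opened within `window` of each other."""
    opened_by_service = {}
    for inc in sorted(incidents, key=lambda i: i["opened_at"]):
        opened_by_service.setdefault(inc["service"], []).append(inc["opened_at"])
    return sorted(
        service
        for service, times in opened_by_service.items()
        if any(later - earlier <= window for earlier, later in zip(times, times[1:]))
    )
```

MTTA is then `mean_seconds(incidents, "opened_at", "acknowledged_at")` and MTTR is `mean_seconds(incidents, "opened_at", "resolved_at")`, to be compared against the 30-minute and 60-minute limits.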
Cross-channel: revenue-at-risk (the killer area)
- Service down while a sibling commerce connector is live: compute revenue lost (commerce.revenue_per_min × down_minutes × estimated_traffic_loss_pct)
- Checkout-service degradation (p95 > 3s) during peak hours
- Alert storm on a service during a campaign push (sibling = google_ads / amazon_ads / klaviyo) - paying for traffic that can’t convert
- Conversion drop during incident windows (vs 90D baseline)
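A worked example of the revenue-at-risk formula, under the assumption that `estimated_traffic_loss_pct` is expressed on a 0-100 scale (the source does not say whether it is a percentage or a fraction). The function name and the sample figures are illustrative only.

```python
def revenue_at_risk(revenue_per_min: float, down_minutes: float,
                    estimated_traffic_loss_pct: float) -> float:
    """Revenue lost during an outage.

    revenue_per_min * down_minutes * estimated_traffic_loss_pct,
    with the percentage assumed to be on a 0-100 scale.
    """
    return revenue_per_min * down_minutes * (estimated_traffic_loss_pct / 100.0)


# Illustrative figures: $120/min baseline, 45 minutes down, 60% of traffic lost.
loss = revenue_at_risk(120.0, 45, 60)
print(f"Estimated revenue lost: ${loss:,.2f}")
```

With these sample inputs the estimate is 120 × 45 × 0.60 = 3,240 dollars.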
Severity thresholds
| Signal | Warn | Critical |
|---|---|---|
| apdex | 0.9 | 0.85 |
| error_rate_pct | 1 | 2 |
| p95_latency_ms | 1000 | 1500 |
| p99_latency_ms | 1500 | 3000 |
| throughput_change_pct_wow | -15 | -30 |
| sla_compliance_pct | 99.9 | 99.5 |
| alerts_firing_count | 1 | 5 |
| services_degraded_count | 1 | 2 |
| services_down_count | 0 | 1 |
| alert_rules_no_contact_count | 0 | 1 |
| alert_rules_nodata_count | 1 | 3 |
| mtta_sec | 900 | 1800 |
| mttr_sec | 1800 | 3600 |
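One way to apply the thresholds table is sketched below. Two things are assumptions rather than facts from the table: the direction of each signal (whether lower or higher values are worse) is inferred from the signal names, and comparisons are strict, following the "below" / "above" wording of the checklist. An inclusive comparison is equally plausible and would shift the boundary cases.

```python
# signal: (warn, critical, direction). "low" means lower values are worse.
# Directions are inferred from the signal names (assumption).
THRESHOLDS = {
    "apdex": (0.9, 0.85, "low"),
    "error_rate_pct": (1, 2, "high"),
    "p95_latency_ms": (1000, 1500, "high"),
    "p99_latency_ms": (1500, 3000, "high"),
    "throughput_change_pct_wow": (-15, -30, "low"),
    "sla_compliance_pct": (99.9, 99.5, "low"),
    "alerts_firing_count": (1, 5, "high"),
    "services_degraded_count": (1, 2, "high"),
    "services_down_count": (0, 1, "high"),
    "alert_rules_no_contact_count": (0, 1, "high"),
    "alert_rules_nodata_count": (1, 3, "high"),
    "mtta_sec": (900, 1800, "high"),
    "mttr_sec": (1800, 3600, "high"),
}


def severity(signal: str, value: float) -> str:
    """Return "critical", "warn" or "ok" using strict comparisons (assumption)."""
    warn, critical, direction = THRESHOLDS[signal]

    def breached(limit: float) -> bool:
        return value < limit if direction == "low" else value > limit

    if breached(critical):
        return "critical"
    if breached(warn):
        return "warn"
    return "ok"
```

For example, `severity("apdex", 0.87)` yields `"warn"` and `severity("p95_latency_ms", 1600)` yields `"critical"`. Under the strict reading, a single down service rates `"warn"` and two rate `"critical"`.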
Data sources
- GET {base_url}/api/user - Auth + token sanity
- GET {base_url}/api/alertmanager/grafana/api/v2/alerts - Firing / acknowledged alert instances + per-service counts
- GET {base_url}/api/v1/provisioning/alert-rules - Alert-rule inventory + contact points + silence state
- GET {base_url}/api/prometheus/grafana/api/v1/rules - Rule evaluation state (Alerting / NoData / Error)
- POST {base_url}/api/ds/query - PromQL / LogQL for apdex, latency, error rate, throughput, top error types
- GET {base_url}/api/datasources - Data-source inventory + health
- GET {base_url}/api/v1/incidents - Incident inventory + MTTA/MTTR timings (Incident plan)