# Grafana audit profile, Vortex IQ

> What the Vortex IQ Grafana health audit checks: alert coverage, reliability, and revenue-at-risk.

Grafana shows the merchant pretty dashboards; this audit turns them into decisions. It answers four questions: (1) is the stack healthy right now (Apdex, p95 latency, error rate, services up); (2) are the alert rules that should be watching it actually wired up (silenced rules, rules with no contact point, rules stuck in NoData); (3) are we meeting our SLA and acknowledging and resolving incidents fast enough (MTTA / MTTR); and (4) when a service *is* down, how much commerce revenue is being lost per minute?

## What this audit checks

### Authentication & access

* Service-account token is valid (authenticated call to `GET /api/user`)
* `base_url` is reachable and is a Grafana instance
* Token has read scope on alert rules + data sources
* Org ID correct for multi-org instances
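
As a rough illustration of the first and last checks, the sketch below builds (but does not send) the `GET /api/user` request. It assumes a bearer-style service-account token and uses Grafana's `X-Grafana-Org-Id` header to select the organisation. The function name, the example host, and the token value are hypothetical and not part of the Vortex IQ product.

```python
import urllib.request
from typing import Optional


def build_auth_check(base_url: str, token: str,
                     org_id: Optional[int] = None) -> urllib.request.Request:
    """Build the GET /api/user request used as a token sanity check.

    Hypothetical helper: it only constructs the request, it does not send it.
    """
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/json",
    }
    if org_id is not None:
        # Selects the organisation on multi-org Grafana instances.
        headers["X-Grafana-Org-Id"] = str(org_id)
    url = base_url.rstrip("/") + "/api/user"
    return urllib.request.Request(url, headers=headers, method="GET")


# Placeholder host and token, for illustration only.
req = build_auth_check("https://grafana.example.com/", "glsa_example_token", org_id=2)
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the actual check; a 401 or 403 response would suggest an invalid token or a missing scope.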

### Alert-rule coverage (the blind-spot test)

* Alert rules with no contact point / notification policy (fires silently)
* Alert rules left silenced past their silence window
* Alert rules stuck in NoData state >24h (lost telemetry / broken query)
* Alert rules in Error state (bad PromQL / datasource down)
* Services with no latency or error-rate alert rule wired up
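
The blind-spot checks above can be sketched as a single pass over a rule inventory. The rule shape used here (`title`, `service`, `kind`, `state`, `state_since`, `contact_point`, `silenced`, `silence_window_end`) is a hypothetical, already-normalised form, not the raw Grafana API payload. The sketch also assumes a service counts as uncovered only when it has neither a latency nor an error-rate rule.

```python
from datetime import datetime, timedelta, timezone


def find_blind_spots(rules, services, now):
    """Group alert rules into the blind-spot buckets (hypothetical rule shape)."""
    findings = {
        "no_contact_point": [],
        "silence_overrun": [],
        "nodata_over_24h": [],
        "error_state": [],
        "uncovered_services": [],
    }
    covered = set()
    for rule in rules:
        name = rule["title"]
        if not rule.get("contact_point"):
            findings["no_contact_point"].append(name)
        window_end = rule.get("silence_window_end")
        if rule.get("silenced") and window_end is not None and now > window_end:
            findings["silence_overrun"].append(name)
        if rule["state"] == "NoData" and now - rule["state_since"] > timedelta(hours=24):
            findings["nodata_over_24h"].append(name)
        if rule["state"] == "Error":
            findings["error_state"].append(name)
        if rule.get("kind") in ("latency", "error_rate"):
            covered.add(rule["service"])
    findings["uncovered_services"] = sorted(set(services) - covered)
    return findings


now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
rules = [
    {"title": "checkout p95", "service": "checkout", "kind": "latency",
     "state": "Normal", "state_since": now - timedelta(hours=1),
     "contact_point": "oncall", "silenced": False},
    {"title": "cart errors", "service": "cart", "kind": "error_rate",
     "state": "NoData", "state_since": now - timedelta(hours=30),
     "contact_point": None, "silenced": True,
     "silence_window_end": now - timedelta(hours=2)},
    {"title": "search disk", "service": "search", "kind": "disk",
     "state": "Error", "state_since": now - timedelta(hours=1),
     "contact_point": "oncall", "silenced": False},
]
report = find_blind_spots(rules, ["checkout", "cart", "search"], now)
print(report)
```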

### Reliability & performance

* Apdex below 0.85 sustained
* p95 latency above 1500ms sustained
* Error rate above 2% sustained
* Throughput dropped >30% week over week (capacity / outage signal)
* Services in degraded or down state
* SLA compliance below 99.5%

### Incident response

* MTTA above 30 min (slow acknowledgement)
* MTTR above 60 min (slow resolution)
* Incidents open longer than 24h
* Repeat incidents on the same service within 7d (unfixed root cause)
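
For reference, MTTA and MTTR are means over per-incident durations. The sketch below assumes both are measured from the moment the incident opened and uses hypothetical field names (`opened_at`, `acknowledged_at`, `resolved_at`). Incidents that have not yet been acknowledged or resolved are skipped rather than counted as zero.

```python
from datetime import datetime
from statistics import mean


def response_times(incidents):
    """Return mean time to acknowledge / resolve, in seconds (None if no data)."""
    ack = [(i["acknowledged_at"] - i["opened_at"]).total_seconds()
           for i in incidents if i.get("acknowledged_at")]
    res = [(i["resolved_at"] - i["opened_at"]).total_seconds()
           for i in incidents if i.get("resolved_at")]
    return {
        "mtta_sec": mean(ack) if ack else None,
        "mttr_sec": mean(res) if res else None,
    }


# Two made-up incidents, for illustration only.
incidents = [
    {"opened_at": datetime(2024, 1, 10, 10, 0),
     "acknowledged_at": datetime(2024, 1, 10, 10, 10),
     "resolved_at": datetime(2024, 1, 10, 10, 50)},
    {"opened_at": datetime(2024, 1, 10, 12, 0),
     "acknowledged_at": datetime(2024, 1, 10, 12, 50),
     "resolved_at": datetime(2024, 1, 10, 13, 30)},
]
stats = response_times(incidents)
print(stats)
```

With these sample values the mean acknowledgement time sits exactly at the 30-minute mark and so is not above it, while the mean resolution time of 70 minutes exceeds the 60-minute limit.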

### Cross-channel: revenue-at-risk (the killer area)

* Service down while a sibling commerce connector is live: compute \$/min lost as `commerce.revenue_per_min × down_minutes × estimated_traffic_loss_pct`
* Checkout-service degradation (p95 > 3s) during peak hours
* Alert storm on a service during a campaign push (sibling = `google_ads` / `amazon_ads` / `klaviyo`) - paying for traffic that can't convert
* Conversion drop during incident windows (vs 90-day baseline)
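
A worked example of the revenue-at-risk product from the first item follows. It assumes `estimated_traffic_loss_pct` is expressed as a percentage from 0 to 100 (the source does not say whether it is a percentage or a fraction), and the input figures are invented for illustration.

```python
def revenue_at_risk(revenue_per_min: float, down_minutes: float,
                    estimated_traffic_loss_pct: float) -> float:
    """Estimated revenue lost during an outage.

    Assumption: estimated_traffic_loss_pct is a percentage (0-100), so the
    product is divided by 100.
    """
    return revenue_per_min * down_minutes * estimated_traffic_loss_pct / 100


# Illustrative inputs: 120 per minute of revenue, 15 minutes down, 40% of traffic lost.
loss = revenue_at_risk(120.0, 15, 40)
print(f"{loss:.2f}")
```

If the connector reports the loss as a fraction instead (0.4 rather than 40), the division by 100 should be dropped.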

## Severity thresholds

| Signal                         | Warn | Critical |
| ------------------------------ | ---- | -------- |
| `apdex`                        | 0.9  | 0.85     |
| `error_rate_pct`               | 1    | 2        |
| `p95_latency_ms`               | 1000 | 1500     |
| `p99_latency_ms`               | 1500 | 3000     |
| `throughput_change_pct_wow`    | -15  | -30      |
| `sla_compliance_pct`           | 99.9 | 99.5     |
| `alerts_firing_count`          | 1    | 5        |
| `services_degraded_count`      | 1    | 2        |
| `services_down_count`          | 0    | 1        |
| `alert_rules_no_contact_count` | 0    | 1        |
| `alert_rules_nodata_count`     | 1    | 3        |
| `mtta_sec`                     | 900  | 1800     |
| `mttr_sec`                     | 1800 | 3600     |
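
The table does not state the direction of each threshold. One way to read it, sketched below, is to infer the direction from the ordering of the two values: when Critical is lower than Warn (as for `apdex`), lower is worse; otherwise higher is worse. The sketch also assumes strict comparisons, matching the "above" / "below" wording in the checklist. The `classify` function is hypothetical and is not the product's implementation.

```python
# (warn, critical) pairs copied from the severity table.
THRESHOLDS = {
    "apdex": (0.9, 0.85),
    "error_rate_pct": (1, 2),
    "p95_latency_ms": (1000, 1500),
    "p99_latency_ms": (1500, 3000),
    "throughput_change_pct_wow": (-15, -30),
    "sla_compliance_pct": (99.9, 99.5),
    "alerts_firing_count": (1, 5),
    "services_degraded_count": (1, 2),
    "services_down_count": (0, 1),
    "alert_rules_no_contact_count": (0, 1),
    "alert_rules_nodata_count": (1, 3),
    "mtta_sec": (900, 1800),
    "mttr_sec": (1800, 3600),
}


def classify(signal: str, value: float) -> str:
    """Return 'ok', 'warn' or 'critical' under the assumptions stated above."""
    warn, critical = THRESHOLDS[signal]
    lower_is_worse = critical < warn  # assumption: direction inferred from ordering

    def breached(threshold):
        return value < threshold if lower_is_worse else value > threshold

    if breached(critical):
        return "critical"
    if breached(warn):
        return "warn"
    return "ok"


for signal, value in [("apdex", 0.88), ("p95_latency_ms", 1600), ("services_down_count", 0)]:
    print(signal, value, classify(signal, value))
```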

## Data sources

* `GET {base_url}/api/user` - Auth + token sanity
* `GET {base_url}/api/alertmanager/grafana/api/v2/alerts` - Firing / acknowledged alert instances + per-service counts
* `GET {base_url}/api/v1/provisioning/alert-rules` - Alert-rule inventory + contact points + silence state
* `GET {base_url}/api/prometheus/grafana/api/v1/rules` - Rule evaluation state (Alerting / NoData / Error)
* `POST {base_url}/api/ds/query` - PromQL / LogQL for apdex, latency, error rate, throughput, top error types
* `GET {base_url}/api/datasources` - Data-source inventory + health
* `GET {base_url}/api/v1/incidents` - Incident inventory + MTTA/MTTR timings (Incident plan)
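
As an illustration of the `POST {base_url}/api/ds/query` call, the sketch below builds a request body for a p95 latency query against a Prometheus data source. The body shape (`queries`, `refId`, `datasource.uid`, `expr`, `from`, `to`) follows the commonly documented form of Grafana's query endpoint and may vary by version. The metric name `http_request_duration_seconds_bucket`, the data-source UID, and the function name are assumptions.

```python
import json


def build_p95_query(datasource_uid: str, window: str = "5m") -> dict:
    """Build a /api/ds/query body asking for p95 latency (hypothetical metric name)."""
    expr = (
        "histogram_quantile(0.95, "
        f"sum(rate(http_request_duration_seconds_bucket[{window}])) by (le))"
    )
    return {
        "from": "now-1h",
        "to": "now",
        "queries": [
            {
                "refId": "A",
                "datasource": {"uid": datasource_uid},
                "expr": expr,
            }
        ],
    }


body = build_p95_query("prometheus-uid")  # placeholder UID
print(json.dumps(body, indent=2))
```

A histogram-based query returns seconds for this metric, so the result would need multiplying by 1000 before comparison with `p95_latency_ms`.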