
# Prometheus audit profile, Vortex IQ

> What the Vortex IQ Prometheus health audit checks: scrape coverage, alert hygiene, and revenue-at-risk.


A self-hosted Prometheus stack is only as good as its scrape coverage and alert routing. This audit answers four questions: (1) is the merchant's stack healthy right now (Apdex, error rate, latency, throughput); (2) is Prometheus actually scraping everything it should (down targets, stale scrapes, missing rule groups); (3) is the alert pipeline trustworthy (alerts firing without a receiver, over-broad silences, flapping rules); and (4) when something is firing and a commerce sibling is also connected, how much commerce revenue is on fire per minute?

## What this audit checks

### Authentication & access

* Base URL reachable and bearer/basic credentials accepted (probe /api/v1/status/buildinfo)
* Alertmanager URL reachable when supplied (probe /api/v2/silences)
* Query API returns within the server's max-concurrency budget (no sustained 503s)

### Scrape coverage (the blind-spot test)

* Targets in 'down' health (up == 0) - Prometheus is blind to them
* Targets with stale last-scrape (>2× scrape\_interval since last success)
* Targets with scrape duration approaching the scrape timeout
* Jobs with zero discovered targets (service-discovery drift)
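
The first three checks above can be sketched against the `activeTargets` array returned by `/api/v1/targets`. This is a minimal illustration, not the audit's actual implementation. It assumes each target carries the `health`, `lastScrape`, `lastScrapeDuration`, `scrapeInterval`, `scrapeTimeout`, and `scrapeUrl` fields that recent Prometheus versions expose, it uses `lastScrape` as a proxy for "last success", and the `timeout_ratio` of 0.8 for "approaching the timeout" is a hypothetical value. The helper names are illustrative.

```python
import re
from datetime import datetime, timezone


def parse_duration_s(text: str) -> float:
    """Parse a single-unit duration such as '15s', '1m' or '500ms' (no compound forms)."""
    units = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}
    for suffix in ("ms", "s", "m", "h"):
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * units[suffix]
    raise ValueError(f"unsupported duration: {text!r}")


def parse_rfc3339(ts: str) -> datetime:
    """Parse an RFC 3339 timestamp, trimming nanoseconds to microseconds."""
    ts = ts.replace("Z", "+00:00")
    ts = re.sub(r"(\.\d{6})\d+", r"\1", ts)
    return datetime.fromisoformat(ts)


def classify_targets(active_targets, now, timeout_ratio=0.8):
    """Return scrape URLs grouped by finding: down, stale, slow."""
    findings = {"down": [], "stale": [], "slow": []}
    for target in active_targets:
        name = target["scrapeUrl"]
        interval = parse_duration_s(target["scrapeInterval"])
        timeout = parse_duration_s(target["scrapeTimeout"])
        if target["health"] == "down":
            findings["down"].append(name)
        # Stale: more than 2x the scrape interval since the last scrape.
        age = (now - parse_rfc3339(target["lastScrape"])).total_seconds()
        if age > 2 * interval:
            findings["stale"].append(name)
        # Slow: scrape duration within timeout_ratio of the timeout (assumed cutoff).
        if target["lastScrapeDuration"] >= timeout_ratio * timeout:
            findings["slow"].append(name)
    return findings
```

Jobs with zero discovered targets do not appear in `activeTargets` at all, so that check would need the configured job list from another source.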

### Performance & reliability

* Apdex below 0.85 sustained over the window
* Error rate above 2% (rate of 5xx / total requests)
* p95 latency above 1500ms sustained
* p99 latency above 3000ms sustained
* Throughput dropped > 30% vs prior period (capacity / outage signal)
* SLA compliance (avg\_over\_time(up)) below 99.5%

### Alert hygiene & routing

* Alerts firing with no matching Alertmanager receiver (fires silently)
* Alert rules flapping (firing→inactive→firing) > 3× in 24h
* Active silences with no expiry or expiry > 7d (over-broad suppression)
* Critical-severity alerts firing > 30 min without acknowledgement
* Recording/alerting rule groups with evaluation errors
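
Two of these checks lend themselves to a short sketch: counting flaps in a rule's state history, and flagging long-lived silences. The code below is an illustration under stated assumptions. The state history is a hypothetical chronological list of `"firing"` / `"inactive"` / `"pending"` strings (how that history is collected is not specified here), "expiry > 7d" is interpreted as a silence lasting more than seven days from `startsAt` to `endsAt`, and silence timestamps are assumed to be in a form `datetime.fromisoformat` accepts once a trailing `Z` is replaced.

```python
from datetime import datetime, timedelta


def count_flaps(states):
    """Count firing -> inactive -> firing cycles in a chronological state list."""
    flaps = 0
    seen_firing = False
    dropped = False
    for state in states:
        if state == "firing":
            if seen_firing and dropped:
                flaps += 1
            seen_firing = True
            dropped = False
        elif state == "inactive" and seen_firing:
            dropped = True
    return flaps


def overlong_silences(silences, max_days=7):
    """Return IDs of active silences whose duration exceeds max_days."""
    limit = timedelta(days=max_days)
    flagged = []
    for silence in silences:
        if silence["status"]["state"] != "active":
            continue
        start = datetime.fromisoformat(silence["startsAt"].replace("Z", "+00:00"))
        end = datetime.fromisoformat(silence["endsAt"].replace("Z", "+00:00"))
        if end - start > limit:
            flagged.append(silence["id"])
    return flagged
```

A rule would then be reported as flapping when `count_flaps(history) > 3` over a 24-hour history.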

### Incident throughput

* Mean time to acknowledge trending up vs prior period
* Mean time to resolve above the 1h warn band
* Open incident count (active alert groups) above baseline
* Services concentrating the most firing alerts (noisy-service ranking)
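
The noisy-service ranking in the last bullet can be sketched as a simple count over firing alerts. This is a minimal example, assuming alerts shaped like Alertmanager's `/api/v2/alerts` response (a `labels` map and a `status.state` field) and assuming the service name lives in a label called `service`; the actual label key used by the audit is not stated.

```python
from collections import Counter


def noisy_services(alerts, label="service", top=5):
    """Rank services by number of active alerts, noisiest first."""
    counts = Counter(
        alert["labels"].get(label, "unknown")
        for alert in alerts
        if alert.get("status", {}).get("state") == "active"
    )
    return counts.most_common(top)
```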

### Cross-channel: revenue-at-risk (the killer area)

* Firing critical/warning alert on a service that maps to a commerce sibling: estimate the revenue lost in \$ (commerce.revenue\_per\_min × firing\_minutes × estimated\_traffic\_loss\_pct)
* Service-down (up == 0) during a commerce sibling's peak traffic window
* 5xx error-rate spike during a campaign push (sibling = google\_ads / amazon\_ads / klaviyo) - paying for traffic that can't convert
* p95 latency breach on the checkout-path service during peak hours
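
As a worked example of the revenue-at-risk formula in the first bullet, the sketch below multiplies the three factors directly. It assumes `estimated_traffic_loss_pct` is expressed on a 0 to 100 scale, which the formula itself does not state; the function name is illustrative.

```python
def revenue_at_risk(revenue_per_min, firing_minutes, estimated_traffic_loss_pct):
    """Estimate revenue lost while an alert fires.

    estimated_traffic_loss_pct is assumed to be a percentage in [0, 100].
    """
    if not 0 <= estimated_traffic_loss_pct <= 100:
        raise ValueError("estimated_traffic_loss_pct must be between 0 and 100")
    return revenue_per_min * firing_minutes * (estimated_traffic_loss_pct / 100.0)


# A sibling earning $120/min, an alert firing for 30 minutes,
# and an estimated 25% traffic loss:
print(revenue_at_risk(120.0, 30, 25))  # 900.0
```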

## Severity thresholds

| Signal                      | Warn    | Critical |
| --------------------------- | ------- | -------- |
| `apdex`                     | 0.9     | 0.85     |
| `error_rate_pct`            | 1       | 2        |
| `p95_latency_ms`            | 1000    | 1500     |
| `p99_latency_ms`            | 1500    | 3000     |
| `throughput_change_pct_vsP` | -15     | -30      |
| `sla_compliance_pct`        | 99.9    | 99.5     |
| `targets_down_count`        | 1       | 3        |
| `alerts_firing_no_receiver` | 0       | 1        |
| `mttr_ms`                   | 1800000 | 3600000  |
| `mtta_ms`                   | 300000  | 900000   |
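
One way to read the table is as a per-signal classifier. The sketch below is an assumption-laden illustration: it infers the direction of each signal from the ordering of its two thresholds (critical below warn means lower values are worse), and it uses strict comparisons to match the "below" / "above" wording of the checks, since the table does not say whether a value exactly at a threshold counts as a breach.

```python
# (warn, critical) pairs copied from the severity table.
THRESHOLDS = {
    "apdex": (0.9, 0.85),
    "error_rate_pct": (1, 2),
    "p95_latency_ms": (1000, 1500),
    "p99_latency_ms": (1500, 3000),
    "throughput_change_pct_vsP": (-15, -30),
    "sla_compliance_pct": (99.9, 99.5),
    "targets_down_count": (1, 3),
    "alerts_firing_no_receiver": (0, 1),
    "mttr_ms": (1800000, 3600000),
    "mtta_ms": (300000, 900000),
}


def classify(signal, value):
    """Return 'critical', 'warn' or 'ok' for a signal value."""
    warn, critical = THRESHOLDS[signal]
    if critical < warn:
        # Lower is worse (apdex, SLA compliance, throughput change).
        if value < critical:
            return "critical"
        if value < warn:
            return "warn"
    else:
        # Higher is worse (latency, error rate, counts, MTTA/MTTR).
        if value > critical:
            return "critical"
        if value > warn:
            return "warn"
    return "ok"
```

For example, `classify("apdex", 0.88)` yields `"warn"` and `classify("p95_latency_ms", 1600)` yields `"critical"` under these assumptions.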

## Data sources

* `GET {base_url}/api/v1/status/buildinfo` - Auth + server reachability probe
* `GET {base_url}/api/v1/query` - Instant PromQL for apdex / error-rate / latency / throughput thresholds
* `GET {base_url}/api/v1/query_range` - Range PromQL for trend / SLA / MTTA / MTTR computation
* `GET {base_url}/api/v1/targets` - Scrape-target health + last-scrape freshness
* `GET {base_url}/api/v1/rules` - Rule-group inventory + evaluation errors
* `GET {alertmanager_url}/api/v2/alerts` - Firing-alert inventory + severity + service labels
* `GET {alertmanager_url}/api/v2/alerts/groups` - Active incident grouping (open-incident count)
* `GET {alertmanager_url}/api/v2/silences` - Silence hygiene + acknowledgement proxy
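
A request against any of these endpoints could be built as follows. This is a minimal sketch using only the Python standard library; it covers the bearer-token case only, and `build_request` and the example host are hypothetical names rather than part of Vortex IQ.

```python
import urllib.parse
import urllib.request


def build_request(base_url, path, token=None, params=None):
    """Build a GET request for a Prometheus or Alertmanager endpoint."""
    url = base_url.rstrip("/") + path
    if params:
        url += "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(url, method="GET")
    if token:
        # Bearer auth is assumed here; basic auth would need a different header.
        request.add_header("Authorization", f"Bearer {token}")
    return request


# Instant query for the `up` metric on a hypothetical host.
req = build_request(
    "https://prom.example.com/", "/api/v1/query", token="abc", params={"query": "up"}
)
print(req.full_url)  # https://prom.example.com/api/v1/query?query=up
```

The request can then be sent with `urllib.request.urlopen(req, timeout=10)` and the body decoded with `json.load`.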
