What this audit checks
Authentication & access
- Base URL reachable and bearer/basic credentials accepted (probe /api/v1/status/buildinfo)
- Alertmanager URL reachable when supplied (probe /api/v2/silences)
- Query API returns within the server’s max-concurrency budget (no sustained 503s)
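The reachability and credential probe above can be sketched with the standard library alone. This is a minimal illustration, not the audit's actual implementation: the helper names `build_probe` and `probe`, the 5-second timeout, and the `(ok, detail)` return shape are all assumptions.

```python
import base64
import urllib.error
import urllib.request

def build_probe(base_url, token=None, basic=None):
    """Build a GET request for the buildinfo endpoint with optional auth.

    `token` is a bearer token; `basic` is a (user, password) tuple.
    Both parameter names are hypothetical.
    """
    req = urllib.request.Request(base_url.rstrip("/") + "/api/v1/status/buildinfo")
    if token:
        req.add_header("Authorization", "Bearer " + token)
    elif basic:
        user, password = basic
        raw = f"{user}:{password}".encode()
        req.add_header("Authorization", "Basic " + base64.b64encode(raw).decode())
    return req

def probe(req, timeout=5.0):
    """Send the request and return (ok, detail).

    A 401/403 means the server is reachable but rejected the credentials;
    a 503 may indicate the query-concurrency limit is being hit.
    """
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:  # must precede URLError (its parent class)
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        return False, f"unreachable: {exc}"
```

Separating request construction from sending keeps the auth logic checkable without a live server; the same pattern would apply to the Alertmanager probe with a different path.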
Scrape coverage (the blind-spot test)
- Targets in ‘down’ health (up == 0) - Prometheus is blind to them
- Targets with stale last-scrape (>2× scrape_interval since last success)
- Targets with scrape duration approaching the scrape timeout
- Jobs with zero discovered targets (service-discovery drift)
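A rough sketch of the first three checks, run against the JSON returned by `/api/v1/targets`. It assumes each active target carries `health`, `lastScrape`, `lastScrapeDuration`, `scrapeInterval`, and `scrapeTimeout` (recent Prometheus versions expose these; older ones may not). Two further assumptions: `lastScrape` is treated as a proxy for the last successful scrape, and "approaching the timeout" is taken to mean 80% of it. The zero-target job check is omitted because it needs the configured job list as a second input.

```python
import re
from datetime import datetime, timezone

_UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400}

def parse_duration(text):
    """Parse simple Prometheus durations ('15s', '1m30s') into seconds."""
    return sum(int(n) * _UNITS[u] for n, u in re.findall(r"(\d+)(ms|s|m|h|d)", text))

def parse_ts(text):
    """Parse an RFC 3339 timestamp, dropping sub-second digits."""
    return datetime.fromisoformat(re.sub(r"\.\d+", "", text).replace("Z", "+00:00"))

def scrape_findings(payload, now, timeout_ratio=0.8):
    """Return (target, finding) pairs for down, stale, and slow targets."""
    findings = []
    for t in payload["data"]["activeTargets"]:
        name = f'{t["scrapePool"]}/{t["labels"].get("instance", "?")}'
        interval = parse_duration(t["scrapeInterval"])
        timeout = parse_duration(t["scrapeTimeout"])
        if t["health"] == "down":
            findings.append((name, "down"))
        if (now - parse_ts(t["lastScrape"])).total_seconds() > 2 * interval:
            findings.append((name, "stale"))
        if t["lastScrapeDuration"] >= timeout_ratio * timeout:
            findings.append((name, "slow"))
    return findings

# Hand-written sample payload (not real server output).
SAMPLE = {"data": {"activeTargets": [
    {"scrapePool": "a", "labels": {"instance": "h1:9100"}, "health": "down",
     "lastScrape": "2024-01-01T00:04:55.123456789Z", "lastScrapeDuration": 0.1,
     "scrapeInterval": "15s", "scrapeTimeout": "10s"},
    {"scrapePool": "b", "labels": {"instance": "h2:9100"}, "health": "up",
     "lastScrape": "2024-01-01T00:00:00Z", "lastScrapeDuration": 9.5,
     "scrapeInterval": "15s", "scrapeTimeout": "10s"},
]}}
NOW = datetime(2024, 1, 1, 0, 5, 0, tzinfo=timezone.utc)
```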
Performance & reliability
- Apdex below 0.85 sustained over the window
- Error rate above 2% (rate of 5xx / total requests)
- p95 latency above 1500ms sustained
- p99 latency above 3000ms sustained
- Throughput dropped > 30% vs prior period (capacity / outage signal)
- SLA compliance (avg_over_time(up)) below 99.5%
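For illustration, the signals above might map to PromQL along these lines. The metric names (`http_requests_total`, `http_request_duration_seconds_bucket`), the `code` label, and the 5m / 1h / 7d windows are assumptions; a real deployment would substitute its own instrumentation. Apdex is left out because it depends on which histogram buckets match the chosen satisfied and tolerated latencies.

```python
from urllib.parse import urlencode

# Hypothetical queries; every metric and label name here is an assumption.
QUERIES = {
    "error_rate_pct":
        '100 * sum(rate(http_requests_total{code=~"5.."}[5m]))'
        ' / sum(rate(http_requests_total[5m]))',
    "p95_latency_ms":
        '1000 * histogram_quantile(0.95, sum by (le)'
        ' (rate(http_request_duration_seconds_bucket[5m])))',
    "p99_latency_ms":
        '1000 * histogram_quantile(0.99, sum by (le)'
        ' (rate(http_request_duration_seconds_bucket[5m])))',
    "throughput_change_pct":
        '100 * (sum(rate(http_requests_total[1h]))'
        ' / sum(rate(http_requests_total[1h] offset 1d)) - 1)',
    "sla_compliance_pct": '100 * avg(avg_over_time(up[7d]))',
}

def instant_query_url(base_url, signal):
    """Build the instant-query URL for one named signal."""
    query = urlencode({"query": QUERIES[signal]})
    return base_url.rstrip("/") + "/api/v1/query?" + query
```

Each query returns a single value that can be compared directly against the matching row of the severity thresholds.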
Alert hygiene & routing
- Alerts firing with no matching Alertmanager receiver (fires silently)
- Alert rules flapping (firing→inactive→firing) > 3× in 24h
- Active silences with no expiry or expiry > 7d (over-broad suppression)
- Critical-severity alerts firing > 30 min without acknowledgement
- Recording/alerting rule groups with evaluation errors
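Two of these checks are easy to sketch in isolation. The example below assumes silences arrive as the list returned by `/api/v2/silences` (with `id`, `status.state`, and `endsAt` fields), that "expiry > 7d" is measured from the time of the audit, and that flapping is counted from a sequence of sampled rule states. How those states are sampled is left open; the function names are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

MAX_SILENCE = timedelta(days=7)

def overbroad_silences(silences, now):
    """Flag active silences with no expiry or an expiry more than 7 days out."""
    flagged = []
    for s in silences:
        if s.get("status", {}).get("state") != "active":
            continue
        ends_at = s.get("endsAt")
        if not ends_at:
            flagged.append((s["id"], "no expiry"))
            continue
        # Drop sub-second digits so fromisoformat accepts the timestamp.
        expiry = datetime.fromisoformat(
            re.sub(r"\.\d+", "", ends_at).replace("Z", "+00:00"))
        if expiry - now > MAX_SILENCE:
            flagged.append((s["id"], "expiry > 7d"))
    return flagged

def flap_count(states):
    """Count firing -> inactive -> firing cycles in a chronological state list."""
    flaps, seen_firing, went_inactive = 0, False, False
    for state in states:
        if state == "firing":
            if seen_firing and went_inactive:
                flaps += 1
            seen_firing, went_inactive = True, False
        elif state == "inactive" and seen_firing:
            went_inactive = True
    return flaps

NOW = datetime(2024, 1, 1, tzinfo=timezone.utc)
```

A rule would be reported as flapping when `flap_count` over the last 24h of samples exceeds 3.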
Incident throughput
- Mean time to acknowledge trending up vs prior period
- Mean time to resolve above 1h (the critical band; the warn band starts at 30 min)
- Open incident count (active alert groups) above baseline
- Services concentrating the most firing alerts (noisy-service ranking)
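The noisy-service ranking can be approximated by counting active alerts per service label. This sketch assumes alerts come from `/api/v2/alerts` with `labels` and `status.state` fields, and that the service is carried in a label named `service`; that label name is an assumption and may differ per deployment.

```python
from collections import Counter

def noisy_services(alerts, label="service", top=5):
    """Rank services by number of currently active alerts, noisiest first."""
    counts = Counter(
        a["labels"].get(label, "(unlabelled)")
        for a in alerts
        if a.get("status", {}).get("state") == "active"
    )
    return counts.most_common(top)
```

Suppressed (silenced or inhibited) alerts are skipped here so that the ranking reflects what is actually paging; including them would be an equally defensible choice.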
Cross-channel: revenue-at-risk (the killer area)
- Firing critical/warning alert on a service that maps to a commerce sibling: compute estimated revenue lost in $ (commerce.revenue_per_min × firing_minutes × estimated_traffic_loss_pct)
- Service-down (up == 0) during a commerce sibling’s peak traffic window
- 5xx error-rate spike during a campaign push (sibling = google_ads / amazon_ads / klaviyo) - paying for traffic that can’t convert
- p95 latency breach on the checkout-path service during peak hours
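The revenue-at-risk formula from the first bullet works out as follows. This sketch assumes `estimated_traffic_loss_pct` is expressed on a 0-100 scale (the `_pct` suffix suggests so, but the source does not say); the input figures are invented for the example.

```python
def revenue_at_risk(revenue_per_min, firing_minutes, estimated_traffic_loss_pct):
    """Estimated revenue lost while an alert fires.

    Units: ($/min) x (min) x (fraction of traffic lost) = $.
    Assumes the loss percentage is on a 0-100 scale.
    """
    return revenue_per_min * firing_minutes * estimated_traffic_loss_pct / 100.0

# Hypothetical figures: $120/min, alert firing for 45 minutes, 30% traffic loss.
loss = revenue_at_risk(120, 45, 30)
print(f"${loss:,.2f}")  # $1,620.00
```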
Severity thresholds
| Signal | Warn | Critical |
|---|---|---|
| apdex | 0.9 | 0.85 |
| error_rate_pct | 1 | 2 |
| p95_latency_ms | 1000 | 1500 |
| p99_latency_ms | 1500 | 3000 |
| throughput_change_pct_vsP | -15 | -30 |
| sla_compliance_pct | 99.9 | 99.5 |
| targets_down_count | 1 | 3 |
| alerts_firing_no_receiver | 0 | 1 |
| mttr_ms | 1800000 | 3600000 |
| mtta_ms | 300000 | 900000 |
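One way to apply the table is to infer each signal's direction from the ordering of its two thresholds: when the critical value is lower than the warn value (apdex, SLA compliance, throughput change), lower readings are worse. The sketch below does that. The comparisons are strict ("above" / "below"), which is an assumption; count-style signals such as `targets_down_count` may be intended as inclusive.

```python
# (warn, critical) pairs copied from the table above.
THRESHOLDS = {
    "apdex": (0.9, 0.85),
    "error_rate_pct": (1, 2),
    "p95_latency_ms": (1000, 1500),
    "p99_latency_ms": (1500, 3000),
    "throughput_change_pct_vsP": (-15, -30),
    "sla_compliance_pct": (99.9, 99.5),
    "targets_down_count": (1, 3),
    "alerts_firing_no_receiver": (0, 1),
    "mttr_ms": (1800000, 3600000),
    "mtta_ms": (300000, 900000),
}

def classify(signal, value):
    """Return 'critical', 'warn', or 'ok' for one measured value."""
    warn, critical = THRESHOLDS[signal]
    lower_is_worse = critical < warn

    def breached(limit):
        # Strict comparison: a value exactly on the limit does not breach it.
        return value < limit if lower_is_worse else value > limit

    if breached(critical):
        return "critical"
    if breached(warn):
        return "warn"
    return "ok"
```

For example, `classify("apdex", 0.87)` yields `"warn"` and `classify("error_rate_pct", 2.5)` yields `"critical"`.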
Data sources
- GET {base_url}/api/v1/status/buildinfo - Auth + server reachability probe
- GET {base_url}/api/v1/query - Instant PromQL for apdex / error-rate / latency / throughput thresholds
- GET {base_url}/api/v1/query_range - Range PromQL for trend / SLA / MTTA / MTTR computation
- GET {base_url}/api/v1/targets - Scrape-target health + last-scrape freshness
- GET {base_url}/api/v1/rules - Rule-group inventory + evaluation errors
- GET {alertmanager_url}/api/v2/alerts - Firing-alert inventory + severity + service labels
- GET {alertmanager_url}/api/v2/alerts/groups - Active incident grouping (open-incident count)
- GET {alertmanager_url}/api/v2/silences - Silence hygiene + acknowledgement proxy