Prometheus audit profile, Vortex IQ - Vortex IQ Help Centre

A self-hosted Prometheus stack is only as good as its scrape coverage and alert routing. This audit answers four questions: (1) is the merchant’s stack healthy right now (apdex, error rate, latency, throughput); (2) is Prometheus actually scraping everything it should (down targets, stale scrapes, missing rule groups); (3) is the alert pipeline trustworthy (alerts firing without a receiver, over-broad silences, flapping rules); and (4) when something is firing and a commerce sibling is also connected, how much commerce revenue is at risk per minute?

What this audit checks

Authentication & access

Base URL reachable and bearer/basic credentials accepted (probe /api/v1/status/buildinfo)
Alertmanager URL reachable when supplied (probe /api/v2/silences)
Query API returns within the server’s max-concurrency budget (no sustained 503s)
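
The access probe above can be sketched as follows. This is a minimal illustration, not the audit's actual implementation: the function names and the outcome labels are hypothetical, and only the `/api/v1/status/buildinfo` path comes from the checks listed here.

```python
import base64
import urllib.request
from typing import Optional, Tuple


def build_probe_request(
    base_url: str,
    bearer_token: Optional[str] = None,
    basic_auth: Optional[Tuple[str, str]] = None,
) -> urllib.request.Request:
    """Build (but do not send) the GET request used as the auth probe."""
    url = base_url.rstrip("/") + "/api/v1/status/buildinfo"
    request = urllib.request.Request(url, method="GET")
    if bearer_token is not None:
        request.add_header("Authorization", f"Bearer {bearer_token}")
    elif basic_auth is not None:
        user, password = basic_auth
        encoded = base64.b64encode(f"{user}:{password}".encode()).decode()
        request.add_header("Authorization", f"Basic {encoded}")
    return request


def classify_probe_status(status_code: int) -> str:
    """Map an HTTP status code to a coarse outcome (labels are illustrative)."""
    if status_code == 200:
        return "ok"
    if status_code in (401, 403):
        return "credentials_rejected"
    if status_code == 503:
        return "overloaded"
    return "unreachable_or_error"
```

Passing the request to `urllib.request.urlopen` would perform the probe. Building it separately keeps the credential handling easy to inspect without touching the network.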

Scrape coverage (the blind-spot test)

Targets in ‘down’ health (up == 0) - Prometheus is blind to them
Targets with stale last-scrape (>2× scrape_interval since last success)
Targets with scrape duration approaching the scrape timeout
Jobs with zero discovered targets (service-discovery drift)
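
Two of these checks can be sketched against the JSON returned by `/api/v1/targets`. The sketch assumes the response carries `health`, `lastScrapeDuration` (seconds) and `scrapeTimeout` for each entry in `activeTargets`, as recent Prometheus versions do. The 80% "approaching the timeout" ratio and the function names are assumptions made for illustration, not values taken from this audit.

```python
import re

# Seconds per unit for the duration suffixes handled by this sketch.
_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration(text: str) -> float:
    """Convert a Prometheus-style duration such as '10s' or '1m30s' to seconds."""
    parts = re.findall(r"(\d+)(ms|s|m|h|d)", text)
    return sum(int(value) * _UNIT_SECONDS[unit] for value, unit in parts)


def audit_targets(payload: dict, timeout_ratio: float = 0.8) -> dict:
    """Return instances that are down, and instances scraping close to the timeout.

    timeout_ratio is an assumed cut-off for "approaching the scrape timeout".
    """
    down, slow = [], []
    for target in payload["data"]["activeTargets"]:
        instance = target["labels"].get("instance", target.get("scrapeUrl", "?"))
        if target["health"] == "down":
            down.append(instance)
            continue
        timeout = parse_duration(target["scrapeTimeout"])
        if timeout and target["lastScrapeDuration"] >= timeout_ratio * timeout:
            slow.append(instance)
    return {"down": down, "slow": slow}
```

Staleness and zero-target jobs are left out to keep the sketch short: the first needs timestamp parsing of `lastScrape`, and the second needs the configured job list to compare against.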

Performance & reliability

Apdex below 0.85 sustained over the window
Error rate above 2% (rate of 5xx / total requests)
p95 latency above 1500ms sustained
p99 latency above 3000ms sustained
Throughput dropped > 30% vs prior period (capacity / outage signal)
SLA compliance (avg_over_time(up)) below 99.5%
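
The arithmetic behind three of these signals is simple enough to show directly. The sketch below assumes the standard Apdex formula (satisfied plus half of tolerating, divided by total); this page does not state how the audit itself derives apdex, and the function names are hypothetical.

```python
from typing import Optional


def apdex(satisfied: int, tolerating: int, total: int) -> Optional[float]:
    """Standard Apdex score; None when there was no traffic to score."""
    if total == 0:
        return None
    return (satisfied + tolerating / 2) / total


def error_rate_pct(errors_5xx: float, total_requests: float) -> float:
    """5xx responses as a percentage of all requests."""
    if total_requests == 0:
        return 0.0
    return 100.0 * errors_5xx / total_requests


def throughput_change_pct(current: float, prior: float) -> float:
    """Percentage change in throughput versus the prior period (negative = drop)."""
    return 100.0 * (current - prior) / prior
```

For example, 800 satisfied and 100 tolerating requests out of 1000 give an apdex of 0.85, which sits exactly on the threshold used in this section.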

Alert hygiene & routing

Alerts firing with no matching Alertmanager receiver (fires silently)
Alert rules flapping (firing→inactive→firing) > 3× in 24h
Active silences with no expiry or expiry > 7d (over-broad suppression)
Critical-severity alerts firing > 30 min without acknowledgement
Recording/alerting rule groups with evaluation errors
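
The flapping check can be sketched as a count of re-fires over a sequence of sampled rule states. How the audit samples those states is not described here, so the input format (an ordered list of state strings covering the 24h window) and the function names are assumptions.

```python
from typing import Sequence


def count_flaps(states: Sequence[str]) -> int:
    """Count re-fires: each return to 'firing' after the rule had fired before.

    Any non-firing state in between (for example 'inactive' or 'pending')
    counts as a break, so firing -> inactive -> firing is one flap.
    """
    flaps = 0
    seen_firing = False
    previous = None
    for state in states:
        if state == "firing":
            if seen_firing and previous != "firing":
                flaps += 1
            seen_firing = True
        previous = state
    return flaps


def is_flapping(states: Sequence[str], max_flaps: int = 3) -> bool:
    """True when the rule re-fired more than max_flaps times in the window."""
    return count_flaps(states) > max_flaps
```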

Incident throughput

Mean time to acknowledge trending up vs prior period
Mean time to resolve above the 1h band (3600000 ms, listed as the critical threshold under Severity thresholds)
Open incident count (active alert groups) above baseline
Services concentrating the most firing alerts (noisy-service ranking)
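
The noisy-service ranking can be sketched as a count of active alerts per service. The sketch assumes the list returned by Alertmanager's `/api/v2/alerts`, where each alert has `labels` and a `status.state` field. The label name `service` is an assumption: the label that identifies a service depends on how the merchant's rules are written.

```python
from collections import Counter
from typing import List, Tuple


def rank_noisy_services(
    alerts: List[dict], label: str = "service", top: int = 5
) -> List[Tuple[str, int]]:
    """Rank services by number of active (unsuppressed) alerts, noisiest first."""
    counts = Counter(
        alert["labels"].get(label, "unknown")
        for alert in alerts
        if alert.get("status", {}).get("state") == "active"
    )
    return counts.most_common(top)
```

Alerts that are silenced or inhibited report the state `suppressed` and are excluded, so the ranking reflects what is actually paging someone.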

Cross-channel: revenue-at-risk (the killer area)

Firing critical/warning alert on a service that maps to a commerce sibling = compute revenue lost as commerce.revenue_per_min × firing_minutes × estimated_traffic_loss_pct (without firing_minutes, the same product is the $/min burn rate)
Service-down (up == 0) during a commerce sibling’s peak traffic window
5xx error-rate spike during a campaign push (sibling = google_ads / amazon_ads / klaviyo) - paying for traffic that can’t convert
p95 latency breach on the checkout-path service during peak hours
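
The revenue-at-risk product can be written out as a small worked example. It assumes `estimated_traffic_loss_pct` is expressed on a 0 to 100 scale; the page does not say whether it is a percentage or a fraction, so adjust the division if the audit uses 0 to 1.

```python
def revenue_at_risk(
    revenue_per_min: float,
    firing_minutes: float,
    estimated_traffic_loss_pct: float,
) -> float:
    """Revenue lost while the alert fired, in the currency of revenue_per_min.

    Assumes estimated_traffic_loss_pct is on a 0-100 scale.
    """
    return revenue_per_min * firing_minutes * estimated_traffic_loss_pct / 100.0


def burn_rate_per_min(revenue_per_min: float, estimated_traffic_loss_pct: float) -> float:
    """Per-minute loss: the same product without the firing duration."""
    return revenue_per_min * estimated_traffic_loss_pct / 100.0
```

With a sibling earning 120 per minute, an alert firing for 15 minutes and an estimated 40% traffic loss, the total is 120 × 15 × 0.40 = 720, or 48 per minute.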

Severity thresholds

Signal	Warn	Critical
`apdex`	0.9	0.85
`error_rate_pct`	1	2
`p95_latency_ms`	1000	1500
`p99_latency_ms`	1500	3000
`throughput_change_pct_vsP`	-15	-30
`sla_compliance_pct`	99.9	99.5
`targets_down_count`	1	3
`alerts_firing_no_receiver`	0	1
`mttr_ms`	1800000	3600000
`mtta_ms`	300000	900000
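
A sketch of how a measured value might be graded against this table. The direction of each comparison is an assumption inferred from the "above" and "below" wording in the checks: lower is worse for apdex and SLA compliance, higher is worse for the rest. Comparisons are strict. The count-based signals are left out because the table does not say whether their thresholds are inclusive.

```python
# signal: (warn, critical, direction); "low" means lower values are worse.
THRESHOLDS = {
    "apdex": (0.9, 0.85, "low"),
    "error_rate_pct": (1, 2, "high"),
    "p95_latency_ms": (1000, 1500, "high"),
    "p99_latency_ms": (1500, 3000, "high"),
    "sla_compliance_pct": (99.9, 99.5, "low"),
    "mttr_ms": (1_800_000, 3_600_000, "high"),
    "mtta_ms": (300_000, 900_000, "high"),
}


def classify(signal: str, value: float) -> str:
    """Return 'critical', 'warn' or 'ok' for one measured signal value."""
    warn, critical, direction = THRESHOLDS[signal]
    if direction == "low":
        if value < critical:
            return "critical"
        if value < warn:
            return "warn"
    else:
        if value > critical:
            return "critical"
        if value > warn:
            return "warn"
    return "ok"
```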

Data sources

GET {base_url}/api/v1/status/buildinfo - Auth + server reachability probe
GET {base_url}/api/v1/query - Instant PromQL for apdex / error-rate / latency / throughput thresholds
GET {base_url}/api/v1/query_range - Range PromQL for trend / SLA / MTTA / MTTR computation
GET {base_url}/api/v1/targets - Scrape-target health + last-scrape freshness
GET {base_url}/api/v1/rules - Rule-group inventory + evaluation errors
GET {alertmanager_url}/api/v2/alerts - Firing-alert inventory + severity + service labels
GET {alertmanager_url}/api/v2/alerts/groups - Active incident grouping (open-incident count)
GET {alertmanager_url}/api/v2/silences - Silence hygiene + acknowledgement proxy
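
The two query endpoints take their PromQL expression as a URL-encoded `query` parameter, and `query_range` also takes `start`, `end` and `step`. A minimal sketch of building those URLs follows; the helper names are hypothetical, and the PromQL in the usage note is illustrative rather than a query the audit is known to run.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """URL for an instant query against /api/v1/query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def range_query_url(base_url: str, promql: str, start: int, end: int, step: str) -> str:
    """URL for a range query; start and end are Unix timestamps in seconds."""
    params = {"query": promql, "start": start, "end": end, "step": step}
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"
```

For instance, `instant_query_url("http://prom:9090", "up == 0")` yields `http://prom:9090/api/v1/query?query=up+%3D%3D+0`, which lists the targets Prometheus currently sees as down.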
