What this audit checks
Authentication & access
- API key + App key valid (auth on /api/v1/validate)
- Site host correct for region (US1 / EU1 / US3 / US5 / AP1)
- Custom-metric quota headroom > 15%
- Indexing quota headroom > 15%
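The access checks above can be sketched as follows. This is a minimal illustration, assuming the audited platform is Datadog (which the region names and the `/api/v1/validate` path suggest). The function names `validate_url`, `auth_headers`, and `has_headroom` are hypothetical helpers, not part of any client library, and no request is actually sent.

```python
# Site hosts for the regions named in the checklist (assumes Datadog sites).
SITE_HOSTS = {
    "US1": "datadoghq.com",
    "EU1": "datadoghq.eu",
    "US3": "us3.datadoghq.com",
    "US5": "us5.datadoghq.com",
    "AP1": "ap1.datadoghq.com",
}


def validate_url(region: str) -> str:
    """Build the key-validation URL for a region; raises KeyError if unknown."""
    site = SITE_HOSTS[region.upper()]
    return f"https://api.{site}/api/v1/validate"


def auth_headers(api_key: str, app_key: str) -> dict:
    """Headers carrying the API key and the application key."""
    return {"DD-API-KEY": api_key, "DD-APPLICATION-KEY": app_key}


def has_headroom(used_pct: float, min_headroom_pct: float = 15.0) -> bool:
    """True when the unused share of a quota is strictly above the minimum."""
    return (100.0 - used_pct) > min_headroom_pct
```

A wrong site host typically shows up as an authentication failure even with valid keys, so checking the region mapping before the keys avoids chasing the wrong problem.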
Monitor coverage (the blind-spot test)
- Services without error-rate monitor
- Services without latency monitor
- Monitors in ‘No Data’ state >24h (lost telemetry)
- Monitors without notification channel (fires silently)
- Monitors flapping >3 times in last 24h (noisy or wrong threshold)
- Skipped monitors not re-enabled after >7d
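Two of the blind-spot checks above can be sketched against a monitor inventory. This is an illustration under stated assumptions: each monitor is a dict with `name`, `overall_state`, `overall_state_modified` (an ISO 8601 timestamp with offset), and `message` fields, and a notification channel is treated as present when the message contains an `@` handle. Verify both assumptions against the actual API response before relying on them.

```python
from datetime import datetime, timedelta, timezone


def no_data_over(monitors, now=None, max_age=timedelta(hours=24)):
    """Names of monitors that have sat in 'No Data' longer than max_age."""
    now = now or datetime.now(timezone.utc)
    stale = []
    for m in monitors:
        if m.get("overall_state") != "No Data":
            continue
        # Assumed field: when the overall state last changed.
        changed = datetime.fromisoformat(m["overall_state_modified"])
        if now - changed > max_age:
            stale.append(m["name"])
    return stale


def silent_monitors(monitors):
    """Names of monitors whose message carries no @-handle (assumed heuristic)."""
    return [m["name"] for m in monitors if "@" not in m.get("message", "")]
```

The `@` heuristic is deliberately crude: it can miss channels configured outside the message body, so treat its output as candidates for review rather than confirmed gaps.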
Reliability & SLO health
- Apdex below 0.85 sustained > 30 min
- p95 above 1500ms sustained > 15 min
- Error rate > 2% sustained > 10 min
- Throughput dropped > 30% WoW (capacity / outage signal)
- SLO burn rate > 14.4× (fast-burn alert)
- Error budget remaining < 20%
- SLO breach forecast within 7 days
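The burn-rate and error-budget items above rest on simple arithmetic, sketched here. Burn rate is taken as the observed error rate divided by the error budget (1 minus the SLO target). The 30-day SLO period and the function names are assumptions for illustration. Under that period, a 14.4× burn sustained for one hour consumes 2% of the budget, which is where the fast-burn figure comes from.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate as a multiple of the allowed error budget."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / budget


def budget_consumed(rate: float, window_hours: float,
                    period_hours: float = 30 * 24) -> float:
    """Fraction of the period's error budget spent in the window."""
    return rate * window_hours / period_hours


def days_until_exhausted(remaining_fraction: float, rate: float,
                         period_days: float = 30.0) -> float:
    """Days until the remaining budget runs out at the current burn rate."""
    if rate <= 0:
        return float("inf")
    return remaining_fraction * period_days / rate
```

For example, with 20% of the budget left and a burn rate of 1.0, `days_until_exhausted(0.2, 1.0)` comes to about 6 days, which would trip the 7-day breach forecast.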
Synthetic & uptime
- Critical-path synthetic test failing (login / browse / cart / checkout)
- Region-specific uptime < 99% (regional outage / CDN issue)
- Browser test latency p95 > 5000ms
- API monitor failures > 3 in 24h
Real user monitoring (customer-experience signal)
- Page load p95 > 4000ms (conversion-killer threshold)
- Frustrated session rate > 5% (rage clicks / >4s loads)
- JS errors per session > 0.5
- Mobile-vs-desktop latency gap > 2× (mobile-experience regression)
Logs & ingestion
- Log volume up > 30% vs prior period (cost spike or runaway logging)
- Error-level log rate > 10% of total (signal/noise drift)
- Ingestion freshness > 300s (lag = blind to live state)
- New error patterns emerging in last 24h
- Indexing cost trend up > 50% MoM
Cost & capacity
- Custom-metric quota > 85% (impending overage charges)
- High-cardinality tag warnings (cost-amplifier flag)
- Infra spend trend > +15% MoM
- Hosts with stale agent (>24h since last report)
- Hosts with disk >90% full
Incident response & service-health rollup
- Mean time to acknowledge (MTTA) > 30 min (paging / on-call gap)
- Mean time to resolve (MTTR) > 120 min (remediation drag)
- Any service in Down or Degraded state (service-health rollup)
Cross-channel: revenue-at-risk (the killer area)
- Active incident while a sibling commerce connector is live: compute revenue lost (commerce.revenue_per_min × incident_minutes × estimated_traffic_loss_pct)
- Checkout-service degradation (p95 > 3s) during peak hours
- 5xx spike during a campaign push (sibling = google_ads / amazon_ads / klaviyo) - paying for traffic that can’t convert
- Conversion drop during incident windows (vs 90-day baseline)
- Cart abandonment spike correlated with 5xx rate
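The revenue-at-risk formula in the first item of this list can be written out directly. This sketch assumes `estimated_traffic_loss_pct` is expressed on a 0-100 scale, so it is divided by 100 before multiplying. The example figures are invented for illustration only.

```python
def revenue_at_risk(revenue_per_min: float, incident_minutes: float,
                    estimated_traffic_loss_pct: float) -> float:
    """Estimated revenue lost over an incident.

    Assumes estimated_traffic_loss_pct is on a 0-100 scale.
    """
    if not 0 <= estimated_traffic_loss_pct <= 100:
        raise ValueError("estimated_traffic_loss_pct must be within 0-100")
    loss_fraction = estimated_traffic_loss_pct / 100.0
    return revenue_per_min * incident_minutes * loss_fraction


# Illustrative numbers: $200/min baseline, 45-minute incident, 40% traffic lost.
example_loss = revenue_at_risk(200.0, 45, 40)
```

Because the product includes `incident_minutes`, the result is a total for the incident; dividing by `incident_minutes` gives the per-minute rate.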
Severity thresholds
| Signal | Warn | Critical |
|---|---|---|
| apdex | 0.9 | 0.85 |
| error_rate_pct | 1 | 2 |
| mtta_min | 5 | 30 |
| mttr_min | 30 | 120 |
| services_unhealthy_count | 0 | 1 |
| p95_latency_ms | 1000 | 1500 |
| throughput_change_pct_wow | -15 | -30 |
| slo_burn_rate_1h | 6 | 14.4 |
| error_budget_remaining_pct | 30 | 20 |
| synthetic_uptime_pct | 99.9 | 99.5 |
| rum_page_load_p95_ms | 3000 | 4000 |
| rum_frustrated_session_pct | 3 | 5 |
| monitors_in_no_data_count | 1 | 5 |
| monitors_no_notification_count | 0 | 1 |
| custom_metric_quota_pct | 75 | 85 |
| log_volume_change_pct_vsP | 20 | 30 |
| ingestion_freshness_sec | 120 | 300 |
| agent_stale_hosts_count | 1 | 5 |
| disk_pct_max | 80 | 90 |
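One way to apply the threshold table is sketched below, using a handful of its rows. Two assumptions are made here and are not stated in the table: comparisons are strict (matching the ">" and "below" wording in the checklist), and the direction of "worse" is inferred from whether the Critical value is lower or higher than the Warn value. `classify` is a hypothetical helper name.

```python
# (warn, critical) pairs copied from a subset of the table rows.
THRESHOLDS = {
    "apdex": (0.9, 0.85),
    "error_rate_pct": (1, 2),
    "p95_latency_ms": (1000, 1500),
    "throughput_change_pct_wow": (-15, -30),
    "custom_metric_quota_pct": (75, 85),
}


def classify(signal: str, value: float) -> str:
    """Return 'critical', 'warn', or 'ok' for a signal reading."""
    warn, critical = THRESHOLDS[signal]
    # Assumption: if critical sits below warn, lower readings are worse.
    lower_is_worse = critical < warn

    def breached(limit: float) -> bool:
        # Assumption: strict comparison, so a reading equal to the limit passes.
        return value < limit if lower_is_worse else value > limit

    if breached(critical):
        return "critical"
    if breached(warn):
        return "warn"
    return "ok"
```

Under these assumptions, a quota reading of exactly 85 classifies as `warn` rather than `critical`, and a count signal with a Warn value of 0 raises a warning as soon as the count reaches 1.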
Data sources
- GET https://api.{site}/api/v1/validate - Auth + key sanity
- GET https://api.{site}/api/v1/monitor - Monitor inventory + states + notification channels
- POST https://api.{site}/api/v1/query - Run metric queries for threshold checks
- POST https://api.{site}/api/v2/logs/events/search - Log volume + error-pattern detection
- GET https://api.{site}/api/v1/synthetics/tests - Synthetic test inventory + uptime
- GET https://api.{site}/api/v1/slo - SLO state + burn rate
- GET https://api.{site}/api/v2/rum/applications - RUM page-load p95 + frustrated sessions
- GET https://api.{site}/api/v2/incidents - Active incident inventory (revenue-at-risk join)
- GET https://api.{site}/api/v1/usage/summary - Quota usage + cost-trend signals
- GET https://api.{site}/api/v1/hosts - Host count + agent freshness + saturation