Datadog state means nothing to a merchant unless it’s joined to revenue. This audit answers four questions: (1) is the merchant’s stack healthy right now; (2) are the monitors that should be watching it actually watching it (coverage gaps, no-notification monitors, no-data drifters); (3) are we burning the SLO error budget faster than the month is elapsing; and (4) when something is broken, how much money is on fire per minute?

What this audit checks

Authentication & access

  • API key + App key valid (auth on /api/v1/validate)
  • Site host correct for region (US1 / EU1 / US3 / US5 / AP1)
  • Custom-metric quota headroom > 15%
  • Indexing quota headroom > 15%
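
The access checks above can be sketched as two small helpers. This is a minimal illustration, not the audit's actual implementation: the function names are hypothetical, sending the App key alongside the API key on the validate call is an assumption, and the quota figures are assumed to be already fetched.

```python
# Region-to-site mapping for the Datadog regions listed above.
SITES = {
    "US1": "datadoghq.com",
    "EU1": "datadoghq.eu",
    "US3": "us3.datadoghq.com",
    "US5": "us5.datadoghq.com",
    "AP1": "ap1.datadoghq.com",
}


def validate_request(region: str, api_key: str, app_key: str) -> tuple[str, dict[str, str]]:
    """Return (url, headers) for the key-validation call.

    Raises KeyError for a region that is not in SITES.
    """
    site = SITES[region.upper()]
    url = f"https://api.{site}/api/v1/validate"
    # Assumption: both keys are sent so later calls can reuse the same headers.
    headers = {"DD-API-KEY": api_key, "DD-APPLICATION-KEY": app_key}
    return url, headers


def quota_headroom_ok(used: float, limit: float, min_headroom_pct: float = 15.0) -> bool:
    """True when strictly more than min_headroom_pct of the quota is unused."""
    if limit <= 0:
        return False
    headroom_pct = 100.0 * (limit - used) / limit
    return headroom_pct > min_headroom_pct
```

A key that validates against the wrong site host typically fails authentication, so resolving the host from the region first keeps a region mismatch from being misread as a bad key.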

Monitor coverage (the blind-spot test)

  • Services without error-rate monitor
  • Services without latency monitor
  • Monitors in ‘No Data’ state >24h (lost telemetry)
  • Monitors without notification channel (fires silently)
  • Monitors flapping >3 times in last 24h (noisy or wrong threshold)
  • Skipped monitors not re-enabled after >7d
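
A rough sketch of the blind-spot test, run over an already-fetched monitor list. The dictionary shape is a simplified assumption (name, message, overall_state, overall_state_modified, tags), and treating "no @-mention in the message" as "no notification channel" is a heuristic chosen for illustration. Coverage here means "any monitor tagged with the service"; telling error-rate monitors from latency monitors would need query inspection and is left out.

```python
import re
from datetime import datetime, timedelta, timezone

# Heuristic: a notification target appears as an @-handle in the message.
MENTION = re.compile(r"@[\w.\-+]+")


def audit_monitors(monitors, services, now=None):
    """Return silent monitors, stale No Data monitors, and uncovered services."""
    now = now or datetime.now(timezone.utc)
    silent, no_data_stale, covered = [], [], set()
    for m in monitors:
        if not MENTION.search(m.get("message", "")):
            silent.append(m["name"])
        changed = m.get("overall_state_modified")
        if m.get("overall_state") == "No Data" and changed:
            # Assumes an ISO-8601 timestamp with an explicit UTC offset.
            if now - datetime.fromisoformat(changed) > timedelta(hours=24):
                no_data_stale.append(m["name"])
        for tag in m.get("tags", []):
            if tag.startswith("service:"):
                covered.add(tag.split(":", 1)[1])
    return {
        "silent": silent,
        "no_data_over_24h": no_data_stale,
        "uncovered_services": sorted(set(services) - covered),
    }
```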

Reliability & SLO health

  • Apdex below 0.85 sustained > 30 min
  • p95 above 1500ms sustained > 15 min
  • Error rate > 2% sustained > 10 min
  • Throughput dropped > 30% WoW (capacity / outage signal)
  • SLO burn rate > 14.4× (fast-burn alert)
  • Error budget remaining < 20%
  • SLO breach forecast within 7 days
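
The burn-rate and forecast checks reduce to a little arithmetic. The sketch below assumes a 30-day SLO period and a constant burn rate for the forecast; the function names are illustrative. It shows why 14.4× is a common fast-burn threshold: sustained for one hour it spends 14.4 × 1 / 720 = 2% of a 30-day budget.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the allowed error ratio (1 - target)."""
    allowed = 1.0 - slo_target
    if allowed <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / allowed


def budget_consumed_pct(rate: float, window_hours: float, period_days: int = 30) -> float:
    """Percent of the period's error budget spent if `rate` holds for window_hours."""
    return 100.0 * rate * window_hours / (period_days * 24)


def days_until_breach(budget_remaining_pct: float, rate: float, period_days: int = 30) -> float:
    """Days until the budget reaches zero at a constant burn rate."""
    if rate <= 0:
        return float("inf")  # not burning, so no breach is forecast
    return (budget_remaining_pct / 100.0) * period_days / rate
```

For example, with 20% of the budget left and a steady 2× burn, `days_until_breach(20, 2.0)` comes out at about 3 days, which would trip the 7-day breach forecast.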

Synthetic & uptime

  • Critical-path synthetic test failing (login / browse / cart / checkout)
  • Region-specific uptime < 99% (regional outage / CDN issue)
  • Browser test latency p95 > 5000ms
  • API monitor failures > 3 in 24h

Real user monitoring (customer-experience signal)

  • Page load p95 > 4000ms (conversion-killer threshold)
  • Frustrated session rate > 5% (rage clicks / >4s loads)
  • JS errors per session > 0.5
  • Mobile-vs-desktop latency gap > 2× (mobile-experience regression)

Logs & ingestion

  • Log volume up > 30% vs prior period (cost spike or runaway logging)
  • Error-level log rate > 10% of total (signal/noise drift)
  • Ingestion freshness > 300s (lag = blind to live state)
  • New error patterns emerging in last 24h
  • Indexing cost trend up > 50% MoM

Cost & capacity

  • Custom-metric quota > 85% (impending overage charges)
  • High-cardinality tag warnings (cost-amplifier flag)
  • Infra spend trend > +15% MoM
  • Hosts with stale agent (>24h since last report)
  • Hosts with disk >90% full

Incident response & service-health rollup

  • Mean time to acknowledge (MTTA) > 30 min (paging / on-call gap)
  • Mean time to resolve (MTTR) > 120 min (remediation drag)
  • Any service in Down or Degraded state (service-health rollup)
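
MTTA and MTTR can be derived from incident timestamps. This is a sketch under assumptions: the record keys (created, acknowledged, resolved) are hypothetical and would need mapping from whatever the incident source actually returns, and timestamps are assumed to be ISO-8601 strings.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Mean minutes between two timestamps; None if no incident has both."""
    spans = [
        (datetime.fromisoformat(i[end_key]) - datetime.fromisoformat(i[start_key])).total_seconds() / 60
        for i in incidents
        if i.get(start_key) and i.get(end_key)
    ]
    return mean(spans) if spans else None


def response_gaps(incidents, mtta_limit=30, mttr_limit=120):
    """Flag MTTA/MTTR values that exceed the limits from the checklist."""
    mtta = mean_minutes(incidents, "created", "acknowledged")
    mttr = mean_minutes(incidents, "created", "resolved")
    return {
        "mtta_min": mtta,
        "mttr_min": mttr,
        "mtta_breach": mtta is not None and mtta > mtta_limit,
        "mttr_breach": mttr is not None and mttr > mttr_limit,
    }
```

Incidents that were never acknowledged or resolved are skipped rather than counted as zero, so open incidents do not flatter the averages.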

Cross-channel: revenue-at-risk (the killer area)

  • Active incident while a sibling commerce connector is live: compute revenue lost per minute (commerce.revenue_per_min × estimated_traffic_loss_pct) and total revenue lost so far (that per-minute figure × incident_minutes)
  • Checkout-service degradation (p95 > 3s) during peak hours
  • 5xx spike during a campaign push (sibling = google_ads / amazon_ads / klaviyo) - paying for traffic that can’t convert
  • Conversion drop during incident windows (vs 90D baseline)
  • Cart abandonment spike correlated with 5xx rate
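
The revenue-at-risk formula in the first bullet can be sketched as follows. The function name is illustrative, and it assumes estimated_traffic_loss_pct is expressed on a 0 to 100 scale; the terms themselves come from the bullet.

```python
def revenue_at_risk(revenue_per_min: float, incident_minutes: float,
                    traffic_loss_pct: float) -> dict[str, float]:
    """Return the per-minute and cumulative revenue lost during an incident."""
    # Clamp to 0..100 so a bad estimate cannot produce negative or >100% loss.
    loss_fraction = max(0.0, min(traffic_loss_pct, 100.0)) / 100.0
    lost_per_min = revenue_per_min * loss_fraction
    return {
        "lost_per_min": lost_per_min,
        "lost_total": lost_per_min * incident_minutes,
    }
```

For a store earning 200 per minute with a 25% estimated traffic loss, a 30-minute incident gives 50 per minute and 1500 in total.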

Severity thresholds

Signal                            Warn     Critical
apdex                             0.9      0.85
error_rate_pct                    1        2
mtta_min                          5        30
mttr_min                          30       120
services_unhealthy_count          0        1
p95_latency_ms                    1000     1500
throughput_change_pct_wow         -15      -30
slo_burn_rate_1h                  6        14.4
error_budget_remaining_pct        30       20
synthetic_uptime_pct              99.9     99.5
rum_page_load_p95_ms              3000     4000
rum_frustrated_session_pct        3        5
monitors_in_no_data_count         1        5
monitors_no_notification_count    0        1
custom_metric_quota_pct           75       85
log_volume_change_pct_vsP         20       30
ingestion_freshness_sec           120      300
agent_stale_hosts_count           1        5
disk_pct_max                      80       90
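
One way to turn these thresholds into a severity label is sketched below, using a subset of the signals. Two things are assumptions rather than stated rules: the direction of each signal is inferred from whether the critical value is lower than the warn value, and comparisons are strict, following the "below" and ">" wording of the checklist.

```python
# (warn, critical) pairs copied from the threshold table; subset only.
THRESHOLDS = {
    "apdex": (0.9, 0.85),
    "error_rate_pct": (1, 2),
    "p95_latency_ms": (1000, 1500),
    "throughput_change_pct_wow": (-15, -30),
    "error_budget_remaining_pct": (30, 20),
    "disk_pct_max": (80, 90),
}


def severity(signal: str, value: float) -> str:
    """Classify a reading as 'ok', 'warn', or 'critical'."""
    warn, critical = THRESHOLDS[signal]
    if critical < warn:
        # Lower is worse (apdex, remaining budget, throughput change).
        if value < critical:
            return "critical"
        if value < warn:
            return "warn"
    else:
        # Higher is worse (error rate, latency, disk usage).
        if value > critical:
            return "critical"
        if value > warn:
            return "warn"
    return "ok"
```

Count-style signals with a warn value of 0 (for example services_unhealthy_count) are left out because the table does not say whether their bounds are inclusive.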

Data sources

  • GET https://api.{site}/api/v1/validate - Auth + key sanity
  • GET https://api.{site}/api/v1/monitor - Monitor inventory + states + notification channels
  • GET https://api.{site}/api/v1/query - Run metric queries for threshold checks
  • POST https://api.{site}/api/v2/logs/events/search - Log volume + error-pattern detection
  • GET https://api.{site}/api/v1/synthetics/tests - Synthetic test inventory + uptime
  • GET https://api.{site}/api/v1/slo - SLO state + burn rate
  • GET https://api.{site}/api/v2/rum/applications - RUM page-load p95 + frustrated sessions
  • GET https://api.{site}/api/v2/incidents - Active incident inventory (revenue-at-risk join)
  • GET https://api.{site}/api/v1/usage/summary - Quota usage + cost-trend signals
  • GET https://api.{site}/api/v1/hosts - Host count + agent freshness + saturation
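
As an illustration of how the {site} placeholder and the auth headers fit together, the sketch below builds (but does not send) requests for a few of the endpoints listed above, using only the standard library. The ENDPOINTS table and build_request helper are hypothetical names, and the example log-search body is an assumption about a typical filter.

```python
import json
import urllib.request

# Subset of the data sources above: name -> (HTTP method, path).
ENDPOINTS = {
    "validate": ("GET", "/api/v1/validate"),
    "monitors": ("GET", "/api/v1/monitor"),
    "log_search": ("POST", "/api/v2/logs/events/search"),
    "slos": ("GET", "/api/v1/slo"),
    "incidents": ("GET", "/api/v2/incidents"),
    "hosts": ("GET", "/api/v1/hosts"),
}


def build_request(name, site, api_key, app_key, body=None):
    """Build an unsent urllib Request for one of the audit's data sources."""
    method, path = ENDPOINTS[name]
    headers = {
        "DD-API-KEY": api_key,
        "DD-APPLICATION-KEY": app_key,
        "Accept": "application/json",
    }
    data = None
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        f"https://api.{site}{path}", data=data, headers=headers, method=method
    )


# Usage: a log search for error-level events in the last 15 minutes.
req = build_request(
    "log_search", "datadoghq.com", "API_KEY", "APP_KEY",
    body={"filter": {"query": "status:error", "from": "now-15m", "to": "now"}},
)
```

Passing the result to `urllib.request.urlopen(req)` would perform the call; keeping construction separate from sending makes the URL and header logic easy to check without network access.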