Datadog audit profile, Vortex IQ - Vortex IQ Help Centre

Datadog state means nothing to a merchant unless it’s joined to revenue. This audit answers four questions: (1) is the merchant’s stack healthy right now, (2) are the monitors that should be watching it actually watching it (coverage gaps, no-notification monitors, no-data drifters), (3) are we burning the SLO error budget faster than the month allows, and (4) when something IS broken, how much money is on fire per minute?

What this audit checks

Authentication & access

API key + App key valid (auth on /api/v1/validate)
Site host correct for region (US1 / EU1 / US3 / US5 / AP1)
Custom-metric quota headroom > 15%
Indexing quota headroom > 15%
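
As a rough illustration of the access checks above, the sketch below maps each region to its Datadog site host, builds (without sending) the key-validation request, and tests quota headroom against the 15% floor. The function names and the `used` / `limit` inputs are hypothetical and are not part of the audit itself. Only the API key header is set here; endpoints that need an App key also take a `DD-APPLICATION-KEY` header.

```python
from urllib.request import Request

# Datadog site host per region (public Datadog site names).
SITE_HOSTS = {
    "US1": "datadoghq.com",
    "EU1": "datadoghq.eu",
    "US3": "us3.datadoghq.com",
    "US5": "us5.datadoghq.com",
    "AP1": "ap1.datadoghq.com",
}


def build_validate_request(region: str, api_key: str) -> Request:
    """Build (but do not send) the key-validation request for a region."""
    site = SITE_HOSTS[region.upper()]  # raises KeyError for an unknown region
    return Request(
        f"https://api.{site}/api/v1/validate",
        headers={"DD-API-KEY": api_key, "Accept": "application/json"},
        method="GET",
    )


def quota_headroom_ok(used: float, limit: float, min_headroom_pct: float = 15.0) -> bool:
    """True when the unused share of the quota is strictly above the floor."""
    if limit <= 0:
        return False  # no usable limit reported: treat as failing the check
    return (limit - used) / limit * 100.0 > min_headroom_pct
```

Sending the request (for example with `urllib.request.urlopen`) is left out so the sketch stays free of network calls.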

Monitor coverage (the blind-spot test)

Services without error-rate monitor
Services without latency monitor
Monitors in ‘No Data’ state >24h (lost telemetry)
Monitors without notification channel (fires silently)
Monitors flapping >3 times in last 24h (noisy or wrong threshold)
Skipped monitors not re-enabled after >7d
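
Two of the checks above can be approximated directly from the monitor inventory. The sketch below is a minimal example that assumes each monitor is a dict with `name`, `overall_state`, and `message` fields, and that a notification channel shows up as an @-handle in the message text. The helper name `coverage_gaps` is hypothetical, and the 24-hour duration test for No Data is deliberately left out because it needs state-change timestamps.

```python
import re

# Matches an @-handle such as @slack-ops or @pagerduty-checkout in a monitor message.
HANDLE = re.compile(r"@[\w.\-]+")


def coverage_gaps(monitors: list[dict]) -> dict[str, list[str]]:
    """Return names of monitors in 'No Data' and monitors with no @-handle."""
    no_data: list[str] = []
    silent: list[str] = []
    for monitor in monitors:
        name = monitor.get("name", "<unnamed>")
        if monitor.get("overall_state") == "No Data":
            no_data.append(name)
        # A message with no @-handle is assumed to notify nobody.
        if not HANDLE.search(monitor.get("message") or ""):
            silent.append(name)
    return {"no_data": no_data, "no_notification": silent}
```

A monitor can appear in both lists: one that has lost telemetry and also notifies nobody is the worst case for coverage.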

Reliability & SLO health

Apdex below 0.85 sustained > 30 min
p95 above 1500ms sustained > 15 min
Error rate > 2% sustained > 10 min
Throughput dropped > 30% WoW (capacity / outage signal)
SLO burn rate > 14.4× (fast-burn alert)
Error budget remaining < 20%
SLO breach forecast within 7 days
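
The 14.4× fast-burn figure is easier to read with the arithmetic spelled out. The sketch below assumes a 30-day SLO period and the usual definition of burn rate as the observed error ratio divided by the error budget (1 minus the SLO target). Under those assumptions, one hour at 14.4× consumes 2% of the month's budget. Function names are illustrative only.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the allowed error ratio (the budget)."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def budget_consumed_fraction(rate: float, window_hours: float, period_days: int = 30) -> float:
    """Share of the period's error budget consumed at a constant burn rate."""
    return rate * window_hours / (period_days * 24)


# Example: a 99.9% SLO with 1.44% of requests failing burns at roughly 14.4x;
# one hour at that rate uses about 2% of a 30-day budget.
rate = burn_rate(0.0144, 0.999)
share = budget_consumed_fraction(14.4, window_hours=1)
```

A burn rate of 1× spends the budget exactly over the period, so anything above 1× sustained for the whole month ends in a breach.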

Synthetic & uptime

Critical-path synthetic test failing (login / browse / cart / checkout)
Region-specific uptime < 99% (regional outage / CDN issue)
Browser test latency p95 > 5000ms
API monitor failures > 3 in 24h

Real user monitoring (customer-experience signal)

Page load p95 > 4000ms (conversion-killer threshold)
Frustrated session rate > 5% (rage clicks / >4s loads)
JS errors per session > 0.5
Mobile-vs-desktop latency gap > 2× (mobile-experience regression)

Logs & ingestion

Log volume up > 30% vs prior period (cost spike or runaway logging)
Error-level log rate > 10% of total (signal/noise drift)
Ingestion freshness > 300s (lag = blind to live state)
New error patterns emerging in last 24h
Indexing cost trend up > 50% MoM

Cost & capacity

Custom-metric quota > 85% (impending overage charges)
High-cardinality tag warnings (cost-amplifier flag)
Infra spend trend > +15% MoM
Hosts with stale agent (>24h since last report)
Hosts with disk >90% full

Incident response & service-health rollup

Mean time to acknowledge (MTTA) > 30 min (paging / on-call gap)
Mean time to resolve (MTTR) > 120 min (remediation drag)
Any service in Down or Degraded state (service-health rollup)

Cross-channel: revenue-at-risk (the killer area)

Active incident while a sibling commerce connector is live: compute revenue lost (commerce.revenue_per_min × incident_minutes × estimated_traffic_loss_pct)
Checkout-service degradation (p95 > 3s) during peak hours
5xx spike during a campaign push (sibling = google_ads / amazon_ads / klaviyo) - paying for traffic that can’t convert
Conversion drop during incident windows (vs 90D baseline)
Cart abandonment spike correlated with 5xx rate
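
The revenue-at-risk formula in the first item of this section can be sketched as a single function. This example assumes `estimated_traffic_loss_pct` is expressed on a 0 to 100 scale and that `revenue_per_min` comes from the sibling commerce connector; the function name is hypothetical.

```python
def revenue_at_risk(revenue_per_min: float, incident_minutes: float,
                    estimated_traffic_loss_pct: float) -> float:
    """Revenue lost over an incident: rate x duration x share of traffic lost."""
    if not 0 <= estimated_traffic_loss_pct <= 100:
        raise ValueError("estimated_traffic_loss_pct must be between 0 and 100")
    if revenue_per_min < 0 or incident_minutes < 0:
        raise ValueError("revenue_per_min and incident_minutes must be non-negative")
    return revenue_per_min * incident_minutes * (estimated_traffic_loss_pct / 100.0)


# Example: 50 per minute in revenue, a 12-minute incident, 40% of traffic lost
# gives roughly 240 in revenue at risk.
loss = revenue_at_risk(50.0, 12, 40)
```

Dropping `incident_minutes` from the product gives the per-minute figure for an incident that is still open.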

Severity thresholds

Signal	Warn	Critical
`apdex`	0.9	0.85
`error_rate_pct`	1	2
`mtta_min`	5	30
`mttr_min`	30	120
`services_unhealthy_count`	0	1
`p95_latency_ms`	1000	1500
`throughput_change_pct_wow`	-15	-30
`slo_burn_rate_1h`	6	14.4
`error_budget_remaining_pct`	30	20
`synthetic_uptime_pct`	99.9	99.5
`rum_page_load_p95_ms`	3000	4000
`rum_frustrated_session_pct`	3	5
`monitors_in_no_data_count`	1	5
`monitors_no_notification_count`	0	1
`custom_metric_quota_pct`	75	85
`log_volume_change_pct_vsP`	20	30
`ingestion_freshness_sec`	120	300
`agent_stale_hosts_count`	1	5
`disk_pct_max`	80	90
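
Some signals in this table get worse as they rise (error rate, latency) and others as they fall (Apdex, uptime, budget remaining). The sketch below infers the direction from whether the Critical value is above or below the Warn value, then classifies a reading. Strict comparisons are an assumption borrowed from the ">" and "below" wording in the checklist; the table itself does not say how to treat a value exactly on a threshold. Only a few rows are copied in.

```python
# (warn, critical) pairs copied from a few rows of the table above.
THRESHOLDS = {
    "apdex": (0.9, 0.85),
    "error_rate_pct": (1, 2),
    "throughput_change_pct_wow": (-15, -30),
    "services_unhealthy_count": (0, 1),
}


def classify(value: float, warn: float, critical: float) -> str:
    """Return 'ok', 'warn', or 'critical' for one reading of one signal."""
    if critical >= warn:
        # Higher is worse, e.g. error_rate_pct.
        if value > critical:
            return "critical"
        if value > warn:
            return "warn"
    else:
        # Lower is worse, e.g. apdex.
        if value < critical:
            return "critical"
        if value < warn:
            return "warn"
    return "ok"


def classify_signal(signal: str, value: float) -> str:
    """Look up a signal's thresholds and classify the reading."""
    warn, critical = THRESHOLDS[signal]
    return classify(value, warn, critical)
```

Under strict comparison a single unhealthy service rates as warn rather than critical; switch to inclusive comparisons if a reading equal to the Critical value should already page.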

Data sources

GET https://api.{site}/api/v1/validate - Auth + key sanity
GET https://api.{site}/api/v1/monitor - Monitor inventory + states + notification channels
GET https://api.{site}/api/v1/query - Run metric queries for threshold checks
POST https://api.{site}/api/v2/logs/events/search - Log volume + error-pattern detection
GET https://api.{site}/api/v1/synthetics/tests - Synthetic test inventory + uptime
GET https://api.{site}/api/v1/slo - SLO state + burn rate
GET https://api.{site}/api/v2/rum/applications - RUM page-load p95 + frustrated sessions
GET https://api.{site}/api/v2/incidents - Active incident inventory (revenue-at-risk join)
GET https://api.{site}/api/v1/usage/summary - Quota usage + cost-trend signals
GET https://api.{site}/api/v1/hosts - Host count + agent freshness + saturation
