# Datadog audit profile, Vortex IQ

> What the Vortex IQ Datadog health audit checks: coverage, reliability, cost, and revenue-at-risk.

**[Nerve Centre KPIs](/nerve-centre/kpi-cards/datadog) · [Audit Profile](/nerve-centre/kpi-cards/datadog/audit) · [Sentiment Settings](/nerve-centre/kpi-cards/datadog/sentiment)**

Datadog state means nothing to a merchant unless it's joined to revenue. This audit answers four questions: (1) is the merchant's stack healthy right now, (2) are the monitors that should be watching it actually watching it (coverage gaps, no-notification monitors, no-data drifters), (3) is the SLO error budget burning faster than the month is elapsing, and (4) when something IS broken, how much money is on fire per minute?

## What this audit checks

### Authentication & access

* API key + App key valid (auth on /api/v1/validate)
* Site host correct for region (US1 / EU1 / US3 / US5 / AP1)
* Custom-metric quota headroom > 15%
* Indexing quota headroom > 15%
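
The region check comes down to calling the right API host. A minimal sketch, assuming the standard Datadog site domains for the five regions listed and the `DD-API-KEY` / `DD-APPLICATION-KEY` request headers; `api_base` and `validate_request` are hypothetical names, not part of Vortex IQ:

```python
from urllib.request import Request

# Datadog site domains per region (assumption: taken from Datadog's public site list).
DATADOG_SITES = {
    "US1": "datadoghq.com",
    "EU1": "datadoghq.eu",
    "US3": "us3.datadoghq.com",
    "US5": "us5.datadoghq.com",
    "AP1": "ap1.datadoghq.com",
}


def api_base(region: str) -> str:
    """Return the API base URL for a region code such as 'EU1'."""
    try:
        return f"https://api.{DATADOG_SITES[region.upper()]}"
    except KeyError:
        raise ValueError(f"unknown Datadog region: {region!r}") from None


def validate_request(region: str, api_key: str, app_key: str) -> Request:
    """Build (but do not send) the GET request for /api/v1/validate."""
    return Request(
        f"{api_base(region)}/api/v1/validate",
        headers={"DD-API-KEY": api_key, "DD-APPLICATION-KEY": app_key},
        method="GET",
    )
```

Keys issued in one region are rejected by another region's host, so a failed validation with known-good keys may point to a wrong site setting rather than a revoked key.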

### Monitor coverage (the blind-spot test)

* Services without error-rate monitor
* Services without latency monitor
* Monitors in 'No Data' state >24h (lost telemetry)
* Monitors without notification channel (fires silently)
* Monitors flapping >3 times in last 24h (noisy or wrong threshold)
* Skipped monitors not re-enabled after >7d
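
Two of these checks can be approximated directly from a monitor inventory. The sketch below assumes each monitor is a dict with `name`, `overall_state`, and `message` fields, as in Datadog's monitor list response, and treats a message without any `@` handle as having no notification channel. That heuristic and the function name `coverage_gaps` are assumptions for illustration, and the sketch does not apply the 24h duration condition:

```python
def coverage_gaps(monitors: list[dict]) -> dict[str, list[str]]:
    """Group monitor names by the coverage problem they show."""
    gaps: dict[str, list[str]] = {"no_data": [], "no_notification": []}
    for monitor in monitors:
        name = monitor.get("name", "<unnamed>")
        # Monitors that have lost telemetry report the "No Data" state.
        if monitor.get("overall_state") == "No Data":
            gaps["no_data"].append(name)
        # Notifications are routed through @-handles in the message body,
        # so a message with no "@" is assumed to fire silently.
        if "@" not in (monitor.get("message") or ""):
            gaps["no_notification"].append(name)
    return gaps
```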

### Reliability & SLO health

* Apdex below 0.85 sustained > 30 min
* p95 above 1500ms sustained > 15 min
* Error rate > 2% sustained > 10 min
* Throughput dropped > 30% WoW (capacity / outage signal)
* SLO burn rate > 14.4× (fast-burn alert)
* Error budget remaining \< 20%
* SLO breach forecast within 7 days
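
The 14.4× figure is easier to read with the arithmetic spelled out. Burn rate is the observed error rate divided by the error rate the SLO allows, and at a constant 14.4× a one-hour window consumes about 2% of a 30-day error budget. The sketch below assumes a 30-day SLO period and a one-hour window (the source does not state either, although the `slo_burn_rate_1h` signal name suggests the latter); both function names are hypothetical:

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    allowed = 1.0 - slo_target
    if allowed <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / allowed


def budget_consumed(rate: float, window_hours: float,
                    slo_period_hours: float = 30 * 24) -> float:
    """Fraction of the period's error budget burned in the window,
    assuming the burn rate stays constant over that window."""
    return rate * window_hours / slo_period_hours
```

For example, a 99.9% SLO allows a 0.1% error rate, so a sustained 1.44% error rate is a 14.4× burn.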

### Synthetic & uptime

* Critical-path synthetic test failing (login / browse / cart / checkout)
* Region-specific uptime \< 99% (regional outage / CDN issue)
* Browser test latency p95 > 5000ms
* API monitor failures > 3 in 24h

### Real user monitoring (customer-experience signal)

* Page load p95 > 4000ms (conversion-killer threshold)
* Frustrated session rate > 5% (rage clicks / >4s loads)
* JS errors per session > 0.5
* Mobile-vs-desktop latency gap > 2× (mobile-experience regression)

### Logs & ingestion

* Log volume up > 30% vs prior period (cost spike or runaway logging)
* Error-level log rate > 10% of total (signal/noise drift)
* Ingestion freshness > 300s (lag = blind to live state)
* New error patterns emerging in last 24h
* Indexing cost trend up > 50% MoM

### Cost & capacity

* Custom-metric quota > 85% (impending overage charges)
* High-cardinality tag warnings (cost-amplifier flag)
* Infra spend trend > +15% MoM
* Hosts with stale agent (>24h since last report)
* Hosts with disk >90% full

### Incident response & service-health rollup

* Mean time to acknowledge (MTTA) > 30 min (paging / on-call gap)
* Mean time to resolve (MTTR) > 120 min (remediation drag)
* Any service in Down or Degraded state (service-health rollup)

### Cross-channel: revenue-at-risk (the killer area)

* Active incident with sibling commerce connector live = compute \$/min lost (commerce.revenue\_per\_min × incident\_minutes × estimated\_traffic\_loss\_pct)
* Checkout-service degradation (p95 > 3s) during peak hours
* 5xx spike during a campaign push (sibling = google\_ads / amazon\_ads / klaviyo) - paying for traffic that can't convert
* Conversion drop during incident windows (vs 90D baseline)
* Cart abandonment spike correlated with 5xx rate
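
The revenue-at-risk product in the first bullet can be sketched as follows. Two assumptions are made here: `traffic_loss_pct` is a percentage from 0 to 100, and the result is reported both as a per-minute rate and as a total over the incident, since the product as written includes `incident_minutes` and therefore yields a total. The function name and return shape are hypothetical:

```python
def revenue_at_risk(revenue_per_min: float, incident_minutes: float,
                    traffic_loss_pct: float) -> dict[str, float]:
    """Estimate revenue lost during an incident.

    traffic_loss_pct is assumed to be a percentage in the range 0-100.
    """
    if not 0 <= traffic_loss_pct <= 100:
        raise ValueError("traffic_loss_pct must be between 0 and 100")
    if incident_minutes < 0:
        raise ValueError("incident_minutes must not be negative")
    per_minute = revenue_per_min * (traffic_loss_pct / 100.0)
    return {
        "per_minute": per_minute,                 # money lost each minute
        "total": per_minute * incident_minutes,   # money lost so far
    }
```

With illustrative inputs of 120 per minute in revenue, a 45-minute incident, and a 40% traffic loss, this gives roughly 48 per minute and 2,160 in total.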

## Severity thresholds

| Signal                           | Warn | Critical |
| -------------------------------- | ---- | -------- |
| `apdex`                          | 0.9  | 0.85     |
| `error_rate_pct`                 | 1    | 2        |
| `mtta_min`                       | 5    | 30       |
| `mttr_min`                       | 30   | 120      |
| `services_unhealthy_count`       | 0    | 1        |
| `p95_latency_ms`                 | 1000 | 1500     |
| `throughput_change_pct_wow`      | -15  | -30      |
| `slo_burn_rate_1h`               | 6    | 14.4     |
| `error_budget_remaining_pct`     | 30   | 20       |
| `synthetic_uptime_pct`           | 99.9 | 99.5     |
| `rum_page_load_p95_ms`           | 3000 | 4000     |
| `rum_frustrated_session_pct`     | 3    | 5        |
| `monitors_in_no_data_count`      | 1    | 5        |
| `monitors_no_notification_count` | 0    | 1        |
| `custom_metric_quota_pct`        | 75   | 85       |
| `log_volume_change_pct_vsP`      | 20   | 30       |
| `ingestion_freshness_sec`        | 120  | 300      |
| `agent_stale_hosts_count`        | 1    | 5        |
| `disk_pct_max`                   | 80   | 90       |
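
The table mixes signals where higher is worse (for example `p95_latency_ms`) with signals where lower is worse (for example `apdex`). One way to evaluate both with a single rule is to infer the direction from the ordering of the two thresholds. This is a sketch under assumptions: the source does not say whether the boundaries are inclusive, so strict comparisons are used to match the `>` and `<` phrasing of the checklist, and `severity` is a hypothetical helper:

```python
def severity(value: float, warn: float, critical: float) -> str:
    """Classify a reading as 'ok', 'warn', or 'critical'.

    Direction is inferred from the thresholds: if warn > critical,
    lower readings are worse; otherwise higher readings are worse.
    """
    if warn > critical:
        # Lower is worse: apdex, uptime, remaining budget, throughput change.
        if value < critical:
            return "critical"
        if value < warn:
            return "warn"
    else:
        # Higher is worse: latency, error rate, counts, quota usage.
        if value > critical:
            return "critical"
        if value > warn:
            return "warn"
    return "ok"
```

Under strict comparison a count signal with a warn threshold of 0 warns on any non-zero count; if the audit treats boundaries as inclusive, the comparisons would need `<=` and `>=` instead.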

## Data sources

* `GET https://api.{site}/api/v1/validate` - Auth + key sanity
* `GET https://api.{site}/api/v1/monitor` - Monitor inventory + states + notification channels
* `GET https://api.{site}/api/v1/query` - Run metric queries for threshold checks
* `POST https://api.{site}/api/v2/logs/events/search` - Log volume + error-pattern detection
* `GET https://api.{site}/api/v1/synthetics/tests` - Synthetic test inventory + uptime
* `GET https://api.{site}/api/v1/slo` - SLO state + burn rate
* `GET https://api.{site}/api/v2/rum/applications` - RUM page-load p95 + frustrated sessions
* `GET https://api.{site}/api/v2/incidents` - Active incident inventory (revenue-at-risk join)
* `GET https://api.{site}/api/v1/usage/summary` - Quota usage + cost-trend signals
* `GET https://api.{site}/api/v1/hosts` - Host count + agent freshness + saturation
