Multi-window burn-rate alerting: anything above 14.4× eats 2% of the monthly budget every hour and the whole budget in about two days.
At a glance
The rate at which the merchant is consuming their monthly error budget, expressed as a multiple of the sustainable rate. A burn rate of 1× means errors are happening at exactly the rate the SLO permits over the month; 14.4× means 2% of the month’s error budget is being consumed every hour, so the entire budget would be gone in about 50 hours (720 / 14.4). For a merchant, this is “are we running through our acceptable-error allowance faster than we should?” Above 14.4× is an emergency; depending on how much budget remains, it predicts SLO breach within hours to a couple of days.
| API endpoint | Datadog SLO API, GET /api/v1/slo/{slo_id}/history for the time-series, GET /api/v1/slo for the SLO definitions. Burn rate is computed by the engine from the time-series. |
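| Metric basis | Error budget consumption per hour divided by the steady-state consumption rate the SLO permits. For a 99.9% SLO over 30 days the budget is 0.1% of the month’s requests, so the steady-state rate is 0.1% / 720 hours ≈ 0.00014% per hour. A 1-hour window that consumes 0.001% is about 7× burn. Equivalently, under even traffic, burn rate is the window’s error rate divided by the budget: 0.7% errors against a 0.1% budget is 7×. |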
| Aggregation window | 1-hour rolling window for the displayed value; multi-window alerting (1h + 5m, 6h + 30m) is configured server-side in Datadog. |
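| Severity threshold | P1 = above 14.4× (consumes 2% of the 30-day budget per hour and exhausts it in about 50 hours); P2 = above 6× (slow-burn alert threshold); P3 = above 3× (worth investigating). The 14.4× number comes from the Google SRE workbook: spending 2% of a 30-day budget in 1 hour means 0.02 × 720 h / 1 h = 14.4×. The 6× number is the same calculation for 5% of the budget in 6 hours: 0.05 × 720 / 6 = 6×. |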
| Alert pre-filtering | Synthetic test traffic and health-check endpoints excluded from SLO numerator/denominator at the SLO definition layer (configure in Datadog SLO query). |
| Log Management gating | Not used. Burn rate is computed from APM and metric data underlying the SLO; the card returns valid values regardless of Logs status. |
| Why “burn rate” instead of “errors per hour” | Burn rate normalises by your specific SLO target. A 99.9% SLO and a 99.99% SLO can both have the same raw error count yet very different burn rates because their budgets differ by a factor of 10. Burn rate makes the alert threshold portable across SLOs. |
| Multi-window alerting | Datadog’s recommended pattern: alert on (1h burn > 14.4× AND 5min burn > 14.4×) for fast-burn pages, (6h burn > 6× AND 30min burn > 6×) for slow-burn pages. Vortex IQ surfaces the 1h burn here; the multi-window logic is in Datadog. |
| Filtered hosts / services | The headline displays the highest burn-rate SLO across the account. Per-SLO breakdown lives on the table view. |
| Time zone | Account timezone for chart axes; UTC for cross-connector windowing. |
| Time window | 1H (rolling 1-hour burn rate) |
| Alert trigger | > 14.4× (fast burn): consumes 2% of the monthly budget per hour and would exhaust it in about 50 hours; pages on-call. |
| Roles | owner, engineering |
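
The Metric basis and Severity threshold rows can be sketched in a few lines of Python. This is a minimal illustration, not the engine's implementation: the function names `burn_rate`, `threshold_for`, and `severity` are hypothetical, and the calculation assumes traffic is spread evenly, so that a window's error rate can be compared directly with the budget.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    Both arguments are fractions, e.g. 0.007 for 0.7% and 0.999 for 99.9%.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("SLO target must be below 100%")
    return error_rate / budget


def threshold_for(budget_fraction: float, window_hours: float,
                  period_hours: float = 720.0) -> float:
    """Burn rate that spends `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * period_hours / window_hours


def severity(burn: float) -> str:
    """Map a 1-hour burn rate to the severity bands listed in the table."""
    if burn > 14.4:
        return "P1"
    if burn > 6:
        return "P2"
    if burn > 3:
        return "P3"
    return "OK"


# 2% of a 30-day budget in 1 hour, and 5% in 6 hours
fast = threshold_for(0.02, 1)
slow = threshold_for(0.05, 6)
print(round(fast, 1), round(slow, 1))
print(severity(burn_rate(0.007, 0.999)))
```

The same helper also shows why the threshold is portable: a 0.05% error rate is about 0.5× burn against a 99.9% target and about 5× against a 99.99% target.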
Calculation
Calculated automatically from your Datadog data. See the At a glance summary above for what the metric tracks and the worked example below for a typical reading.
Worked example
A US apparel brand on BigCommerce with two Datadog SLOs:
- SLO-CHK Checkout availability: 99.9% over 30 days. Budget: 43.2 minutes of unavailability per month.
- SLO-SRCH Search latency below 800 ms p95: 99.5% over 30 days. Budget: 3.6 hours of breach per month.
| SLO | 30-day target | Current 30-day | Budget remaining | 1h burn rate | What it means |
|---|---|---|---|---|---|
| SLO-CHK | 99.9% | 99.967% | 67% remaining | 17.2× | Will exhaust the remaining budget in about 28 hours at this rate |
| SLO-SRCH | 99.5% | 99.56% | 12% remaining | 3.1× | Slow-bleed, watch but no urgent action |
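
The budget sizes and the time-to-exhaustion arithmetic behind this example can be reproduced with a short sketch. The helper names are hypothetical, and the projection assumes the current burn rate stays constant, which real incidents rarely do.

```python
def error_budget_minutes(slo_target: float, period_days: int = 30) -> float:
    """Allowed minutes of breach for an SLO target over the period."""
    return (1.0 - slo_target) * period_days * 24 * 60


def budget_consumed(burn: float, hours: float,
                    period_hours: float = 720.0) -> float:
    """Fraction of the whole budget spent by `hours` at a constant burn rate."""
    return burn * hours / period_hours


def hours_to_exhaustion(budget_remaining: float, burn: float,
                        period_hours: float = 720.0) -> float:
    """Hours until the remaining budget fraction is gone at a constant burn."""
    if burn <= 0:
        return float("inf")
    return budget_remaining * period_hours / burn


print(round(error_budget_minutes(0.999), 1))        # checkout budget, minutes
print(round(error_budget_minutes(0.995) / 60, 1))   # search budget, hours
print(round(budget_consumed(17.2, 24), 2))          # checkout, next 24 hours
print(round(hours_to_exhaustion(0.67, 17.2), 1))    # checkout, hours left
```
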
- The checkout SLO is in fast-burn. A 17.2× burn rate over the next 24 hours would consume 24/720 × 17.2 = 57% of the entire monthly budget. The SLO has 67% remaining now; at 17.2× burn for 24 hours, only 10% would remain by tomorrow, and the budget would run out after about 28 hours (0.67 × 720 / 17.2). Engineering must act now or risk breach shortly after that.
- The search SLO is in slow-bleed. 3.1× burn rate is concerning but not an emergency: a deploy degraded latency slightly two days ago and 12% budget remains. Raise it at P3 severity rather than paging on-call. The team can plan a fix during normal hours, but should not leave it long: if 3.1× were sustained, the remaining 12% would last about 28 hours (0.12 × 720 / 3.1).
- The two SLOs have very different action paths. Fast-burn (checkout) demands immediate rollback; slow-bleed (search) demands a planned investigation. Conflating them produces poor outcomes; reading them separately produces an appropriate response for each.
- The 14.4× threshold is calibrated, not arbitrary. It comes from the Google SRE workbook: 14.4× is the rate that consumes 2% of a 30-day budget in one hour (0.02 × 720 = 14.4), and left unchecked it exhausts the whole budget in about 50 hours. Below 14.4× = “you have time to plan a fix”; above = “fix now or accept the breach”.
- Burn rate makes alert thresholds portable. A 99.9% SLO and a 99.99% SLO can have the same raw error count but very different burn rates because their budgets differ by a factor of 10. Engineering teams running multiple SLOs at different targets benefit from this normalisation.
- Fast-burn vs slow-bleed need different responses. Fast-burn = “rollback now”; slow-bleed = “schedule a fix this sprint”. Mistaking one for the other produces either over-paging (treating slow-bleed as fast-burn) or under-paging (treating fast-burn as slow-bleed). Multi-window alerting is the standard practice that distinguishes them.
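
The multi-window rule that separates fast-burn from slow-bleed can be sketched as follows. This is an illustration of the pattern described in the At a glance table (1h + 5m at 14.4×, 6h + 30m at 6×), not Datadog's monitor implementation; the `WindowPair` type and `firing_alerts` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowPair:
    name: str
    long_window: str
    short_window: str
    threshold: float


# Window pairs and thresholds as listed in the At a glance table.
PAIRS = (
    WindowPair("fast-burn", "1h", "5m", 14.4),
    WindowPair("slow-burn", "6h", "30m", 6.0),
)


def firing_alerts(burn_by_window: dict[str, float]) -> list[str]:
    """Return the alerts whose long AND short windows both exceed the threshold.

    Requiring the short window as well stops an alert from staying red
    long after the underlying problem has already recovered.
    """
    fired = []
    for pair in PAIRS:
        long_burn = burn_by_window.get(pair.long_window, 0.0)
        short_burn = burn_by_window.get(pair.short_window, 0.0)
        if long_burn > pair.threshold and short_burn > pair.threshold:
            fired.append(pair.name)
    return fired


# Checkout-style reading: the 1h burn is high and the last 5 minutes agree.
print(firing_alerts({"1h": 17.2, "5m": 16.0, "6h": 4.0, "30m": 15.0}))
```

In this sketch a high 1h value with a quiet 5m window fires nothing, which is the intended behaviour: the spike has already passed.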
Sibling cards merchants should reference together
| Card | Why pair it with SLO Burn Rate | What the combination tells you |
|---|---|---|
| Error Budget Remaining | The accumulated counterpart of burn rate. | Burn rate plus budget remaining tells you days until breach. |
| SLO Compliance (current period) | The SLO’s current period state. | Compliance dropping plus high burn rate equals “actively breaching”; compliance OK plus high burn rate equals “will breach if rate continues”. |
| Days Until SLO Breach (forecast) | The forecast: at this rate, when does the SLO breach? | Forecast under 7 days plus high burn rate equals page; forecast 30+ days plus low burn rate equals safe. |
| Error Rate | The driver of burn-rate spikes for error-based SLOs. | Error rate spike plus burn rate spike equals “the cause is server-side errors”. |
| p95 Response Time | The driver for latency-based SLOs. | Latency spike plus burn rate spike on a latency SLO equals “the cause is latency degradation”. |
| Operational Health Score | The composite that includes SLO compliance as a 25%-weight component. | Composite drop plus high burn rate equals “the SLO degradation is dragging the composite”. |
| Active Incidents | A high burn rate without an open incident is the surface to action. | Burn rate plus zero incidents equals “engineering has not declared this is real yet”. |
| Shopify / BC / Adobe Total Revenue | The merchant-impact peer. | Sustained high burn on a customer-path SLO typically corresponds to revenue dip. |
Reconciling against the vendor’s own dashboard
Where to look in Datadog: SLO List for the master list with per-SLO burn rate and budget remaining. SLO Detail (any SLO) for the time-series of compliance and burn rate. Monitors → SLO Alert Templates for the multi-window burn-rate alerts.

Why our number may legitimately differ from Datadog’s UI:
| Reason | Direction | Why |
|---|---|---|
| Time zone | Period-boundary effects | SLOs are defined over rolling 30-day windows in account timezone; Vortex IQ uses UTC for cross-connector arithmetic. |
| API rate limits | Brief gaps | The SLO API is rate-limited; cached values may be 2-5 minutes stale. |
| Log indexing latency | Affects log-based SLOs only | If your SLO query is log-based and Logs is gated, the SLO will read stale data. APM/metric-based SLOs are unaffected. |
| SLO calculation lag | 5-15 minutes | Datadog computes SLO compliance on a 5-minute schedule; sub-15-minute movements may trail. |
| Highest-burn aggregation | Either | Vortex IQ surfaces the highest burn rate across all SLOs as the headline; Datadog UI shows per-SLO views. The numbers match per-SLO; the headline is by design “the worst SLO right now”. |
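
The highest-burn aggregation in the last row amounts to taking the maximum over the per-SLO values. A minimal sketch, with a hypothetical `headline_slo` helper and the worked-example burn rates as input:

```python
from typing import Optional


def headline_slo(burn_by_slo: dict[str, float]) -> Optional[tuple[str, float]]:
    """Return (slo_id, burn_rate) for the SLO with the highest 1h burn rate."""
    if not burn_by_slo:
        return None
    worst = max(burn_by_slo, key=burn_by_slo.get)
    return worst, burn_by_slo[worst]


print(headline_slo({"SLO-CHK": 17.2, "SLO-SRCH": 3.1}))
```

When reconciling, compare the headline against the matching SLO's own view in Datadog rather than against any account-level figure.
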
| Card | Expected relationship | What causes the divergence |
|---|---|---|
| Datadog APM error rate / latency | The driver: SLOs are computed from APM metrics. Burn rate spikes follow APM spikes by 5-15 minutes. | A burn-rate spike without an APM spike means the SLO is consuming budget from a different source (log-based SLO, synthetic-based SLO). |
| shopify.total_revenue / bigcommerce.total_revenue / adobe_commerce.total_revenue | Sustained high burn on customer-facing SLOs typically corresponds to revenue dip. | High burn on internal-service SLOs (worker, batch, admin) does not correspond to revenue dip and is correctly excluded from merchant-impact alerting if you tag those SLOs customer_facing:false. |
| Stripe / PayPal Payment Health | When a payment-PSP outage drives 5xx, both Datadog burn rate and payment-health-score drop. | Independent peers both confirming equals high-confidence real incident; only one moving equals investigate one side. |
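
To spot-check a single SLO against Datadog by hand, the history endpoint named in the At a glance table can be queried directly. The sketch below only builds the request and does not send it. The helper name `build_slo_history_request`, the default host `api.datadoghq.com`, and the one-hour window are assumptions for illustration; accounts on other Datadog sites would need a different host.

```python
import time
import urllib.parse
import urllib.request
from typing import Optional


def build_slo_history_request(slo_id: str, api_key: str, app_key: str,
                              window_seconds: int = 3600,
                              site: str = "api.datadoghq.com",
                              now: Optional[int] = None) -> urllib.request.Request:
    """Build a GET request for /api/v1/slo/{slo_id}/history over a recent window."""
    to_ts = int(now if now is not None else time.time())
    from_ts = to_ts - window_seconds
    query = urllib.parse.urlencode({"from_ts": from_ts, "to_ts": to_ts})
    path = f"/api/v1/slo/{urllib.parse.quote(slo_id, safe='')}/history"
    return urllib.request.Request(
        f"https://{site}{path}?{query}",
        headers={
            "DD-API-KEY": api_key,          # Datadog API key
            "DD-APPLICATION-KEY": app_key,  # Datadog application key
        },
        method="GET",
    )


# Placeholder ID and keys; replace with real values before sending.
req = build_slo_history_request("example-slo-id", "API_KEY", "APP_KEY")
print(req.get_method(), req.full_url)
```

Sending it is a matter of passing `req` to `urllib.request.urlopen` and decoding the JSON body; the response layout should be checked against the Datadog API reference rather than assumed.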