Multi-window burn-rate alerting: a sustained burn above 14.4x exhausts a 30-day error budget in roughly two days.
At a glance
The rate at which you’re consuming your error budget, expressed as a multiple of the budget-spend-rate that would land you exactly at the SLO target by month-end. 1x = on-track; 14.4x = 2% of a 30-day budget burns in one hour (the whole budget in about 50 hours); 36x = a day’s budget burns in 40 minutes. This is the Google SRE-style fast-burn signal that tells you whether you can afford another minute of degradation.
| Field | Detail |
|---|---|
| What it counts | burn_rate = current_error_rate / (1 - slo_target) evaluated on a 1-hour rolling window. For a 99.9% SLO target, the allowable error rate is 0.1%; an actual error rate of 1.44% over the past hour produces 1.44 / 0.1 = 14.4x burn rate. |
| NerdGraph endpoint | NRQL via NerdGraph on Transaction events: SELECT percentage(count(*), WHERE error IS true) FROM Transaction WHERE appName = 'X' SINCE 1 HOUR AGO. The result is then divided by (1 - slo_target), where slo_target comes from the Service Level entity configured in NR. |
| Metric basis | Ratio of bad events (errors or latency-violations) to total events. The “bad” definition follows the SLI configured on the Service Level entity, e.g., “errors only”, “latency >1500ms only”, or “errors OR latency >1500ms”. |
| Aggregation window | 1-hour rolling for the live KPI; multi-window evaluation (5m / 1h / 6h / 24h) for graduated alerting following Google’s SRE workbook recommendations. |
| Browser vs APM scope | APM-only by default. Browser SLIs can be configured separately by pointing the SLI at PageView events; this card reads whichever SLI the merchant has configured on their NR Service Level entity. |
| Severity threshold | The 14.4x threshold corresponds to spending 2% of a 30-day error budget in one hour; sustained, it exhausts the whole budget in about 50 hours (30 days / 14.4). Lower thresholds (1x, 6x) are slow-burn warnings; 14.4x is fast-burn paging. 36x is “drop everything” territory. |
| Filtered hosts / services | Per Service Level entity scope. One card per SLO; merchants typically configure 3 to 5 SLOs (for example storefront availability, checkout latency, payment success rate, search response time). |
| Sample basis | Sample-corrected on high-cardinality accounts. Burn rate is a ratio so sampling preserves accuracy. |
| Time zone | UTC for the window evaluation; account timezone for chart display. |
| Time window | 24H (rolling 24-hour view of burn rate over time) |
| Alert trigger | >14.4x (fast burn). Multi-window confirmation (must breach on both 5m and 1h windows before paging) reduces false alarms. |
| Roles | owner, engineering |
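The burn-rate formula in the table above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the card's actual implementation; the function name `burn_rate` and the convention of passing rates as fractions are assumptions made for the example.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Return the burn-rate multiple for an observed error rate.

    Both arguments are fractions (assumed convention for this sketch),
    e.g. 0.0144 for 1.44% and 0.999 for a 99.9% SLO target.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_error_rate = 1.0 - slo_target  # 0.1% for a 99.9% target
    return error_rate / allowed_error_rate


if __name__ == "__main__":
    # The example from the table: 1.44% errors against a 99.9% target.
    print(round(burn_rate(0.0144, 0.999), 1))  # 14.4
```

A burn rate of 1.0 means the budget is being spent exactly as fast as the SLO allows; anything above 1.0 is overspending.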
Calculation
Calculated automatically from your New Relic data. See the At a glance summary above for what the metric tracks and the worked example below for a typical reading.
Worked example
A merchant has configured an SLO on the checkout-api service: 99.9% availability, measured as the ratio of non-5xx responses to total responses over a rolling 30-day window. The 30-day error budget = 0.1% x total_requests. With ~12M checkout requests/month, the budget = 12,000 errors.
Reading the card at three points across an evening:
| Time | 1h error rate | Burn rate | Interpretation |
|---|---|---|---|
| 18:00 | 0.06% | 0.6x | Burning slower than budget allows. Earning credit. |
| 20:00 | 0.18% | 1.8x | Burning ~2x budget; sustained 1 to 2 days = warning territory. |
| 21:30 | 1.58% | 15.8x | Fast-burn alert fires. Sustained for about 46 hours (30 days / 15.8) = monthly budget gone. |
At the 14.4x threshold the service produces 14.4 x 0.1% x 12M / 30 days = 14.4 x 12,000 / 30 = 5,760 errors/day, or 172,800 errors over 30 days, far above the 12,000 budget; the whole budget is gone in 12,000 / 5,760 ≈ 2.1 days. The alert fires because the trajectory, if the current rate continues, is unrecoverable.
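The projection above can be reproduced with a short Python sketch. The function names and the 30-day default are assumptions chosen for illustration; the only inputs are the burn-rate multiple and the monthly budget from the worked example.

```python
def hours_to_exhaust_budget(burn: float, slo_window_days: int = 30) -> float:
    """Hours of sustained burn needed to spend the whole error budget."""
    if burn <= 0:
        raise ValueError("burn must be positive")
    return slo_window_days * 24 / burn


def projected_errors_per_day(burn: float, budget_errors: float,
                             slo_window_days: int = 30) -> float:
    """Errors per day if the given burn rate is sustained."""
    daily_budget = budget_errors / slo_window_days  # 400/day for 12,000 errors
    return burn * daily_budget


if __name__ == "__main__":
    per_day = projected_errors_per_day(14.4, 12_000)
    print(round(per_day))                              # errors per day at 14.4x
    print(round(per_day * 30))                         # errors over 30 days
    print(round(hours_to_exhaust_budget(14.4), 1))     # hours until budget is gone
```

At 14.4x this gives about 5,760 errors per day and roughly 50 hours until a 12,000-error budget is fully spent.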
Why the multi-window check matters. A single bad 5-minute window can spike the 1h burn rate to 30x briefly even if the underlying 5m rate has already recovered. Multi-window alerting (must be >14.4x on both 5m and 1h before paging) prevents this kind of flutter. The Google SRE workbook recommends:
| Severity | 1h burn | 5m burn | Lookback to alert |
|---|---|---|---|
| Fast burn | >14.4x | >14.4x | 5 minutes |
| Medium burn | >6x | >6x | 30 minutes |
| Slow burn | >1x | >1x | 6 hours |
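The two-window confirmation in the table can be sketched as follows. This is a simplified illustration, not New Relic's alert engine: the `BurnTier` type and `classify` function are hypothetical names, and the caller is assumed to supply the long-window and short-window burn rates already computed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BurnTier:
    name: str
    threshold: float  # burn-rate multiple both windows must exceed


# Thresholds taken from the table above, checked from most to least severe.
TIERS = (
    BurnTier("fast", 14.4),
    BurnTier("medium", 6.0),
    BurnTier("slow", 1.0),
)


def classify(long_burn: float, short_burn: float) -> Optional[str]:
    """Return the most severe tier breached on BOTH windows, or None."""
    for tier in TIERS:
        if long_burn > tier.threshold and short_burn > tier.threshold:
            return tier.name
    return None


if __name__ == "__main__":
    print(classify(15.8, 16.0))  # both windows above 14.4x
    print(classify(30.0, 0.5))   # long window still high, short window recovered
```

Requiring both windows to breach means a spike that has already recovered on the short window does not page anyone, which is the flutter the paragraph above describes.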
At the 21:30 reading on checkout-api (1.58% error rate), ~158 customers per 10,000 requests are hitting a 5xx during checkout. At ~140 checkout requests / minute, that’s ~2.2 failed checkouts / minute, or ~133 / hour. At an average AOV of £85 and ~80% recovery probability (some customers retry), the irrecoverable revenue loss is 133 x £85 x 20% = £2,261 / hour. Sustained for 4 hours = ~£9,000 of GMV impact, plus reputational drag. The fast-burn alert is calibrated to fire well before this becomes catastrophic.
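The revenue estimate can be written as a small Python sketch. The function name and parameters are hypothetical; the inputs are the figures from the paragraph above, and the model assumes every failed request is one lost checkout attempt.

```python
def hourly_revenue_loss(error_rate: float, requests_per_minute: float,
                        average_order_value: float,
                        recovery_probability: float) -> float:
    """Estimate irrecoverable revenue per hour from failed checkouts.

    error_rate and recovery_probability are fractions (0.0158, 0.80).
    """
    failed_per_hour = error_rate * requests_per_minute * 60
    unrecovered_share = 1.0 - recovery_probability
    return failed_per_hour * average_order_value * unrecovered_share


if __name__ == "__main__":
    loss = hourly_revenue_loss(0.0158, 140, 85.0, 0.80)
    print(round(loss, 2))      # per hour, in pounds
    print(round(loss * 4, 2))  # sustained for four hours
```

Without intermediate rounding the result is about £2,256 per hour, slightly below the £2,261 obtained when failed checkouts are first rounded up to 133.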
If the underlying issue resolves and the next 1h window settles to a 0.4% error rate, the burn rate drops to 4x: still over budget, but no longer fast-burn. The card returns to green when the 1h burn drops below 1x; until then the message is “you’re spending faster than you can rebuild”.
Sibling cards merchants should reference together
| Card | Why pair it with SLO Burn Rate |
|---|---|
| SLO Compliance (current period) | Outcome metric. Burn rate is the velocity; compliance is the position. Sustained high burn = compliance drops. |
| Error Budget Remaining | The fuel gauge. Burn rate tells you how fast you’re spending; budget remaining tells you what’s left. |
| Days Until SLO Breach (forecast) | Forward-projection. Combines current burn with remaining budget to estimate when you’d breach. |
| Error Rate | The numerator of burn rate. Open this when burn rate spikes. |
| Operational Health Score | Composite parent. SLO compliance is 25% of the score; sustained high burn pulls the score down. |
| Datadog SLO Burn Rate | Cross-connector peer. DD SLOs implement the same SRE-workbook math; numbers should match if SLI definitions are aligned. |
| Shopify Checkout Conversion | Customer-side outcome. Sustained fast-burn on availability SLOs typically drops checkout conversion 5 to 15%. |
| GA4 Conversion Rate | Whole-funnel outcome. Pairs to quantify whether the SLO breach is felt by real customers. |
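The relationship between Burn Rate, Error Budget Remaining and Days Until SLO Breach can be illustrated with a rough Python sketch. This is not how the forecast card is documented to work; the function name, the 30-day default and the assumption that the current burn rate simply continues unchanged are all made up for the example.

```python
def days_until_breach(budget_remaining_fraction: float, burn: float,
                      slo_window_days: int = 30) -> float:
    """Naive forecast: days until the remaining budget is spent.

    budget_remaining_fraction is the share of the period's budget still
    unspent (1.0 = untouched). Assumes the burn rate stays constant.
    """
    if burn <= 0:
        return float("inf")  # nothing is being spent, so no breach forecast
    # At 1x, a full budget lasts exactly the SLO window.
    return budget_remaining_fraction * slo_window_days / burn


if __name__ == "__main__":
    # Half the budget left while burning at the fast-burn threshold.
    print(round(days_until_breach(0.5, 14.4), 2))
```

With half the budget left and a sustained 14.4x burn, the sketch gives roughly one day until breach, which is why the three cards are worth reading together.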
Reconciling against the vendor’s own dashboard
Where to look in New Relic:
- Service Levels is the canonical screen. Each Service Level entity has a “Burn rate” tab that displays the same multi-window evaluation.
- Alerts & AI > Conditions for the alert configuration that backs the 14.4x threshold.
- Dashboards > “SLO overview” pre-built dashboard.
| Reason | Direction of divergence |
|---|---|
| Account timezone vs UTC. NR Service Levels follow the account timezone for display; Vortex IQ NRQL runs in UTC. Boundary-window rollups can drift 0.1 to 0.3x. | Either direction at hour boundaries |
| NRQL retention windows. Burn rate is a 1h rolling window so retention isn’t a concern for live evaluation. Historical burn-rate trend over 30+ days uses rolled aggregates. | None for live |
| Ingest sampling. Burn rate is a ratio so sampling preserves accuracy. Counts may differ from raw logs but the rate stays correct. | None |
| NerdGraph rate limits. Default 3,000 points / minute / account. Burn rate queries are cheap; rate-limiting only relevant during heavy investigation. | Stale, not wrong |
| SLI definition scope. Vortex IQ reads the SLI configured on the Service Level entity. If the merchant edits the SLI definition mid-period, the historical numbers shown in NR’s UI may use old SLI; Vortex IQ uses current. | Either direction during transitions |