At a glance
The percentage of HTTP responses returning a 5xx status code (most commonly 500, 502, 503, or 504). For a merchant, these are the unambiguous server failures: the shopper made a request, the server tried, and the server failed. Unlike the broader Error Rate (which can include 4xx errors if your team configured them as such), 5xx is purely server-side breakage. Above 0.5% something is genuinely broken.
| Field | Detail |
|---|---|
| API endpoint | Datadog Metrics API, GET /api/v1/query with sum:trace.servlet.request.errors{http.status_code:5*}.as_count() / sum:trace.servlet.request.hits{*}.as_count() * 100. |
| Metric basis | APM span counts filtered by http.status_code:5*, divided by total spans. Strict 5xx-only, NOT broader error tagging. |
| Aggregation window | 1-minute rollup at source; 5-minute rolling window for the displayed value. |
| Severity threshold | P1 = 5xx rate above 5% (active outage); P2 = above 1% (alert trigger); P3 = above 0.3% (warning). |
| Alert pre-filtering | Synthetic test traffic and health-check endpoints excluded by default. Without this, a routine deploy that briefly returns 503 on /health skews the rate. |
| Log Management gating | Not used. The 5xx rate is APM-derived; the card returns valid values regardless of Logs status. |
| Why a separate card from Error Rate | Some teams instrument 4xx (401, 403, 404, 429) as error:true for visibility; this inflates the broader Error Rate above the actual server-failure rate. The 5xx card is the strict “server actually broke” view, useful when you need to know whether the elevated Error Rate reflects real breakage or instrumentation choices. |
| Filtered hosts / services | All instrumented services. For per-service breakdown see Errors by Endpoint. |
| Time zone | Account timezone for chart axes; UTC for cross-connector windowing. |
| Common 5xx codes | 500 = generic server error (code threw an unhandled exception); 502 = bad gateway (upstream service unreachable or returned an invalid response); 503 = service unavailable (often rate-limit or capacity); 504 = gateway timeout (upstream service did not respond in time). |
| Time window | T/7D vsP (today vs prior 7-day average) |
| Alert trigger | > 1%: a 5xx rate sustained above 1% for 5 minutes pages on-call. |
| Sentiment key | error_rate |
| Roles | owner, engineering |
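As a rough illustration of how the query in the table could be issued and the percentage derived, the sketch below builds the request URL for the Datadog Metrics query endpoint and computes the rate from two span counts. The helper names (`build_query_url`, `five_xx_rate`), the `api.datadoghq.com` host (which varies by Datadog site), and the zero-traffic handling are assumptions for illustration, not the connector's actual implementation.

```python
from urllib.parse import urlencode

# Query string as given in the "API endpoint" row above.
QUERY = (
    "sum:trace.servlet.request.errors{http.status_code:5*}.as_count() / "
    "sum:trace.servlet.request.hits{*}.as_count() * 100"
)


def build_query_url(start_epoch: int, end_epoch: int,
                    host: str = "api.datadoghq.com") -> str:
    """Build the GET /api/v1/query URL for a window given in epoch seconds."""
    params = urlencode({"from": start_epoch, "to": end_epoch, "query": QUERY})
    return f"https://{host}/api/v1/query?{params}"


def five_xx_rate(five_xx_spans: int, total_spans: int) -> float:
    """Return the 5xx rate as a percentage; 0.0 when there is no traffic (assumption)."""
    if total_spans <= 0:
        return 0.0
    return 100.0 * five_xx_spans / total_spans


if __name__ == "__main__":
    # A 5-minute window, matching the rolling window of the displayed value.
    print(build_query_url(1_700_000_000, 1_700_000_300))
    print(round(five_xx_rate(78, 145_200), 2))
```

A real request would also need the `DD-API-KEY` and `DD-APPLICATION-KEY` headers; they are left out here so the sketch runs without credentials.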
Calculation
Calculated automatically from your Datadog data. See the At a glance summary above for what the metric tracks and the worked example below for a typical reading.
Worked example
A US homewares brand on Shopify with Datadog APM. Steady-state 5xx rate sat at 0.05% (mostly transient 503 from auto-scaling events). On 21 Apr 26 at 16:30 EST a payment-processor outage at Stripe’s PSP layer caused upstream 502 responses to cascade into the brand’s checkout service.
| Time (UTC, EST+5) | Total spans | 5xx spans | 5xx % | Top status | What was happening |
|---|---|---|---|---|---|
| 21:00 (16:00 EST) | 145,200 | 78 | 0.05% | 503 (capacity) | Steady-state |
| 21:30 (16:30 EST) | 148,100 | 920 | 0.62% | 502 (bad gateway) | Stripe PSP began returning 502 |
| 21:45 | 152,400 | 4,650 | 3.05% | 502 | Cascade through checkout service |
| 22:00 | 146,800 | 8,820 | 6.01% | 502, 504 | Peak; payment fully unavailable |
| 22:30 | 149,200 | 3,150 | 2.11% | 502 | Stripe began recovering |
| 23:00 | 147,500 | 410 | 0.28% | 503 | Resolved |
- 5xx rate above 1% is always a paging event. Unlike latency or error rate, 5xx represents unambiguous server failure. There is no “subjective” interpretation; the response code is in the standard. The 1% threshold catches real problems while staying above the steady-state noise floor of healthy sites (typically 0.05-0.1%).
- Most 5xx incidents are upstream, not your code. This brand’s checkout service did not have a bug; the bug was at Stripe’s PSP. Modern ecommerce stacks have many third-party dependencies (payment, fraud, fulfilment, search-as-a-service); when any one returns 5xx, your service surfaces it as 502 or 504. The fix is rarely “fix our code” and often “switch fallback paths” or “wait for upstream to recover”.
- Pair 5xx rate with Error Rate to triangulate cause. If 5xx is up and Error Rate is up by similar amounts, the failures are real server breakage. If Error Rate is up but 5xx is flat, the failures are 4xx-marked-as-error (often auth or rate-limit). The diagnosis path is different: 5xx investigation goes upstream; 4xx investigation goes to user-facing logic.
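To make the arithmetic in the worked example easy to check, the sketch below recomputes each 5xx percentage from the span counts in the table and maps it to the P1/P2/P3 tiers listed under At a glance. The function names and the choice to return `None` below the P3 threshold are assumptions made for illustration.

```python
from typing import Optional

# (UTC time, total spans, 5xx spans) copied from the worked-example table.
READINGS = [
    ("21:00", 145_200, 78),
    ("21:30", 148_100, 920),
    ("21:45", 152_400, 4_650),
    ("22:00", 146_800, 8_820),
    ("22:30", 149_200, 3_150),
    ("23:00", 147_500, 410),
]


def severity(rate_pct: float) -> Optional[str]:
    """Map a 5xx percentage to the severity tiers listed under At a glance."""
    if rate_pct > 5.0:
        return "P1"
    if rate_pct > 1.0:
        return "P2"
    if rate_pct > 0.3:
        return "P3"
    return None


def classify(readings):
    """Return (time, rate rounded to 2 decimal places, severity) per reading."""
    rows = []
    for time_utc, total, five_xx in readings:
        rate = 100.0 * five_xx / total
        rows.append((time_utc, round(rate, 2), severity(rate)))
    return rows


if __name__ == "__main__":
    for row in classify(READINGS):
        print(row)
```

On these counts the 21:30 reading (0.62%) lands in P3, 21:45 and 22:30 in P2, and the 22:00 peak (6.01%) in P1, while the 21:00 and 23:00 readings stay below every tier.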
Sibling cards merchants should reference together
| Card | Why pair it with 5xx Rate | What the combination tells you |
|---|---|---|
| Error Rate | The broader peer (includes 4xx-marked-as-error). | 5xx up plus broader Error Rate up by same amount equals real server breakage; only Error Rate up equals 4xx instrumentation choices. |
| Errors by Endpoint | The breakdown view: which endpoints serve the 5xx. | One endpoint dominating equals isolated bug; many endpoints equals upstream cascade. |
| Top Error Messages | The triage view. | Same exception dominating equals single root cause; varied messages equals broader infrastructure issue. |
| Apdex Score | Apdex penalises 5xx as failed satisfaction. | Apdex dropping by more than the 5xx rate alone would predict means latency is also degraded (cascading). |
| Cart Abandonment During 5xx Spikes | The merchant-impact card. | Quantifies the abandonment cost while 5xx is elevated. |
| Conversion Drop During Incidents | The post-incident measured-loss card. | Pair with this card to compute the financial cost of a 5xx event. |
| Stripe Auth Rate / PayPal Capture Success | Payment gateway peer; many 5xx incidents are upstream payment-PSP outages. | 5xx up plus Stripe auth rate down equals payment-processor cascade; 5xx up plus payment OK equals your code or a different upstream. |
| Shopify / BC / Adobe Total Revenue | The downstream impact metric. | A sustained 5xx event above 1% should produce a measurable revenue dip within 15 minutes. |
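The Error Rate pairing in the table amounts to a small decision rule, sketched below. The inputs are changes in percentage points against baseline. The `tolerance` value used to decide what counts as "the same amount", the function name, and the two fallback labels are hypothetical choices for illustration, not values taken from the product.

```python
def triangulate(five_xx_delta: float, error_rate_delta: float,
                tolerance: float = 0.25) -> str:
    """Classify a change (in percentage points) in 5xx Rate and Error Rate.

    tolerance is a hypothetical allowance for "up by the same amount".
    """
    five_xx_up = five_xx_delta > tolerance
    error_up = error_rate_delta > tolerance
    if five_xx_up and error_up and abs(five_xx_delta - error_rate_delta) <= tolerance:
        return "real server breakage"
    if error_up and not five_xx_up:
        return "4xx marked as error (instrumentation choice)"
    if five_xx_up:
        # Not covered by the table: the two cards moved by different amounts.
        return "5xx elevated; Error Rate moved differently, inspect both"
    return "no significant change"


if __name__ == "__main__":
    print(triangulate(3.0, 3.1))
    print(triangulate(0.0, 2.0))
```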
Reconciling against the vendor’s own dashboard
Where to look in Datadog:
- APM → Service List filtered by HTTP status code 5xx.
- APM → Traces for individual 5xx trace examples.
- Dashboards → APM Overview for the time-series view.
- Logs → Live Tail filtered by status:error for the request bodies (gated by Log Management).
Why our number may legitimately differ from Datadog’s UI:
| Reason | Direction | Why |
|---|---|---|
| Time zone | Boundary days off | Datadog UI displays in account timezone; Vortex IQ uses UTC. |
| API rate limits | Brief gaps | The Metrics query API is rate-limited; cached values may be 1-2 minutes stale on burst minutes. |
| Log indexing latency | Not applicable | 5xx rate is APM-derived. |
| Span sampling | Both directions | Head-based sampling reduces absolute counts but the rate is unbiased; tail-based sampling that prefers errors will inflate the rate. |
| Status-code tagging | Vortex IQ stricter | Some teams tag synthetic 503s during graceful-shutdown as error:false to suppress noise; the strict 5xx query still counts them. |
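The span-sampling row can be checked with simple expected-value arithmetic. The sketch below applies separate keep ratios to OK spans and 5xx spans and recomputes the percentage. The traffic figures (100,000 spans, 100 of them 5xx) and the 10% keep ratio are made-up illustration values, and the function name is an assumption.

```python
def sampled_rate(total: int, five_xx: int,
                 keep_ok: float, keep_error: float) -> float:
    """Expected 5xx percentage after keeping OK and 5xx spans at the given ratios."""
    kept_error = five_xx * keep_error
    kept_ok = (total - five_xx) * keep_ok
    kept_total = kept_error + kept_ok
    if kept_total == 0:
        return 0.0
    return 100.0 * kept_error / kept_total


if __name__ == "__main__":
    total, five_xx = 100_000, 100  # true rate: 0.1%
    # Head-based: every span is kept with the same 10% probability.
    print(round(sampled_rate(total, five_xx, 0.10, 0.10), 2))
    # Tail-based, error-preferring: all 5xx spans kept, 10% of the rest.
    print(round(sampled_rate(total, five_xx, 0.10, 1.00), 2))
```

With uniform head-based sampling the expected rate stays at 0.1%; keeping every 5xx span but only 10% of the rest pushes the apparent rate to roughly 0.99%.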
| Card | Expected relationship | What causes the divergence |
|---|---|---|
| Datadog logs | Subset relationship: error logs are typically a subset of 5xx spans (each 5xx span produces zero or one log entry). | Log rate higher than span rate equals workers/services not covered by APM emitting error logs; log rate lower than span rate is normal. |
| stripe.stripe_payment_health_score / paypal.pp_payment_health_score | Payment-PSP cascade peer. | When 5xx spikes on the checkout service AND payment-health drops simultaneously, the cause is upstream PSP outage. |
| shopify.total_revenue / bigcommerce.total_revenue / adobe_commerce.total_revenue | Inverse: 5xx above 1% sustained equals revenue drop within 15 minutes. | Revenue drop without 5xx spike equals latency-driven or RUM-side issue; 5xx spike without revenue drop equals 5xx is on non-customer paths. |
Known limitations / merchant FAQs
My broader Error Rate is amber but 5xx Rate is green. What gives?
Your team has instrumented some 4xx responses as error:true. Common cases: 401 (auth failure), 403 (permission denied), 429 (rate-limited). These are surfaced as errors in the broader rate but are not server breakage; they are user-facing logic. The 5xx Rate filters strictly on HTTP 5xx codes, giving you the “is the server actually broken” view. Both numbers are useful: broader Error Rate for “how many shoppers are seeing a problem”, 5xx Rate for “how many of those problems are server-side”.
What is the difference between 502, 503, 504, and 500?
500 (Internal Server Error): your application threw an unhandled exception. 502 (Bad Gateway): a gateway or proxy could not reach an upstream service, or received an invalid response from it. 503 (Service Unavailable): the server is up but explicitly refusing to serve (often rate-limit, capacity, or maintenance mode). 504 (Gateway Timeout): an upstream service did not respond within the gateway's timeout. The status code tells you where the problem is: 500 means in your code; 502/504 mean upstream; 503 often means capacity.
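The mapping from status code to likely problem location can be written as a small lookup. The sketch below finds the dominant 5xx code in a batch of responses and returns the place to start looking. The label strings paraphrase the answer above, and the function and constant names are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# Where to start looking, paraphrased from the FAQ answer above.
LIKELY_SOURCE = {
    500: "application code (unhandled exception)",
    502: "upstream dependency (bad gateway)",
    503: "capacity, rate limiting, or maintenance",
    504: "upstream dependency (timeout)",
}


def top_status_hint(status_codes: Iterable[int]) -> Optional[Tuple[int, str]]:
    """Return the most frequent 5xx code and its likely source, or None if there is none."""
    five_xx = [code for code in status_codes if 500 <= code <= 599]
    if not five_xx:
        return None
    top_code, _count = Counter(five_xx).most_common(1)[0]
    return top_code, LIKELY_SOURCE.get(top_code, "other server-side failure")


if __name__ == "__main__":
    print(top_status_hint([200, 502, 502, 504, 200]))
```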
Why is my 5xx rate elevated only on weekends?
Most likely cause: weekend auto-scaling configuration. Many merchants reduce overnight/weekend capacity to save cost, which means a single traffic spike (a viral social post, a friend-of-a-friend share) can push the smaller fleet into 503 territory. Solution: tune the auto-scaling minimum or set a different overnight threshold. Check Container Restart Storm for confirmation.
Datadog says 5xx rate is fine but customers are reporting “Server Error” pages.
Three possible causes: (1) The 5xx is on a code path Datadog is not instrumenting (a third-party widget that loads and fails on the page); (2) The 5xx is being served by your CDN before reaching the origin (Cloudflare, Fastly, AWS CloudFront), and the CDN is not feeding into Datadog; (3) The customer is hitting a stale cached error page from earlier today. Run Critical-Path Tests Status to check the synthetic-test view of customer journeys.
My 5xx rate spiked for 30 seconds and then went back to normal. Should I investigate?
Probably not. 30-second spikes are usually deploy-related (graceful-shutdown of old containers while new ones come up) or auto-scaling-related (brief 503 while a new replica warms up). Check the timeline against your deploy markers. If the spike coincides with a deploy, the spike is expected; if it does not, investigate.
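A short spike also does not page by itself, because the alert trigger requires the rate to stay above 1% for 5 minutes. The sketch below shows one way such a check could work over 1-minute rollup values. Treating "sustained" as five consecutive 1-minute buckets above the threshold is an assumption, as is the function name; the actual alert evaluation may differ.

```python
from typing import Iterable


def should_page(per_minute_rates: Iterable[float],
                threshold_pct: float = 1.0,
                sustain_minutes: int = 5) -> bool:
    """True if the rate exceeds the threshold for sustain_minutes consecutive buckets."""
    run = 0
    for rate in per_minute_rates:
        # Extend the current run while above threshold, otherwise reset it.
        run = run + 1 if rate > threshold_pct else 0
        if run >= sustain_minutes:
            return True
    return False


if __name__ == "__main__":
    # A 30-second spike shows up in a single 1-minute bucket.
    print(should_page([0.05, 4.0, 0.05, 0.05, 0.05, 0.05]))
    # An incident that stays elevated for five minutes.
    print(should_page([0.6, 1.5, 3.0, 6.0, 5.5, 2.1, 0.3]))
```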
My Logs API returns 400 No valid indexes. Does this card still work?
Yes. 5xx Rate is APM-derived. The Vortex IQ engine logs the gating event once at INFO level and continues serving APM-derived cards normally. Log Management gating only affects log-volume cards.
Why does my 5xx rate include health-check 503s?
By default, it does not. The pre-filtering excludes paths matching /health, /ping, /metrics, /status, /livez, /readyz. If your team uses a non-standard health-check path, add it to the connector’s exclusion config. A common pitfall: a custom load-balancer health-check at /_lb-health that returns 503 during graceful-shutdown will pollute the rate unless excluded.
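The exclusion rule can be pictured as a path filter applied before counting. The sketch below uses the default paths listed above and accepts extra paths such as a custom load-balancer check. The matching behaviour (exact match or path-segment prefix), the function name, and the `extra` parameter are assumptions; the connector's real exclusion config may match differently.

```python
from typing import Tuple

# Default health-check paths named in the answer above.
DEFAULT_EXCLUDED = ("/health", "/ping", "/metrics", "/status", "/livez", "/readyz")


def is_excluded(path: str, extra: Tuple[str, ...] = ()) -> bool:
    """True when the request path is a health-check path that should not be counted.

    Matching is exact or by path-segment prefix (an assumption), so
    "/health/db" is excluded while "/healthy-recipes" is not.
    """
    clean = path.split("?", 1)[0].rstrip("/") or "/"
    for prefix in DEFAULT_EXCLUDED + tuple(extra):
        if clean == prefix or clean.startswith(prefix + "/"):
            return True
    return False


if __name__ == "__main__":
    print(is_excluded("/health"))
    print(is_excluded("/_lb-health"))
    print(is_excluded("/_lb-health", extra=("/_lb-health",)))
```

In this sketch the custom `/_lb-health` path is counted until it is passed in through `extra`, which mirrors the pitfall described above.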
Datadog’s Service Catalog shows my 5xx by service; this card aggregates. Why?
The aggregate is the right merchant headline because shoppers do not care which service returned the 5xx; they care that something did. For per-service breakdown use Errors by Endpoint. The aggregate is also stable: a single misbehaving service does not look like a fleet-wide outage.
My team uses the Datadog error:true span tag, not status codes. Are 5xxs still counted?
Yes. The query is http.status_code:5*, which reads the response code regardless of whether the span is also tagged error:true. Some teams set error:false on graceful-shutdown 503s to suppress alert noise; the strict 5xx Rate query still counts them, which is correct (a real 503 to a real shopper still failed their request).
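The difference between counting by status code and counting by the error tag is easiest to see on a small set of spans. The sketch below computes both rates over the same 100 made-up spans: three 401s tagged error:true, one 500 tagged error:true, and one graceful-shutdown 503 tagged error:false. The `Span` type and function names are hypothetical and only model the two fields the answer above discusses.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    status_code: int
    error: bool  # value of the span's error tag


def strict_5xx_rate(spans: List[Span]) -> float:
    """Percentage of spans with a 5xx status code, ignoring the error tag."""
    if not spans:
        return 0.0
    hits = sum(1 for span in spans if 500 <= span.status_code <= 599)
    return 100.0 * hits / len(spans)


def broad_error_rate(spans: List[Span]) -> float:
    """Percentage of spans tagged error:true, whatever their status code."""
    if not spans:
        return 0.0
    hits = sum(1 for span in spans if span.error)
    return 100.0 * hits / len(spans)


if __name__ == "__main__":
    spans = (
        [Span(200, False)] * 95
        + [Span(401, True)] * 3
        + [Span(500, True), Span(503, False)]
    )
    print(strict_5xx_rate(spans), broad_error_rate(spans))
```

Here the strict rate is 2% (the 500 and the 503) while the tag-based rate is 4% (the three 401s and the 500), so the two cards can legitimately disagree in either direction on individual spans.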
Why is my 5xx baseline 0.05% rather than 0%? Is the site always slightly broken?
A small steady-state 5xx rate (0.01-0.1%) is normal and expected. Common sources: (1) Brief 503s during deploys and auto-scaling, (2) Connections dropped by intermediate proxies, (3) Bot and scanner traffic probing non-existent or malformed paths that trigger unhandled exceptions, (4) Tail-end retries that hit a service mid-restart. The 1% alert threshold is calibrated above this floor.