5xx Response Rate, Datadog

Metrics type: Key Metrics • Category: Monitoring

At a glance

The percentage of HTTP responses returning a 5xx status code (500, 502, 503, 504). For a merchant, these are the unambiguous server failures: the shopper made a request, the server tried, and the server failed. Unlike the broader Error Rate (which can include 4xx errors if your team configured them as such), 5xx is purely server-side breakage. Above 0.5% something is genuinely broken.


API endpoint	Datadog Metrics API, `GET /api/v1/query` with `sum:trace.servlet.request.errors{http.status_code:5*}.as_count() / sum:trace.servlet.request.hits{}.as_count() * 100`.
Metric basis	APM span counts filtered by `http.status_code:5*`, divided by total spans. Strict 5xx-only, NOT broader error tagging.
Aggregation window	1-minute rollup at source; 5-minute rolling window for the displayed value.
Severity threshold	P1 = 5xx rate above 5% (active outage); P2 = above 1% (alert trigger); P3 = above 0.3% (warning).
Alert pre-filtering	Synthetic test traffic and health-check endpoints excluded by default. Without this, a routine deploy that briefly returns 503 on `/health` skews the rate.
Log Management gating	Not used. The 5xx rate is APM-derived; the card returns valid values regardless of Logs status.
Why a separate card from Error Rate	Some teams instrument 4xx (401, 403, 404, 429) as `error:true` for visibility; this inflates the broader Error Rate above the actual server-failure rate. The 5xx card is the strict “server actually broke” view, useful when you need to know whether the elevated Error Rate reflects real breakage or instrumentation choices.
Filtered hosts / services	All instrumented services. For per-service breakdown see Errors by Endpoint.
Time zone	Account timezone for chart axes; UTC for cross-connector windowing.
Common 5xx codes	500 = generic server error (code threw an unhandled exception); 502 = bad gateway (upstream service unreachable); 503 = service unavailable (often rate-limit or capacity); 504 = gateway timeout (upstream service responded too slowly).
Time window	`T/7D vsP` (today vs prior 7-day average)
Alert trigger	`> 1%`, sustained 5xx above 1% for 5 minutes pages on-call.
Sentiment key	`error_rate`
Roles	owner, engineering
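
The severity bands and the sustained-alert rule in the table above can be sketched as follows. This is a minimal illustration rather than the connector's implementation: the names `classify_severity` and `should_page`, and the assumption that the input is one rate sample per minute, are hypothetical.

```python
from typing import Optional, Sequence

# Thresholds taken from the At a glance table (percent of responses that are 5xx).
P1_THRESHOLD = 5.0  # active outage
P2_THRESHOLD = 1.0  # alert trigger
P3_THRESHOLD = 0.3  # warning


def classify_severity(rate_pct: float) -> Optional[str]:
    """Map a 5xx rate (in percent) to a severity band, or None if below P3."""
    if rate_pct > P1_THRESHOLD:
        return "P1"
    if rate_pct > P2_THRESHOLD:
        return "P2"
    if rate_pct > P3_THRESHOLD:
        return "P3"
    return None


def should_page(per_minute_rates: Sequence[float], minutes: int = 5) -> bool:
    """True when the most recent `minutes` samples are all above the 1% trigger.

    Assumes one sample per minute, oldest first (hypothetical input shape).
    """
    if len(per_minute_rates) < minutes:
        return False
    return all(rate > P2_THRESHOLD for rate in per_minute_rates[-minutes:])
```

For example, `classify_severity(3.05)` falls in the P2 band, and a series only pages once five consecutive samples sit above 1%.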

Calculation

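Calculated automatically from your Datadog data: APM spans with an HTTP 5xx status code, divided by total APM spans, multiplied by 100, shown as a 5-minute rolling value. See the At a glance summary above for what the metric tracks and the worked example below for a typical reading.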
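
As a rough sketch of the calculation, the snippet below builds the request URL for the Metrics query endpoint and computes the percentage from two span counts. The query string mirrors the one in the At a glance table, using the `5*` wildcard described under Metric basis. The helper names, the default base URL (`https://api.datadoghq.com`, which varies by Datadog site), and the omission of authentication headers are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Query mirroring the At a glance table, with the `5*` wildcard filter.
FIVE_XX_QUERY = (
    "sum:trace.servlet.request.errors{http.status_code:5*}.as_count()"
    " / sum:trace.servlet.request.hits{}.as_count() * 100"
)


def build_query_url(start_epoch: int, end_epoch: int,
                    base_url: str = "https://api.datadoghq.com") -> str:
    """Build a GET /api/v1/query URL for a time range given in epoch seconds."""
    params = urlencode({"from": start_epoch, "to": end_epoch, "query": FIVE_XX_QUERY})
    return f"{base_url}/api/v1/query?{params}"


def five_xx_rate(five_xx_spans: int, total_spans: int) -> float:
    """5xx spans as a percentage of total spans; 0.0 when there is no traffic."""
    if total_spans == 0:
        return 0.0
    return five_xx_spans / total_spans * 100
```

The zero-traffic guard is a design choice for this sketch; a real card might prefer to show "no data" rather than 0%.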

Worked example

A US homewares brand on Shopify with Datadog APM. Steady-state 5xx rate sat at 0.05% (mostly transient 503 from auto-scaling events). On 21 Apr 26 at 16:30 EST a payment-processor outage at Stripe’s PSP layer caused upstream 502 responses to cascade into the brand’s checkout service.

Time (UTC, EST+5)	Total spans	5xx spans	5xx %	Top status	What was happening
21:00 (16:00 EST)	145,200	78	0.05%	503 (capacity)	Steady-state
21:30 (16:30 EST)	148,100	920	0.62%	502 (bad gateway)	Stripe PSP began returning 502
21:45	152,400	4,650	3.05%	502	Cascade through checkout service
22:00	146,800	8,820	6.01%	502, 504	Peak; payment fully unavailable
22:30	149,200	3,150	2.11%	502	Stripe began recovering
23:00	147,500	410	0.28%	503	Resolved
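
The percentages in the table can be reproduced from the span counts. The snippet below is a small check under the assumption that the displayed value is simply 5xx spans over total spans, rounded to two decimals; `ROWS`, `rate_pct`, and `peak_row` are illustrative names.

```python
# (UTC time, total spans, 5xx spans) copied from the worked-example table.
ROWS = [
    ("21:00", 145_200, 78),
    ("21:30", 148_100, 920),
    ("21:45", 152_400, 4_650),
    ("22:00", 146_800, 8_820),
    ("22:30", 149_200, 3_150),
    ("23:00", 147_500, 410),
]


def rate_pct(five_xx: int, total: int) -> float:
    """5xx share of total spans, in percent, rounded to two decimals."""
    return round(five_xx / total * 100, 2)


def peak_row(rows):
    """Return the row with the highest 5xx share."""
    return max(rows, key=lambda row: row[2] / row[1])


# Rate per timestamp, plus the timestamps above the 1% alert trigger.
rates = {time: rate_pct(errors, total) for time, total, errors in ROWS}
above_trigger = [time for time, rate in rates.items() if rate > 1.0]
```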

Apdex during this window dropped from 0.94 to 0.78. Conversion rate dropped from 1.92% to 0.43% (a 78% relative collapse). Checkout success rate dropped from 96% to 47%. Note that the broader Error Rate would have shown roughly the same shape because all of these failures are 5xx; in this incident, 5xx and Error Rate tell the same story.
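
The relative drops quoted above are simple before/after ratios. A minimal helper, with the hypothetical name `relative_drop_pct`, makes the arithmetic explicit:

```python
def relative_drop_pct(before: float, after: float) -> float:
    """Relative decline from `before` to `after`, as a percentage of `before`."""
    return (before - after) / before * 100


# Figures from the worked example.
conversion_drop = relative_drop_pct(1.92, 0.43)  # conversion rate, in percent
checkout_drop = relative_drop_pct(96, 47)        # checkout success rate, in percent
```

The conversion figure comes out at roughly 78%, and checkout success at roughly 51%.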

Revenue impact (estimated):
  - 90 minutes above 1% 5xx
  - Baseline orders/min during peak: 14
  - Observed orders/min during incident: 5
  - Lost orders ≈ (14 − 5) orders/min × 90 min = 810 orders
  - At AOV $74: lost revenue ≈ $59,940
  - Stripe issued post-mortem credits but did not refund the merchant for lost orders
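
The estimate above reduces to two multiplications. The sketch below assumes a constant order shortfall for the whole incident, which is a simplification; `estimate_lost_revenue` is an illustrative name.

```python
def estimate_lost_revenue(baseline_orders_per_min: float,
                          observed_orders_per_min: float,
                          incident_minutes: float,
                          average_order_value: float):
    """Return (lost_orders, lost_revenue) assuming a constant shortfall."""
    lost_orders = (baseline_orders_per_min - observed_orders_per_min) * incident_minutes
    return lost_orders, lost_orders * average_order_value


# Worked-example inputs: 14 vs 5 orders/min, 90 minutes, AOV $74.
lost_orders, lost_revenue = estimate_lost_revenue(14, 5, 90, 74)
```

With the worked-example inputs this gives 810 lost orders and $59,940 in lost revenue.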

Three takeaways merchants should remember:

5xx rate above 1% is always a paging event. Unlike latency or error rate, 5xx represents unambiguous server failure. There is no “subjective” interpretation; the response code is in the standard. The 1% threshold catches real problems while staying above the steady-state noise floor of healthy sites (typically 0.05-0.1%).
Most 5xx incidents are upstream, not your code. This brand’s checkout service did not have a bug; the bug was at Stripe’s PSP. Modern ecommerce stacks have many third-party dependencies (payment, fraud, fulfilment, search-as-a-service); when any one returns 5xx, your service surfaces it as 502 or 504. The fix is rarely “fix our code” and often “switch fallback paths” or “wait for upstream to recover”.
Pair 5xx rate with Error Rate to triangulate cause. If 5xx is up and Error Rate is up by similar amounts, the failures are real server breakage. If Error Rate is up but 5xx is flat, the failures are 4xx-marked-as-error (often auth or rate-limit). The diagnosis path is different: 5xx investigation goes upstream; 4xx investigation goes to user-facing logic.
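
The third takeaway can be expressed as a small decision rule. This is a sketch only: the function name, the 0.1 percentage-point "flat" band, and the 25% tolerance for "similar amounts" are assumptions chosen for illustration, not values used by the card.

```python
def triangulate(five_xx_delta_pp: float, error_rate_delta_pp: float,
                flat_pp: float = 0.1, similar_fraction: float = 0.25) -> str:
    """Compare increases (in percentage points) of 5xx Rate and Error Rate."""
    five_xx_up = five_xx_delta_pp > flat_pp
    error_up = error_rate_delta_pp > flat_pp
    if not five_xx_up and not error_up:
        return "stable"
    if error_up and not five_xx_up:
        # Error Rate moved but 5xx did not: 4xx responses tagged as errors.
        return "4xx_marked_as_error"
    gap = abs(error_rate_delta_pp - five_xx_delta_pp)
    if gap <= similar_fraction * five_xx_delta_pp:
        # Both moved by a similar amount: real server breakage.
        return "server_breakage"
    return "mixed"
```

A "mixed" result suggests both effects are present and each needs its own investigation path.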

Sibling cards merchants should reference together

Card	Why pair it with 5xx Rate	What the combination tells you
Error Rate	The broader peer (includes 4xx-marked-as-error).	5xx up plus broader Error Rate up by same amount equals real server breakage; only Error Rate up equals 4xx instrumentation choices.
Errors by Endpoint	The breakdown view: which endpoints serve the 5xx.	One endpoint dominating equals isolated bug; many endpoints equals upstream cascade.
Top Error Messages	The triage view.	Same exception dominating equals single root cause; varied messages equals broader infrastructure issue.
Apdex Score	Apdex penalises 5xx as failed satisfaction.	Apdex dropping more than 5xx predicts means latency is also degraded (cascading).
Cart Abandonment During 5xx Spikes	The merchant-impact card.	Quantifies the abandonment cost while 5xx is elevated.
Conversion Drop During Incidents	The post-incident measured-loss card.	Pair with this card to compute the financial cost of a 5xx event.
Stripe Auth Rate / PayPal Capture Success	Payment gateway peer; many 5xx incidents are upstream payment-PSP outages.	5xx up plus Stripe auth rate down equals payment-processor cascade; 5xx up plus payment OK equals your code or a different upstream.
Shopify / BC / Adobe Total Revenue	The downstream impact metric.	A sustained 5xx event above 1% should produce a measurable revenue dip within 15 minutes.

Reconciling against the vendor’s own dashboard

Where to look in Datadog:

APM → Service List filtered by HTTP status code 5xx. APM → Traces for individual 5xx trace examples. Dashboards → APM Overview for the time-series view. Logs → Live Tail filtered by status:error for the request bodies (gated by Log Management).

Why our number may legitimately differ from Datadog’s UI:

Reason	Direction	Why
Time zone	Boundary days off	Datadog UI displays in account timezone; Vortex IQ uses UTC.
API rate limits	Brief gaps	The Metrics query API is rate-limited; cached values may be 1-2 minutes stale on burst minutes.
Log indexing latency	Not applicable	5xx rate is APM-derived.
Span sampling	Both directions	Head-based sampling reduces absolute counts but the rate is unbiased; tail-based sampling that prefers errors will inflate the rate.
Status-code tagging	Vortex IQ stricter	Some teams tag synthetic 503s during graceful-shutdown as `error:false` to suppress noise; the strict 5xx query still counts them.
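
The span-sampling row can be checked with a little arithmetic. The sketch below computes the expected 5xx rate after sampling, given separate keep probabilities for error and non-error spans; the function name and the example traffic figures are hypothetical.

```python
def observed_rate_pct(total_spans: int, five_xx_spans: int,
                      keep_ok: float, keep_error: float) -> float:
    """Expected 5xx rate (percent) after sampling with per-class keep probabilities."""
    kept_errors = five_xx_spans * keep_error
    kept_ok = (total_spans - five_xx_spans) * keep_ok
    return kept_errors / (kept_errors + kept_ok) * 100


# Hypothetical traffic: 100,000 spans, 1,000 of them 5xx (a true rate of 1%).
head_based = observed_rate_pct(100_000, 1_000, keep_ok=0.1, keep_error=0.1)
tail_based = observed_rate_pct(100_000, 1_000, keep_ok=0.1, keep_error=1.0)
```

Uniform head-based sampling leaves the rate at about 1%, while keeping every error span and only 10% of the rest pushes the observed rate to roughly 9%.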

Cross-connector reconciliation:

Card	Expected relationship	What causes the divergence
Datadog logs	Subset relationship: error logs are typically a subset of 5xx spans (each 5xx span produces zero or one log entry).	Log rate higher than span rate equals workers/services not covered by APM emitting error logs; log rate lower than span rate equals normal.
`stripe.stripe_payment_health_score` / `paypal.pp_payment_health_score`	Payment-PSP cascade peer.	When 5xx spikes on the checkout service AND payment-health drops simultaneously, the cause is upstream PSP outage.
`shopify.total_revenue` / `bigcommerce.total_revenue` / `adobe_commerce.total_revenue`	Inverse: 5xx above 1% sustained equals revenue drop within 15 minutes.	Revenue drop without 5xx spike equals latency-driven or RUM-side issue; 5xx spike without revenue drop equals 5xx is on non-customer paths.
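
The divergence cases in the table above can be summarised as a lookup. This is an illustrative sketch: the function name and the boolean inputs are hypothetical, and how a "spike" or "drop" is detected is left to the caller.

```python
def reconcile(five_xx_spike: bool, revenue_drop: bool,
              payment_health_drop: bool = False) -> str:
    """Suggest a likely explanation from three cross-connector signals."""
    if five_xx_spike and payment_health_drop:
        return "upstream PSP outage"
    if five_xx_spike and revenue_drop:
        return "customer-facing 5xx incident"
    if five_xx_spike:
        return "5xx on non-customer paths"
    if revenue_drop:
        return "latency-driven or RUM-side issue"
    return "no divergence"
```

The payment-health check runs first because a PSP cascade usually also produces a revenue drop, and the more specific explanation is the more useful one.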

Known limitations / merchant FAQs

My broader Error Rate is amber but 5xx Rate is green. What gives?
Your team has instrumented some 4xx responses as `error:true`. Common cases: 401 (auth failure), 403 (permission denied), 429 (rate-limited). These are surfaced as errors in the broader rate but are not server breakage; they are user-facing logic. The 5xx Rate filters strictly on HTTP 5xx codes, giving you the “is the server actually broken” view. Both numbers are useful: broader Error Rate for “how many shoppers are seeing a problem”, 5xx Rate for “how many of those problems are server-side”.

What is the difference between 502, 503, 504, and 500?
500 (Internal Server Error): your application threw an unhandled exception. 502 (Bad Gateway): an upstream service is unreachable; you sent a request and the upstream did not respond properly. 503 (Service Unavailable): the server is up but explicitly refusing to serve (often rate-limit, capacity, or maintenance mode). 504 (Gateway Timeout): an upstream service responded but too slowly. The status code tells you where the problem is: 500 means in your code; 502/504 mean upstream; 503 often means capacity.

Why is my 5xx rate elevated only on weekends?
Most likely cause: weekend auto-scaling configuration. Many merchants reduce overnight/weekend capacity to save cost, which means a single traffic spike (a viral social post, a friend-of-a-friend share) can push the smaller fleet into 503 territory. Solution: tune the auto-scaling minimum or set a different overnight threshold. Check Container Restart Storm for confirmation.

Datadog says 5xx rate is fine but customers are reporting “Server Error” pages.
Three causes: (1) The 5xx is on a code path Datadog is not instrumenting (a third-party widget that loads and fails on the page); (2) The 5xx is being served by your CDN before reaching the origin (Cloudflare, Fastly, AWS CloudFront), and the CDN is not feeding into Datadog; (3) The customer is hitting a stale cached error page from earlier today. Run Critical-Path Tests Status to check the synthetic-test view of customer journeys.

My 5xx rate spiked for 30 seconds and then went back to normal. Should I investigate?
Probably not. 30-second spikes are usually deploy-related (graceful-shutdown of old containers while new ones come up) or auto-scaling-related (brief 503 while a new replica warms up). Check the timeline against your deploy markers. If the spike coincides with a deploy, the spike is expected; if it does not, investigate.

My Logs API returns 400 No valid indexes. Does this card still work?
Yes. 5xx Rate is APM-derived. The Vortex IQ engine logs the gating event once at INFO level and continues serving APM-derived cards normally. Log Management gating only affects log-volume cards.

Why does my 5xx rate include health-check 503s?
By default, it does not. The pre-filtering excludes paths matching `/health`, `/ping`, `/metrics`, `/status`, `/livez`, `/readyz`. If your team uses a non-standard health-check path, add it to the connector’s exclusion config. A common pitfall: a custom load-balancer health-check at `/_lb-health` that returns 503 during graceful-shutdown will pollute the rate unless excluded.

Datadog’s Service Catalog shows my 5xx by service; this card aggregates. Why?
The aggregate is the right merchant headline because shoppers do not care which service returned the 5xx; they care that something did. For per-service breakdown use Errors by Endpoint. The aggregate is also stable: a single misbehaving service does not look like a fleet-wide outage.

My team uses the Datadog `error:true` span tag, not status codes. Are 5xxs still counted?
Yes. The query is `http.status_code:5*`, which reads the response code regardless of whether the span is also tagged `error:true`. Some teams set `error:false` on graceful-shutdown 503s to suppress alert noise; the strict 5xx Rate query still counts them, which is correct (a real 503 to a real shopper still failed their request).

Why is my 5xx baseline 0.05% rather than 0%? Is the site always slightly broken?
A small steady-state 5xx rate (0.01-0.1%) is normal and expected. Common sources: (1) Brief 503s during deploys and auto-scaling, (2) Connections dropped by intermediate proxies, (3) Bot traffic hitting non-existent paths in malicious-scanner patterns, (4) Tail-end retries that hit a service mid-restart. The 1% alert threshold is calibrated above this floor.
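
To illustrate the health-check FAQ above, the sketch below shows how an unexcluded health-check path can distort the rate. It is not the connector's filter: the matching rule (exact path or sub-path), the function names, and the `(path, status_code)` input shape are assumptions.

```python
# Default exclusions listed in the FAQ above.
DEFAULT_EXCLUDED_PATHS = ("/health", "/ping", "/metrics", "/status", "/livez", "/readyz")


def is_excluded(path: str, extra_paths=()) -> bool:
    """True when the path equals an excluded path or sits beneath it."""
    prefixes = DEFAULT_EXCLUDED_PATHS + tuple(extra_paths)
    return any(path == p or path.startswith(p + "/") for p in prefixes)


def five_xx_rate_excluding_health(spans, extra_paths=()) -> float:
    """5xx rate in percent over (path, status_code) pairs, health checks removed."""
    kept = [(path, status) for path, status in spans if not is_excluded(path, extra_paths)]
    if not kept:
        return 0.0
    errors = sum(1 for _, status in kept if 500 <= status <= 599)
    return errors / len(kept) * 100


# Hypothetical traffic: 100 checkout spans (2 failing) plus 10 load-balancer 503s.
spans = [("/checkout", 200)] * 98 + [("/checkout", 502)] * 2 + [("/_lb-health", 503)] * 10
polluted = five_xx_rate_excluding_health(spans)
clean = five_xx_rate_excluding_health(spans, extra_paths=("/_lb-health",))
```

In this made-up sample the custom `/_lb-health` path lifts the rate from 2% to about 11% until it is added to the exclusion list.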

Tracked live in Vortex IQ Nerve Centre

5xx Response Rate is one of hundreds of KPI pulses Vortex IQ tracks across Datadog and 70+ other ecommerce connectors. Nerve Centre runs the detection layer; Vortex Mind investigates the cause when something moves; Ask Viq lets you interrogate any number in plain English. Start for free or book a demo to see this metric running on your own data.
