What this audit checks
Authentication & access
- Auth token valid (GET on /api/0/organizations/{organization}/)
- Token carries org:read, project:read, event:read scopes
- Organization slug resolves and is not pending deletion
- Host correct (sentry.io SaaS vs self-hosted)
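The scope and status checks above can be sketched as two small helpers. This is a minimal sketch, not the audit's actual implementation: the function names are hypothetical, the set of granted scopes is assumed to be obtained elsewhere, and the mapping from HTTP status codes to findings is an assumption based on common HTTP semantics.

```python
# Scopes the audit expects the token to carry (from the checklist above).
REQUIRED_SCOPES = {"org:read", "project:read", "event:read"}


def missing_scopes(granted):
    """Return the required scopes that are absent from `granted`, sorted."""
    return sorted(REQUIRED_SCOPES - set(granted))


def classify_auth_status(status_code):
    """Map the status of the org sanity request to a finding.

    Assumption: 401 means a bad token, 403 a permission gap, and 404 an
    unknown organization slug or the wrong host.
    """
    findings = {
        200: "ok",
        401: "invalid_token",
        403: "insufficient_permissions",
        404: "org_not_found",
    }
    return findings.get(status_code, "unexpected")
```

For example, `missing_scopes(["org:read", "event:read"])` reports that `project:read` is missing, which would explain an empty project inventory later in the audit.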
Alert coverage (the blind-spot test)
- Projects with no metric-alert rule (error rate / latency unmonitored)
- Alert rules with no action / notification target (fires silently)
- Active projects with zero events in 24h (lost instrumentation)
- Unresolved issues ignored / muted but still recurring after 7d
- Disabled projects still receiving events (misrouted DSN)
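Three of the blind-spot checks above reduce to set operations over the project and alert-rule inventories. The sketch below assumes simplified records: the field names (`slug`, `active`, `projects`, `actions`, `name`) and the `event_counts_24h` mapping are hypothetical, and real API payloads would need to be mapped onto this shape first.

```python
def coverage_gaps(projects, alert_rules, event_counts_24h):
    """Find projects and rules that would leave the team blind.

    projects:         list of {"slug": str, "active": bool}
    alert_rules:      list of {"name": str, "projects": [slug], "actions": [...]}
    event_counts_24h: {slug: event count over the last 24 hours}
    """
    # Every project slug that at least one metric-alert rule watches.
    covered = {slug for rule in alert_rules for slug in rule.get("projects", [])}

    return {
        # Projects with no metric-alert rule at all.
        "no_alert_rule": sorted(
            p["slug"] for p in projects if p["slug"] not in covered
        ),
        # Rules that can fire but notify nobody.
        "silent_rules": sorted(
            r["name"] for r in alert_rules if not r.get("actions")
        ),
        # Active projects that sent nothing in 24h.
        "no_events_24h": sorted(
            p["slug"]
            for p in projects
            if p.get("active") and event_counts_24h.get(p["slug"], 0) == 0
        ),
    }
```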
Reliability & performance
- Error rate above 2% sustained (release or dependency regression)
- Apdex below 0.85 (users feeling the slowness)
- p95 latency above 1500ms / p99 above 3000ms
- Throughput dropped > 30% WoW (capacity / outage signal)
- Crash-free session rate below SLA target (release health)
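As an illustration of two of the signals above, here is a minimal sketch of the standard Apdex formula and of a week-over-week throughput change. The helper names and the satisfaction threshold `threshold_ms` are assumptions; the audit itself reads these aggregates from the events endpoints rather than computing them from raw durations.

```python
def apdex(durations_ms, threshold_ms):
    """Standard Apdex: (satisfied + tolerating / 2) / total.

    A request is satisfied at or below T, tolerating at or below 4T,
    and frustrated otherwise. Returns None when there are no samples.
    """
    if not durations_ms:
        return None
    satisfied = sum(1 for d in durations_ms if d <= threshold_ms)
    tolerating = sum(
        1 for d in durations_ms if threshold_ms < d <= 4 * threshold_ms
    )
    return (satisfied + tolerating / 2) / len(durations_ms)


def throughput_change_pct(current, previous):
    """Percent change versus the previous week; None if there is no baseline."""
    if previous == 0:
        return None
    return (current - previous) * 100 / previous
```

With a 300 ms threshold, the durations `[100, 200, 500, 2000]` contain two satisfied requests, one tolerating and one frustrated, giving an Apdex of 0.625, well below the 0.85 floor.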
Incident response & SLA
- Mean time to acknowledge above 30 min (page fatigue)
- Mean time to resolve above the SLA window (60 min default)
- Open incidents older than the SLA breach window
- Alert concentration across top alerting services (one project producing most of the noise)
- Incident re-open rate (resolved too early, reopened within 24h)
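MTTA and MTTR are both means of a time difference measured from the moment an incident opens. The sketch below assumes each incident is a dict with ISO 8601 timestamps under the hypothetical keys `opened_at`, `acknowledged_at` and `resolved_at`; the real incident payload may name or format these differently.

```python
from datetime import datetime


def _minutes_between(start_iso, end_iso):
    """Minutes elapsed between two ISO 8601 timestamps."""
    start = datetime.fromisoformat(start_iso)
    end = datetime.fromisoformat(end_iso)
    return (end - start).total_seconds() / 60


def mean_minutes(incidents, end_field):
    """Mean minutes from `opened_at` to `end_field`.

    Incidents where `end_field` is missing or empty (not yet acknowledged
    or resolved) are skipped. Returns None if nothing qualifies.
    """
    durations = [
        _minutes_between(i["opened_at"], i[end_field])
        for i in incidents
        if i.get(end_field)
    ]
    return sum(durations) / len(durations) if durations else None
```

Calling `mean_minutes(incidents, "acknowledged_at")` gives MTTA and `mean_minutes(incidents, "resolved_at")` gives MTTR, which can then be compared against the 30 and 60 minute limits.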
Error quality & noise
- Top error types by occurrence count (fix-this-first ranking)
- New error types in last 24h not seen in prior 7d (regression)
- Fatal-level issue volume above baseline
- Single error class exceeding 1000 events (runaway)
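The regression and runaway checks above come down to a set difference and a sorted count. A minimal sketch, assuming error types are plain strings and counts arrive as a dict; both helper names are hypothetical:

```python
def new_error_types(last_24h, prior_7d):
    """Error types seen in the last 24h that never appeared in the prior 7d."""
    return sorted(set(last_24h) - set(prior_7d))


def rank_error_types(counts, runaway_threshold=1000):
    """Rank error types by occurrence count and flag runaway classes.

    Returns (ranked, runaway): `ranked` is a list of (type, count) pairs,
    highest count first; `runaway` lists the types whose count exceeds
    `runaway_threshold`.
    """
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    runaway = [name for name, count in ranked if count > runaway_threshold]
    return ranked, runaway
```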
Cross-channel: revenue-at-risk (the killer area)
- Open incident on a project that maps to a commerce-checkout service = compute $/min lost (commerce.revenue_per_min × open_incident_minutes × estimated_traffic_loss_pct)
- Error-rate spike on checkout-adjacent project during peak hours
- 5xx / error spike during a campaign push (sibling = google_ads / amazon_ads / klaviyo) - paying for traffic that can’t convert
- Conversion drop during incident windows (vs 90D baseline)
- Cart abandonment spike correlated with elevated error rate
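The $/min calculation in the first item is a straight product of three inputs. The sketch below assumes `estimated_traffic_loss_pct` is expressed on a 0-100 scale (so 25 means a quarter of traffic is lost); if the source stores it as a fraction, the division by 100 should be dropped.

```python
def revenue_at_risk(revenue_per_min, open_incident_minutes,
                    estimated_traffic_loss_pct):
    """Estimated revenue lost while an incident stays open.

    revenue_per_min:            commerce.revenue_per_min for the mapped service
    open_incident_minutes:      how long the incident has been open
    estimated_traffic_loss_pct: share of traffic assumed lost, 0-100 (assumption)
    """
    loss_fraction = estimated_traffic_loss_pct / 100
    return revenue_per_min * open_incident_minutes * loss_fraction
```

For example, a checkout service earning 200 per minute with an incident open for 45 minutes and an estimated 25% traffic loss yields 200 × 45 × 0.25 = 2250 at risk.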
Severity thresholds
| Signal | Warn | Critical |
|---|---|---|
| error_rate_pct | 1 | 2 |
| apdex | 0.9 | 0.85 |
| p95_latency_ms | 1000 | 1500 |
| p99_latency_ms | 2000 | 3000 |
| throughput_change_pct_wow | -15 | -30 |
| crash_free_rate_pct | 99.9 | 99.5 |
| mtta_minutes | 15 | 30 |
| mttr_minutes | 30 | 60 |
| incidents_open_count | 1 | 3 |
| services_degraded_count | 1 | 2 |
| services_down_count | 0 | 1 |
| projects_no_alert_rule_count | 1 | 3 |
| projects_no_events_24h_count | 1 | 3 |
| top_error_class_max_count | 100 | 1000 |
| new_error_types_24h_count | 1 | 5 |
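One way to apply the table is a single classifier that infers each signal's direction from the order of its two thresholds. This sketch rests on two assumptions: a signal whose critical value is lower than its warn value (apdex, throughput change, crash-free rate) gets worse as it falls, and comparisons are strict, following the "above" / "below" wording of the checks. An inclusive comparison would be an equally plausible reading.

```python
# (warn, critical) pairs copied from the severity table.
THRESHOLDS = {
    "error_rate_pct": (1, 2),
    "apdex": (0.9, 0.85),
    "p95_latency_ms": (1000, 1500),
    "p99_latency_ms": (2000, 3000),
    "throughput_change_pct_wow": (-15, -30),
    "crash_free_rate_pct": (99.9, 99.5),
    "mtta_minutes": (15, 30),
    "mttr_minutes": (30, 60),
    "incidents_open_count": (1, 3),
    "services_degraded_count": (1, 2),
    "services_down_count": (0, 1),
    "projects_no_alert_rule_count": (1, 3),
    "projects_no_events_24h_count": (1, 3),
    "top_error_class_max_count": (100, 1000),
    "new_error_types_24h_count": (1, 5),
}


def classify(signal, value):
    """Return "critical", "warn" or "ok" for one signal reading."""
    warn, critical = THRESHOLDS[signal]
    if critical < warn:
        # Lower is worse (e.g. apdex): breach means falling below the limit.
        if value < critical:
            return "critical"
        if value < warn:
            return "warn"
    else:
        # Higher is worse (e.g. latency): breach means rising above the limit.
        if value > critical:
            return "critical"
        if value > warn:
            return "warn"
    return "ok"
```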
Data sources
- GET {host}/api/0/organizations/{organization}/ - Auth + org sanity
- GET {host}/api/0/organizations/{organization}/projects/ - Project inventory + status + health
- GET {host}/api/0/organizations/{organization}/events/ - Apdex / latency / error-rate aggregates
- GET {host}/api/0/organizations/{organization}/events-stats/ - Time-series throughput + error-rate trend
- GET {host}/api/0/organizations/{organization}/issues/ - Top error types + new/fatal issue detection
- GET {host}/api/0/organizations/{organization}/incidents/ - Incident inventory, MTTA/MTTR, revenue-at-risk join
- GET {host}/api/0/organizations/{organization}/alert-rules/ - Metric-alert coverage + firing/acknowledged state
- GET {host}/api/0/organizations/{organization}/sessions/ - Release health / crash-free rate for service health + SLA
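All of the data sources share one organization-scoped prefix, so the full URL set can be derived from the host and the organization slug. A minimal sketch: the short keys in `ENDPOINT_SUFFIXES` and the function name are hypothetical labels, and no request is made here.

```python
# Path suffixes under /api/0/organizations/{organization}/, keyed by purpose.
ENDPOINT_SUFFIXES = {
    "org": "",
    "projects": "projects/",
    "events": "events/",
    "events_stats": "events-stats/",
    "issues": "issues/",
    "incidents": "incidents/",
    "alert_rules": "alert-rules/",
    "sessions": "sessions/",
}


def endpoint_urls(host, organization):
    """Build every audit URL for a host (SaaS or self-hosted) and org slug."""
    # Tolerate a trailing slash on the configured host.
    base = f"{host.rstrip('/')}/api/0/organizations/{organization}/"
    return {name: base + suffix for name, suffix in ENDPOINT_SUFFIXES.items()}
```

For instance, `endpoint_urls("https://sentry.io", "acme")["alert_rules"]` evaluates to `https://sentry.io/api/0/organizations/acme/alert-rules/`; a self-hosted install would pass its own base URL as `host`.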