Elasticsearch-specific health audit. It answers six questions: (1) is auth scoped correctly and is the cluster reachable over TLS; (2) is the cluster reachable and is its status green rather than yellow or red; (3) is search latency p95 within budget and are slow searches contained; (4) are replicas assigned and is sync lag bounded, with no unassigned or stuck-relocating shards; (5) is storage below the flood-stage watermark and JVM heap below the level that causes GC pressure; (6) are snapshots running and recent enough to restore from. The cross-channel area joins search QPS, pool saturation and product-index doc counts to traffic and catalog data from the sibling commerce channel to size revenue at risk.

What this audit checks

Authentication & access

  • Cluster URL uses HTTPS (port 9243 Elastic Cloud or TLS-fronted 9200) - no plaintext credentials in transit
  • Credentials authenticate via basic auth or API key; if API key present it is preferred over user+password for service access
  • Monitoring role grants cluster:monitor/* and indices:monitor/* so stats endpoints return without 403
  • Default index pattern (database_name) scopes index-level stats correctly; the fork is detected (Elasticsearch vs OpenSearch, including AWS IAM request signing)
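
The credential preference above can be sketched as a small helper. This is a minimal illustration, not the audit's own code: the function names `build_auth_header` and `is_tls_url` are hypothetical, and the only external facts assumed are the standard `Authorization: ApiKey <base64(id:api_key)>` and `Authorization: Basic <base64(user:password)>` header formats.

```python
import base64


def build_auth_header(api_key_id=None, api_key_secret=None,
                      username=None, password=None):
    """Return an Authorization header, preferring an API key over basic auth."""
    if api_key_id and api_key_secret:
        # Elasticsearch API keys are sent as base64("id:api_key").
        token = base64.b64encode(f"{api_key_id}:{api_key_secret}".encode()).decode()
        return {"Authorization": f"ApiKey {token}"}
    if username and password:
        token = base64.b64encode(f"{username}:{password}".encode()).decode()
        return {"Authorization": f"Basic {token}"}
    raise ValueError("no usable credentials supplied")


def is_tls_url(url):
    """True when the cluster URL uses HTTPS, so credentials are not sent in plaintext."""
    return url.lower().startswith("https://")
```

When both credential types are configured, the helper returns the API key header, matching the stated preference for service access.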

Connection & availability

  • GET /_cluster/health responds within timeout from the coordinating node
  • Cluster status is green (yellow = replicas missing, red = primary unallocated and data unavailable)
  • Active node count matches expected (a drop signals a lost node)
  • Pending cluster tasks from /_cluster/pending_tasks not backing up (master overload)
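
As a rough sketch, the availability checks above can be evaluated from the parsed JSON body of GET /_cluster/health. The function name, the `expected_nodes` parameter and the `max_pending_tasks` default are assumptions for illustration; the field names (`status`, `number_of_nodes`, `number_of_pending_tasks`) are those returned by the cluster health API, which also reports a pending-task count alongside /_cluster/pending_tasks.

```python
def evaluate_cluster_health(health, expected_nodes, max_pending_tasks=0):
    """Return a list of (severity, message) findings for a cluster health response."""
    findings = []
    status = health.get("status")
    if status == "red":
        findings.append(("critical", "status red: a primary shard is unallocated"))
    elif status == "yellow":
        findings.append(("warn", "status yellow: one or more replicas are unassigned"))
    elif status != "green":
        findings.append(("critical", f"unexpected status: {status!r}"))

    nodes = health.get("number_of_nodes", 0)
    if nodes < expected_nodes:
        findings.append(("warn", f"{nodes} of {expected_nodes} expected nodes active"))

    pending = health.get("number_of_pending_tasks", 0)
    if pending > max_pending_tasks:
        findings.append(("warn", f"{pending} pending cluster tasks"))
    return findings
```

An empty list means every availability check passed for that sample.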

Query performance

  • Search latency p95 below 200ms baseline (from indices.search.query_time_in_millis / query_total delta)
  • Search latency p99 below 500ms
  • Slow-query rate below 5% of total searches against the slowlog threshold (default 1s)
  • Top 10 slow searches captured with normalised query DSL shape and target index for tuning
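
The counter-delta calculation mentioned above might look like the following sketch. It takes two samples of the `indices.search` section of the stats API; the function names are hypothetical. Note that dividing the time delta by the count delta gives the mean latency over the sampling interval, so how the audit derives p95 and p99 from these counters is not shown here.

```python
def mean_query_latency_ms(prev, curr):
    """Mean search latency between two samples of indices.search counters.

    Returns None when no queries ran or a counter went backwards
    (for example after a node restart).
    """
    d_time = curr["query_time_in_millis"] - prev["query_time_in_millis"]
    d_count = curr["query_total"] - prev["query_total"]
    if d_count <= 0 or d_time < 0:
        return None
    return d_time / d_count


def slow_query_rate_pct(slow_count, total_count):
    """Share of searches that exceeded the slowlog threshold, as a percentage."""
    if total_count <= 0:
        return 0.0
    return 100.0 * slow_count / total_count
```

For example, 3000 ms of additional query time across 20 additional queries gives a mean of 150 ms for the interval.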

Replication & shards

  • Unassigned shard count from /_cluster/health is 0 (any unassigned shard = replica data-loss risk)
  • No more than 5 initializing or relocating shards sustained for over 10m (more indicates stuck recovery or relocation)
  • Replica sync lag below 10s
  • Shard size skew below 25% across nodes (no hot shard); total primary+replica shard count within plan
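
One way to compute a skew figure from /_cat/shards is sketched below. The skew definition is an assumption, since the audit does not state its formula: here it is how far the heaviest node's total shard size sits above the per-node mean, as a percentage. The input is assumed to be the JSON rows from `_cat/shards?format=json&bytes=b`, where `store` is a byte count given as a string and unassigned shards have no node.

```python
from collections import defaultdict


def node_size_skew_pct(shard_rows):
    """Percentage by which the largest node's shard bytes exceed the per-node mean."""
    totals = defaultdict(int)
    for row in shard_rows:
        # Skip unassigned or still-initializing shards, which have no usable size.
        if row.get("state") != "STARTED" or not row.get("node") or row.get("store") is None:
            continue
        totals[row["node"]] += int(row["store"])
    if not totals:
        return None
    mean = sum(totals.values()) / len(totals)
    if mean == 0:
        return 0.0
    return (max(totals.values()) - mean) / mean * 100.0
```

Under this definition, two nodes holding 200 and 100 bytes give a skew of about 33%, which would breach the 25% limit.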

Storage & capacity

  • Disk usage below the flood-stage watermark (default 95%; warn approaching 90%) - hitting it makes indexes read-only
  • JVM heap used below 75% (above triggers GC pressure and circuit breakers; >90% risks node OOM)
  • GC pause time below 1000ms in a 5m window; circuit breaker trips over 24h are 0
  • Bulk write rejections (thread_pool.write.rejected) over 24h are 0 and HTTP connection pool saturation is below 90%
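
A minimal sketch of the per-node capacity checks, reading the parsed response of GET /_nodes/stats. The function name and return shape are illustrative assumptions; the field paths (`jvm.mem.heap_used_percent`, `fs.total`, `thread_pool.write.rejected`) come from the nodes stats API. The rejection counter is cumulative since node start, so a true 24h figure would need the difference between two samples.

```python
def node_capacity_findings(nodes_stats, heap_limit_pct=75,
                           disk_warn_pct=90, disk_flood_pct=95):
    """Return (node_name, signal, value) tuples for nodes breaching capacity limits."""
    findings = []
    for node_id, node in nodes_stats.get("nodes", {}).items():
        name = node.get("name", node_id)

        heap_pct = node["jvm"]["mem"]["heap_used_percent"]
        if heap_pct >= heap_limit_pct:
            findings.append((name, "jvm_heap", heap_pct))

        fs_total = node["fs"]["total"]
        used = 1 - fs_total["available_in_bytes"] / fs_total["total_in_bytes"]
        disk_pct = round(100.0 * used, 1)
        if disk_pct >= disk_flood_pct:
            findings.append((name, "disk_flood_stage", disk_pct))
        elif disk_pct >= disk_warn_pct:
            findings.append((name, "disk_warn", disk_pct))

        # Cumulative since node start; diff two samples for a 24h count.
        rejected = node["thread_pool"]["write"]["rejected"]
        if rejected > 0:
            findings.append((name, "write_rejected_since_start", rejected))
    return findings
```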

Backups & durability

  • A snapshot repository is registered and reachable
  • Last successful _snapshot run is under 72h old (from /_snapshot/_status)
  • Most recent snapshot completed with state SUCCESS (no PARTIAL / FAILED shards)
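
The freshness and state checks above could be combined as in this sketch. It assumes a list of snapshot records carrying `state` and `end_time_in_millis`, the field names used in get-snapshot API responses; the exact endpoint and fields the audit reads are not confirmed here, and the function name is hypothetical.

```python
import time


def snapshot_freshness(snapshots, now_ms=None, max_age_hours=72):
    """Check that the newest snapshot succeeded and the last success is recent."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    if not snapshots:
        return {"latest_state": None, "last_success_age_h": None, "ok": False}

    latest = max(snapshots, key=lambda s: s.get("end_time_in_millis", 0))
    success_times = [s.get("end_time_in_millis", 0)
                     for s in snapshots if s.get("state") == "SUCCESS"]
    age_h = None
    if success_times:
        age_h = (now_ms - max(success_times)) / 3_600_000  # ms per hour

    ok = (latest.get("state") == "SUCCESS"
          and age_h is not None and age_h < max_age_hours)
    return {"latest_state": latest.get("state"), "last_success_age_h": age_h, "ok": ok}
```

The result fails if the most recent snapshot is PARTIAL or FAILED, even when an older successful snapshot is still inside the 72h window.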

Cross-channel: revenue at risk

  • Search QPS spike with no matching ecom traffic spike (sibling = bigcommerce/shopify.sessions_per_15m) flags bot crawler load
  • Search-thread pool saturation > 90% during an ecom order burst (sibling = bigcommerce/shopify.order time-bucketed to 15m)
  • Product-index doc count drift > 100 vs sibling catalog (bigcommerce/shopify.product COUNT) signals broken product-sync to search
  • Slow searches co-occurring with a checkout-completion drop > 5pp in the same 5m window (sibling = bigcommerce/shopify.checkout)
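
Two of the cross-channel comparisons above reduce to simple arithmetic, sketched here. The function names are hypothetical, and the inputs are assumed to be already fetched: a product-index doc count (for example from /_cat/indices), a catalog count from the sibling commerce channel, and checkout-completion rates expressed as fractions between 0 and 1.

```python
def product_sync_drift(index_doc_count, catalog_count, max_drift=100):
    """Absolute doc-count drift between search index and catalog, plus a breach flag."""
    drift = abs(index_doc_count - catalog_count)
    return drift, drift > max_drift


def checkout_drop_pp(baseline_rate, current_rate):
    """Drop in checkout-completion rate in percentage points (positive = worse)."""
    return (baseline_rate - current_rate) * 100.0
```

For instance, an index holding 10250 documents against a catalog of 10100 products drifts by 150 and is flagged, and a completion rate falling from 0.62 to 0.55 is a drop of about 7pp, above the 5pp limit.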

Severity thresholds

Signal                 Warn  Critical
connection_error_rate  1     5
query_p95_ms           200   500
replication_lag_sec    10    30
disk_usage_pct         85    90
slow_query_count       5     20
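
A threshold lookup for these signals might be sketched as follows. The values assume each row above reads as a warn value followed by a critical value (for example 1 and 5 for connection_error_rate, 5 and 20 for slow_query_count). Treating a value equal to a threshold as a breach is also an assumption, and `classify` is a hypothetical name.

```python
# (warn, critical) pairs, as read from the severity thresholds table.
THRESHOLDS = {
    "connection_error_rate": (1, 5),
    "query_p95_ms": (200, 500),
    "replication_lag_sec": (10, 30),
    "disk_usage_pct": (85, 90),
    "slow_query_count": (5, 20),
}


def classify(signal, value):
    """Map a signal value to 'ok', 'warn' or 'critical' (higher values are worse)."""
    warn, critical = THRESHOLDS[signal]
    if value >= critical:
        return "critical"
    if value >= warn:
        return "warn"
    return "ok"
```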

Data sources

  • GET https://{cluster}.es.region.aws.elastic.cloud:9243/_cluster/health - Cluster status (green/yellow/red), unassigned + relocating shards, node count
  • GET https://{cluster}.es.region.aws.elastic.cloud:9243/_cluster/stats - Cluster-wide indices, store size, doc counts, shard totals
  • GET https://{cluster}.es.region.aws.elastic.cloud:9243/_cluster/pending_tasks - Backlog of cluster-state updates (master overload signal)
  • GET https://{cluster}.es.region.aws.elastic.cloud:9243/_nodes/stats - Per-node JVM heap, GC, thread pools, breakers, indexing + search timers
  • GET https://{cluster}.es.region.aws.elastic.cloud:9243/_nodes/stats/http - HTTP current_open vs max_open for connection-pool saturation
  • GET https://{cluster}.es.region.aws.elastic.cloud:9243/_cat/shards - Per-shard state, node placement and size for skew + unassigned detection
  • GET https://{cluster}.es.region.aws.elastic.cloud:9243/_cat/indices - Per-index doc counts and store size for product-index drift
  • GET https://{cluster}.es.region.aws.elastic.cloud:9243/_snapshot/_status - In-progress + last snapshot state, age and per-shard success