What this audit checks
Authentication & access
- Connection string uses mongodb+srv:// (Atlas) or mongodb:// with replica set seed list, not a single bare host
- Monitoring user holds the clusterMonitor built-in role plus read on the monitored DBs (slow-op KPIs unusable otherwise)
- SCRAM-SHA-256 auth in effect (MongoDB 4.0+ default) and TLS enforced for Atlas connections
- Atlas Admin API key (key ID + private key) present and project-scoped when cluster is Atlas-managed
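
The connection-string check above can be sketched as a small string test. This is a minimal sketch, not a full URI parser: the function name `uri_topology_ok` is hypothetical, and it assumes that "not a single bare host" means either more than one seed host or an explicit `replicaSet` option.

```python
def uri_topology_ok(uri: str) -> bool:
    """Return True for mongodb+srv:// URIs or mongodb:// URIs that name a replica set.

    Assumption: a mongodb:// URI passes if it lists several seed hosts
    or carries a replicaSet=... option; a single bare host fails.
    """
    if uri.startswith("mongodb+srv://"):
        return True
    if not uri.startswith("mongodb://"):
        return False
    rest = uri[len("mongodb://"):]
    authority, _, tail = rest.partition("/")
    # Drop optional "user:password@" credentials before splitting hosts.
    hosts = [h for h in authority.rpartition("@")[2].split(",") if h]
    _, _, query = tail.partition("?")
    has_replica_set = any(
        param.lower().startswith("replicaset=") for param in query.split("&")
    )
    return len(hosts) > 1 or has_replica_set
```

A driver's own URI parser would be the more robust choice in production; the sketch only illustrates the rule being audited.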
Connection & availability
- db.serverStatus() reachable and instance uptime present (no recent unexpected restart)
- Connection errors over 24h below threshold (connections refused / network resets)
- Connection pool saturation under 90% - connections.current / (connections.current + connections.available)
- Queued readers / writers (globalLock.currentQueue) not backing up under load; globalLock.activeClients only counts operations already running
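
The pool saturation ratio above can be computed directly from a serverStatus document. A minimal sketch, assuming the document has already been fetched and is available as a plain dict; the function name `pool_saturation` is hypothetical.

```python
def pool_saturation(server_status: dict) -> float:
    """connections.current / (connections.current + connections.available)."""
    conns = server_status["connections"]
    total = conns["current"] + conns["available"]
    if total == 0:
        return 0.0  # no capacity reported; treat as unsaturated rather than divide by zero
    return conns["current"] / total


# Example document shaped like the "connections" section of serverStatus.
sample = {"connections": {"current": 450, "available": 550}}
saturated = pool_saturation(sample) >= 0.90  # audit threshold from the checklist
```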
Query performance
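- Query latency p95 under 200ms; serverStatus opLatencies.reads.latency / opLatencies.reads.ops gives only the mean (in microseconds), so percentiles need sampled durations or a latency histogram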
- Query latency p99 under 500ms (tail latency not masking p95)
- Slow ops in trailing 15m under 10 - profiler entries with millis above slowms (default 100ms)
- COLLSCAN operations over 24h under 10 - full collection scans signal missing indexes or an unindexed code path
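
One way to evaluate the latency checks above is sketched below. It assumes durations are sampled from profiler entries (the `millis` field) and uses the nearest-rank percentile definition, which is a choice made here rather than something the audit prescribes. The function names are hypothetical, and the serverStatus document is assumed to be a plain dict.

```python
import math


def percentile(samples: list, pct: int):
    """Nearest-rank percentile of a non-empty list of durations."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def mean_read_latency_ms(server_status: dict) -> float:
    """Mean read latency; opLatencies.reads.latency is a running total in microseconds."""
    reads = server_status["opLatencies"]["reads"]
    if reads["ops"] == 0:
        return 0.0
    return reads["latency"] / reads["ops"] / 1000.0


def count_slow_ops(profile_entries: list, slowms: int = 100) -> int:
    """Count profiler entries whose millis exceeds slowms."""
    return sum(1 for entry in profile_entries if entry.get("millis", 0) > slowms)
```

Because the opLatencies counters are cumulative since startup, a trailing-window mean would need the difference between two snapshots rather than a single reading.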
Replication & lag
- Replication lag under 10s on every secondary, measured as the rs.status() optimeDate delta vs the primary
- No member stuck in RECOVERING / STARTUP2 / DOWN state (stateStr healthy across the set)
- Elections over 24h at most 1 - frequent elections indicate primary flapping from network or hardware instability
- Sharded clusters: chunk-balance skew under 20% and pending chunk migrations bounded (sh.status())
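
The replica set checks above can be sketched against an rs.status() document. This is a minimal sketch assuming the result is a dict whose `members` carry `name`, `stateStr`, and `optimeDate` as datetime values (as a driver would typically return them). The function names are hypothetical, and treating every state other than PRIMARY, SECONDARY, or ARBITER as unhealthy is an assumption.

```python
from datetime import datetime, timedelta

HEALTHY_STATES = {"PRIMARY", "SECONDARY", "ARBITER"}  # assumption: all other states are flagged


def secondary_lag_seconds(rs_status: dict) -> dict:
    """Map each secondary's name to its optimeDate delta behind the primary, in seconds."""
    members = rs_status["members"]
    primary = next((m for m in members if m["stateStr"] == "PRIMARY"), None)
    if primary is None:
        raise ValueError("no primary in replica set status")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


def unhealthy_members(rs_status: dict) -> list:
    """Names of members whose stateStr is outside the healthy set."""
    return [m["name"] for m in rs_status["members"] if m["stateStr"] not in HEALTHY_STATES]


# Illustrative data only; host names and timestamps are made up.
now = datetime(2024, 1, 1, 12, 0, 0)
sample = {
    "members": [
        {"name": "a:27017", "stateStr": "PRIMARY", "optimeDate": now},
        {"name": "b:27017", "stateStr": "SECONDARY", "optimeDate": now - timedelta(seconds=4)},
        {"name": "c:27017", "stateStr": "RECOVERING", "optimeDate": now - timedelta(seconds=90)},
    ]
}
lagging = {name: lag for name, lag in secondary_lag_seconds(sample).items() if lag >= 10}
```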
Storage & capacity
- Database disk usage under 90% from db.stats() storage and Atlas capacity surface
- WiredTiger cache hit rate at or above 95% - 1 - (pages read into cache / pages requested from the cache)
- WiredTiger dirty cache under 20% of configured maximum (above triggers eviction pressure)
- Resident memory (mem.resident) within tier headroom for the Atlas instance class
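
The two WiredTiger cache ratios above can be sketched from the `wiredTiger.cache` section of serverStatus. This sketch assumes the common pages-based definition of hit rate (pages read into cache over pages requested from the cache) and that the document is available as a plain dict; the function names are hypothetical.

```python
def cache_hit_rate(server_status: dict) -> float:
    """1 - (pages read into cache / pages requested from the cache)."""
    cache = server_status["wiredTiger"]["cache"]
    requested = cache["pages requested from the cache"]
    if requested == 0:
        return 1.0  # nothing requested yet, so nothing missed
    return 1.0 - cache["pages read into cache"] / requested


def dirty_cache_ratio(server_status: dict) -> float:
    """Tracked dirty bytes as a fraction of the configured cache maximum."""
    cache = server_status["wiredTiger"]["cache"]
    return cache["tracked dirty bytes in the cache"] / cache["maximum bytes configured"]
```

Both page counters are cumulative since startup, so a hit rate for a recent window would need the difference between two snapshots.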
Backups & durability
- Last successful backup less than 72h old - mongodump, Atlas continuous backup, or snapshot
- Atlas Cloud Backup enabled with a retention policy when cluster is Atlas-managed
- Oplog window long enough to cover the backup cadence plus restore lead time
- Write concern w:majority in effect for durability-critical writes (per connection string and app defaults)
Severity thresholds
| Signal | Warn | Critical |
|---|---|---|
| connection_error_rate | 1 | 5 |
| query_p95_ms | 200 | 500 |
| replication_lag_sec | 10 | 30 |
| disk_usage_pct | 80 | 90 |
| slow_query_count | 10 | 25 |
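
The threshold table can be applied with a small lookup, sketched below. The names `THRESHOLDS` and `classify` are hypothetical, and treating a value equal to a threshold as having reached that level is an assumption, based on the checklist phrasing checks as "under" a limit.

```python
# Warn / critical pairs copied from the severity table.
THRESHOLDS = {
    "connection_error_rate": (1, 5),
    "query_p95_ms": (200, 500),
    "replication_lag_sec": (10, 30),
    "disk_usage_pct": (80, 90),
    "slow_query_count": (10, 25),
}


def classify(signal: str, value: float) -> str:
    """Return "ok", "warn", or "critical" for a measured signal value."""
    warn, critical = THRESHOLDS[signal]
    if value >= critical:  # assumption: thresholds are inclusive
        return "critical"
    if value >= warn:
        return "warn"
    return "ok"
```

For example, `classify("query_p95_ms", 250)` falls between the warn and critical values and is reported as a warning.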
Data sources
- GET mongodb://{host}:{port}/{database} - Base connection - replica set or Atlas srv seed list
- GET db.serverStatus() - opcounters, connections, opLatencies, globalLock, mem, WiredTiger cache
- GET db.stats() - Storage size, data size, index size for disk usage
- GET db.system.profile.find() - Slow-op profiler entries - requires setProfilingLevel(1) with a slowms threshold, or setProfilingLevel(2)
- GET rs.status() - Replica set member states, optimeDate lag, election history
- GET sh.status() - Shard balance, pending chunk migrations (sharded clusters only)
- GET db.currentOp() - In-flight long-running operations and collection scans