What this audit checks
Authentication & access
- Connection URL / username / password authenticate against the HTTP interface (8123 self-hosted, 8443 Cloud)
- User has SELECT on system.metrics, system.events, system.query_log, system.parts, system.replicas and system.backups
- TLS enforced on the endpoint for ClickHouse Cloud (no plaintext 8123 to a Cloud service)
- SELECT version() and SELECT uptime() resolve - instance identity and build options readable
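The access checks above can be sketched as a small request builder. This is a minimal sketch, not the audit's actual collector: the function names `is_tls_enforced` and `build_ping_request` are hypothetical, and it assumes credentials are passed in the `X-ClickHouse-User` and `X-ClickHouse-Key` headers accepted by the ClickHouse HTTP interface.

```python
from urllib.parse import urlencode, urlsplit
from urllib.request import Request


def is_tls_enforced(url: str, is_cloud: bool) -> bool:
    """Return True when the endpoint satisfies the TLS rule in the checklist.

    Assumption: Cloud services must use https on port 8443; self-hosted
    instances may use plaintext 8123 and are not checked here.
    """
    if not is_cloud:
        return True
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.port == 8443


def build_ping_request(base_url: str, user: str, password: str) -> Request:
    """Build (but do not send) a request that reads version and uptime."""
    query = urlencode({"query": "SELECT version(), uptime()"})
    return Request(
        f"{base_url}/?{query}",
        headers={"X-ClickHouse-User": user, "X-ClickHouse-Key": password},
    )
```

Sending the request with `urllib.request.urlopen(req, timeout=...)` and receiving a 200 response would indicate that the credentials authenticate; checking the SELECT grants on each system table needs one further query per table.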
Connection & availability
- Instance answers a SELECT 1 ping within the collector poll window (30-60s)
- Connection pool saturation below 90% - HTTPConnection in use (system.metrics) vs the max_connections server setting
- Connections in use trending within band - no creeping leak toward max_connections
- Uptime stable - no unexplained restart resetting system.events counters mid-window
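The saturation and restart checks above reduce to two small comparisons. The sketch below assumes the collector has already read the in-use connection count, the `max_connections` limit, and two consecutive `uptime()` samples; both function names are hypothetical.

```python
def pool_saturation_pct(in_use: int, max_connections: int) -> float:
    """Share of the connection limit currently in use, as a percentage."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return 100.0 * in_use / max_connections


def restarted_between_polls(previous_uptime_s: int, current_uptime_s: int) -> bool:
    """An uptime that went down means the server restarted between polls,
    so system.events counters were reset inside the window."""
    return current_uptime_s < previous_uptime_s
```

A reading of 45 connections against a limit of 50 gives exactly 90%, which fails the "below 90%" rule. Detecting a slow leak would need a trend over many samples, which this sketch does not attempt.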
Query performance (p95 / slow queries)
- Query latency p95 below 200ms and p99 below 500ms from system.query_log percentile aggregation
- Slow-query rate below 5% - query_log entries with query_duration_ms > 1000 as share of total
- Failed queries in 24h below 100 - query_log type='ExceptionWhileProcessing'
- Top-10 slowest query patterns (normalized_query_hash) identified for an optimisation playbook
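One way to gather the latency figures above is a single aggregation over system.query_log, followed by a comparison against the stated limits. This is a sketch under assumptions: the 24-hour window is applied to every metric (the checklist only states it for failed queries), percentiles are taken over `QueryFinish` rows, and `LATENCY_SQL` and `evaluate_latency` are hypothetical names.

```python
# ClickHouse SQL, held as a string so the collector can send it over HTTP.
LATENCY_SQL = """
SELECT
    quantileIf(0.95)(query_duration_ms, type = 'QueryFinish') AS p95_ms,
    quantileIf(0.99)(query_duration_ms, type = 'QueryFinish') AS p99_ms,
    countIf(type = 'QueryFinish' AND query_duration_ms > 1000)
        / countIf(type = 'QueryFinish') AS slow_rate,
    countIf(type = 'ExceptionWhileProcessing') AS failed
FROM system.query_log
WHERE event_time > now() - INTERVAL 24 HOUR
"""


def evaluate_latency(row: dict) -> list[str]:
    """Return the names of the checks that fail for one result row."""
    failures = []
    if row["p95_ms"] >= 200:
        failures.append("p95")
    if row["p99_ms"] >= 500:
        failures.append("p99")
    if row["slow_rate"] >= 0.05:
        failures.append("slow_rate")
    if row["failed"] >= 100:
        failures.append("failed")
    return failures
```

Note that `quantile` in ClickHouse is an approximate aggregate, so values very close to a limit may flip between runs.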
Replication & lag
- Replication lag (absolute_delay) below 10s on every entry in system.replicas
- Replication queue_size draining - not stuck above 100 sustained on any table
- future_parts bounded - no replica accumulating un-replicated parts
- All expected replicas active and no replica in BROKEN / session-expired state
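The replication rules above can be applied row by row to the output of system.replicas. The sketch assumes each row is a dict keyed by the column names `absolute_delay`, `queue_size`, `is_readonly`, `is_session_expired`, `active_replicas` and `total_replicas`; the function name and the short problem labels are hypothetical. It looks at a single sample, so it cannot tell whether a large queue is sustained, and it leaves out `future_parts` because the checklist states no numeric bound for it.

```python
def replica_problems(row: dict) -> list[str]:
    """Return labels for the replication checks that one replica row fails."""
    problems = []
    if row["absolute_delay"] >= 10:
        problems.append("lag")
    if row["queue_size"] > 100:
        problems.append("queue")
    # Assumption: a read-only or session-expired replica counts as unhealthy.
    if row["is_readonly"] or row["is_session_expired"]:
        problems.append("state")
    if row["active_replicas"] < row["total_replicas"]:
        problems.append("inactive")
    return problems
```

Running this over every row and reporting any non-empty result gives a per-table view of which rule failed.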
Storage & capacity
- Database disk usage below 90% against the storage policy capacity
- Memory usage below 85% - MemoryTracking (system.metrics) vs the max_server_memory_usage server setting
- Active parts below 1000 on every table (system.parts WHERE active) - guards Too Many Parts (code 252)
- Too Many Parts errors in 24h equal to 0 and merges keeping pace with ingest (Merge metric)
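The part-count and disk checks above can be sketched as one query plus two helpers. The names `ACTIVE_PARTS_SQL`, `tables_over_part_limit` and `disk_usage_pct` are hypothetical, and the sketch assumes disk figures are available as free and total bytes (for example from the `free_space` and `total_space` columns of system.disks).

```python
# ClickHouse SQL: active part count per table.
ACTIVE_PARTS_SQL = """
SELECT database, table, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table
"""


def tables_over_part_limit(rows: list[dict], limit: int = 1000) -> list[str]:
    """Return 'database.table' for every table at or above the part limit."""
    return [
        f"{r['database']}.{r['table']}"
        for r in rows
        if r["active_parts"] >= limit
    ]


def disk_usage_pct(free_bytes: int, total_bytes: int) -> float:
    """Used share of a disk as a percentage."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * (total_bytes - free_bytes) / total_bytes
```

The limit is inclusive here because the checklist asks for counts "below 1000".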
Backups & durability
- Last successful BACKUP run under 72h ago (system.backups status='BACKUP_CREATED')
- No backup entry in BACKUP_FAILED state within the retention window
- ClickHouse Cloud managed snapshots present and recent for Cloud services
- Backup destination (S3 / Disk) reachable and last write verifiable
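The backup age and failure checks above can be sketched against rows read from system.backups. The sketch assumes each row is a dict with `name`, `status` and a timezone-aware `end_time`; the function names are hypothetical, and it does not cover Cloud managed snapshots or the destination reachability check.

```python
from datetime import datetime, timedelta, timezone


def last_backup_recent(rows: list[dict], now: datetime,
                       max_age: timedelta = timedelta(hours=72)) -> bool:
    """True when the newest BACKUP_CREATED entry finished less than max_age ago."""
    finished = [r["end_time"] for r in rows if r["status"] == "BACKUP_CREATED"]
    return bool(finished) and now - max(finished) < max_age


def failed_backups(rows: list[dict]) -> list[str]:
    """Names of backups recorded in the BACKUP_FAILED state."""
    return [r["name"] for r in rows if r["status"] == "BACKUP_FAILED"]
```

Passing `now` in explicitly, rather than calling `datetime.now(timezone.utc)` inside the function, keeps the check reproducible when replaying stored rows.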
Severity thresholds
| Signal | Warn | Critical |
|---|---|---|
| connection_error_rate | 1 | 5 |
| query_p95_ms | 200 | 500 |
| replication_lag_sec | 10 | 60 |
| disk_usage_pct | 85 | 90 |
| slow_query_count | 50 | 100 |
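The threshold table above maps directly onto a lookup. The sketch below copies the Warn and Critical values from the table and assumes both are inclusive lower bounds (a value equal to the threshold already triggers that level); `THRESHOLDS` and `classify` are hypothetical names.

```python
# signal name -> (warn, critical), taken from the severity table
THRESHOLDS = {
    "connection_error_rate": (1, 5),
    "query_p95_ms": (200, 500),
    "replication_lag_sec": (10, 60),
    "disk_usage_pct": (85, 90),
    "slow_query_count": (50, 100),
}


def classify(signal: str, value: float) -> str:
    """Return 'ok', 'warn' or 'critical' for one signal reading."""
    warn, critical = THRESHOLDS[signal]
    if value >= critical:
        return "critical"
    if value >= warn:
        return "warn"
    return "ok"
```

For example, `classify("disk_usage_pct", 87)` returns `"warn"`, since 87 sits between the Warn value of 85 and the Critical value of 90.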
Data sources
- GET https://{host}:8443/?query=SELECT%20*%20FROM%20system.build_options - Instance identity, version, build options
- GET https://{host}:8443/?query=SELECT%20*%20FROM%20system.metrics - Live gauges - HTTPConnection, Merge, MemoryTracking, in-flight queries
- GET https://{host}:8443/?query=SELECT%20*%20FROM%20system.events - Cumulative counters - InsertedRows, cache hits/misses, error codes
- GET https://{host}:8443/?query=SELECT%20*%20FROM%20system.query_log - Query latency percentiles, slow queries, exceptions
- GET https://{host}:8443/?query=SELECT%20*%20FROM%20system.parts - Active parts and partition counts per table
- GET https://{host}:8443/?query=SELECT%20*%20FROM%20system.replicas - absolute_delay, queue_size, future_parts, replica state
- GET https://{host}:8443/?query=SELECT%20*%20FROM%20system.backups - Last backup status and age for durability checks
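The data source URLs follow one pattern, so they can be generated rather than written out by hand. This is a small sketch with hypothetical names (`SYSTEM_TABLES`, `data_source_url`); it keeps `*` unescaped so the result matches the URLs as listed.

```python
from urllib.parse import quote

# System tables read by the audit, in the order listed.
SYSTEM_TABLES = [
    "build_options", "metrics", "events", "query_log",
    "parts", "replicas", "backups",
]


def data_source_url(host: str, table: str) -> str:
    """Build the GET URL for one system table on the HTTPS port 8443."""
    query = f"SELECT * FROM system.{table}"
    # quote() encodes spaces as %20; safe="*" leaves the asterisk as-is.
    return f"https://{host}:8443/?query={quote(query, safe='*')}"


urls = [data_source_url("ch.example.com", t) for t in SYSTEM_TABLES]
```

Because system.query_log can grow large, a real collector would likely add a time filter or a LIMIT to that query rather than reading the whole table.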