At a glance
ClickHouse Health Score is a single 0 to 100 composite that rolls up the most load-bearing operational signals on your cluster: query error rate, latency, disk and memory pressure, replication lag, the parts/merge backlog, and recent fatal errors. It answers the first question an on-call DBA or platform lead asks: “in one number, is this cluster healthy right now?” A score of 90 or above means everything is nominal. A score below 70 fires an alert because at least one underlying signal has crossed into its danger band, which tells you the cluster needs attention before users feel it.
| | |
|---|---|
| What it tracks | ClickHouse Health Score for the selected period: a weighted composite of the cluster’s headline reliability and capacity signals, scored 0 (critical) to 100 (healthy). |
| Data source | Derived inside Vortex IQ from the underlying ClickHouse cards: system.query_log (error rate, latency), system.metrics (memory, connections), system.parts (parts/merge backlog), system.replicas (absolute_delay), and system.disks (free space). No single ClickHouse table emits this number; it is computed. |
| Time window | RT/7D: a real-time gauge for “now”, with the 7-day trend line showing whether the cluster is drifting up or down. |
| Alert trigger | < 70. Any sustained score under 70 means one or more component signals are in their alert band; the composite is designed so a single severe breach (for example replication lag past threshold) can pull it under 70 on its own. |
| Roles | owner, engineering, operations |
Calculation
The score is a weighted blend of the component cards, each normalised to a 0 to 100 sub-score and then combined. The weighting prioritises signals that directly cause user-visible failure (errors, latency, capacity exhaustion) over signals that are early warnings (parts backlog, cache hit rate). A worked breakdown follows; the exact weights are tuned per profile in the Sensitivity tab, but the default shape is:

| Component signal | Source card | Default weight | Healthy band |
|---|---|---|---|
| Query error rate | Query Error Rate % | 20% | < 0.5% |
| Query latency (p95/p99) | Query Latency p95 (ms) | 15% | p95 < 200ms |
| Disk usage | Database Disk Usage % | 15% | < 75% |
| Memory usage | Memory Usage % | 10% | < 70% |
| Replication lag | Replication Lag (absolute_delay) | 15% | < 2s |
| Parts / merge backlog | Active Parts (Top 10 Tables) | 10% | < 300 parts/table |
| Fatal-class errors | Failed Queries (24h) | 15% | 0 Too Many Parts / OOM errors |
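
The blend described above can be sketched as a plain weighted average. This is a minimal illustration, not Vortex IQ's implementation: the dictionary keys and the `composite_score` function are hypothetical names, the weights are the defaults from the table, and any extra penalty the production scoring may apply on a severe breach is not modelled.

```python
# Default weights (percent) copied from the table above.
# The key names are hypothetical labels chosen for this sketch.
DEFAULT_WEIGHTS = {
    "query_error_rate": 20,
    "query_latency": 15,
    "disk_usage": 15,
    "memory_usage": 10,
    "replication_lag": 15,
    "parts_backlog": 10,
    "fatal_errors": 15,
}


def composite_score(sub_scores, weights=DEFAULT_WEIGHTS):
    """Linear weighted blend of 0-100 sub-scores; returns a 0-100 composite."""
    if set(sub_scores) != set(weights):
        raise ValueError("sub_scores and weights must cover the same components")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    for name, value in sub_scores.items():
        if not 0 <= value <= 100:
            raise ValueError(f"sub-score for {name} is outside 0-100")
    # Each component contributes its sub-score scaled by its share of the weight.
    blended = sum(weights[name] * sub_scores[name] for name in weights)
    return blended / total_weight


all_healthy = {name: 100 for name in DEFAULT_WEIGHTS}
print(composite_score(all_healthy))  # 100.0
```

Dividing by the weight total rather than by a fixed 100 keeps the result on the 0 to 100 scale even when a profile's tuned weights do not sum to exactly 100.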
Worked example
A SaaS analytics team runs a 3-node ClickHouse cluster behind a customer-facing reporting product. On 14 Apr 26 at 09:40 the morning reporting peak begins, and the on-call DBA glances at the Health Score gauge. It has dropped from a steady 94 overnight to 68, tripping the < 70 alert.
The gauge alone says “something is wrong”. The component breakdown says where:
| Component | Reading | Sub-score | Contribution lost |
|---|---|---|---|
| Query error rate | 0.3% | 96 | -0.8 |
| Query latency p95 | 240ms | 72 | -4.2 |
| Disk usage | 71% | 88 | -1.8 |
| Memory usage | 69% | 82 | -1.8 |
| Replication lag | 14s | 0 | -15.0 |
| Parts backlog | 410 parts on events_local | 64 | -3.6 |
| Failed (fatal) queries | 0 | 100 | 0 |
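
The "Contribution lost" column can be reproduced from the default weights and the sub-scores in this table. The sketch below assumes each loss is simply weight × (100 − sub-score) / 100; the `contribution_lost` helper is a hypothetical name. Summed this way the rows come to 27.2 points, a linear blend of 72.8, so the gauge reading of 68 is presumably shaped by scoring behaviour this page does not spell out (for example an extra penalty for a zeroed component), which the sketch makes no attempt to model.

```python
# (component, default weight in %, sub-score) taken from the tables above.
ROWS = [
    ("Query error rate", 20, 96),
    ("Query latency p95", 15, 72),
    ("Disk usage", 15, 88),
    ("Memory usage", 10, 82),
    ("Replication lag", 15, 0),
    ("Parts backlog", 10, 64),
    ("Failed (fatal) queries", 15, 100),
]


def contribution_lost(weight_pct, sub_score):
    """Points removed from a perfect 100 by one component (linear assumption)."""
    return weight_pct * (100 - sub_score) / 100


losses = {name: round(contribution_lost(w, s), 1) for name, w, s in ROWS}
total_lost = round(sum(losses.values()), 1)
linear_composite = round(100 - total_lost, 1)

# Largest loss first, which is the order to triage in.
for name, lost in sorted(losses.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<24} -{lost:.1f}")
print(f"total lost {total_lost}, linear blend {linear_composite}")
```

Sorting by loss puts replication lag at the top of the list, which is the reading the component table is meant to give at a glance.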
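The breakdown points to two places: replication lag, which at 14s has zeroed its sub-score, and the parts backlog on events_local, which at 410 active parts is creeping toward the 1000-part danger line and dragging merge throughput, which in turn lifts latency.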
- The composite is a triage tool, not a diagnosis. A score of 68 tells you to look; the component table tells you to look at replication and parts, not at error rate or disk. Always open the breakdown before acting.
- One severe signal can sink the whole score. Replication lag alone removed 15 points. Do not assume a 68 means “everything is slightly bad”; it usually means “one thing is badly wrong and a few things are mildly stressed”.
- The 7-day trend matters as much as the live value. A score that dips to 68 every morning at the reporting peak and recovers by 11:00 is a capacity-planning signal (the cluster is undersized for peak). A score that has been sliding from 94 to 68 over a week with no recovery is a leak or an unbounded growth problem (parts, disk) that will eventually become an outage.
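
The two 7-day patterns in the last point can be told apart with a simple heuristic. The sketch below is an illustration only: `classify_week` and its rules are assumptions, not how the trend line is evaluated in the product. It reuses the 70 (alert) and 90 (nominal) thresholds from this page and takes one (daily minimum, daily maximum) pair per day.

```python
ALERT = 70    # alert threshold from this page
NOMINAL = 90  # "everything is nominal" threshold from this page


def classify_week(daily_ranges):
    """Classify a week of scores given (daily_min, daily_max) pairs, oldest first."""
    if not daily_ranges:
        raise ValueError("need at least one day of readings")
    # Days that dipped into the alert band but still reached nominal.
    dip_and_recover = sum(
        1 for low, high in daily_ranges if low < ALERT and high >= NOMINAL
    )
    highs = [high for _, high in daily_ranges]
    # True when the daily best never improves from one day to the next.
    never_recovers = all(later <= earlier for earlier, later in zip(highs, highs[1:]))
    if never_recovers and len(highs) > 1 and highs[0] >= NOMINAL and highs[-1] < ALERT:
        return "sustained slide"
    if dip_and_recover > len(daily_ranges) / 2:
        return "recurring peak dip"
    return "no clear pattern"


morning_peaks = [(68, 94)] * 7
week_long_slide = [(92, 94), (88, 91), (84, 87), (80, 83), (76, 79), (71, 75), (66, 68)]
print(classify_week(morning_peaks))
print(classify_week(week_long_slide))
```

A recurring peak dip is the capacity-planning case; a sustained slide is the leak or unbounded-growth case.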
Sibling cards
| Card | Why pair it with ClickHouse Health Score | What the combination tells you |
|---|---|---|
| Database Disk Usage % | The capacity component most likely to cause a hard stop. | Score down plus disk high equals the cluster is approaching a write-halt; act before it hits 100%. |
| Query Error Rate % | The reliability component users feel first. | Score down with error rate spiking equals customer-visible failures are happening now. |
| Query Latency p95 (ms) | The performance component behind “the dashboard is slow”. | Score down with p95 high but errors flat equals degraded, not broken; slow but serving. |
| Replication Lag (absolute_delay) | A single high-weight signal that can sink the score alone. | Score down with lag past threshold equals stale reads on followers; check the leader’s write load. |
| Active Parts (Top 10 Tables) | The early-warning component before a Too Many Parts halt. | Score sliding with parts climbing equals merge throughput is losing to ingest. |
| Memory Usage % | The capacity component behind OOM kills. | Score down with memory high foreshadows MEMORY_LIMIT_EXCEEDED (24h). |
| Failed Queries (24h) | The fatal-error component with the heaviest single-event impact. | Score down with failures rising equals real, logged query failures, not just slowness. |
| Instance Uptime | Context: was there a restart that explains a transient dip? | A score that recovered right after a restart equals a self-heal, not a fix. |
Reconciling against the source
There is no native ClickHouse “health score”. ClickHouse does not emit a single composite reliability figure, so you cannot run one command and expect to see 68. Instead, you reconcile the components and confirm that the composite is telling the truth.

| What to check | Native command / table | What it confirms |
|---|---|---|
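| Error rate component | SELECT countIf(type='ExceptionWhileProcessing') / countIf(type IN ('QueryFinish', 'ExceptionWhileProcessing')) FROM system.query_log WHERE event_time > now() - INTERVAL 5 MINUTE | Matches the Query Error Rate % sub-score input. The result is a fraction (multiply by 100 for a percentage); QueryStart rows are excluded so each query is counted once. |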
| Latency component | SELECT quantile(0.95)(query_duration_ms) FROM system.query_log WHERE event_time > now() - INTERVAL 5 MINUTE AND type='QueryFinish' | Matches the p95 latency input. |
| Replication component | SELECT database, table, absolute_delay FROM system.replicas WHERE absolute_delay > 0 | Confirms the lag that zeroed a sub-score. |
| Parts component | SELECT table, count() FROM system.parts WHERE active GROUP BY table ORDER BY count() DESC LIMIT 10 | Confirms the parts backlog input. |
| Capacity components | SELECT metric, value FROM system.metrics WHERE metric = 'MemoryTracking'; and, separately, SELECT free_space, total_space FROM system.disks | Confirms memory and disk inputs. |
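
The system.disks query returns byte counts rather than a percentage, so one small conversion is needed before comparing against the Database Disk Usage % reading. A minimal sketch follows; `disk_usage_pct` is a hypothetical helper, and whether the card reports the fullest disk or an aggregate across disks is not stated here, so compare per disk first.

```python
def disk_usage_pct(free_space, total_space):
    """Percent of a disk in use, from system.disks byte counts."""
    if total_space <= 0:
        raise ValueError("total_space must be positive")
    if not 0 <= free_space <= total_space:
        raise ValueError("free_space must be between 0 and total_space")
    return 100 * (total_space - free_space) / total_space


# Example row: 290 GB free on a 1000 GB disk.
print(disk_usage_pct(290 * 10**9, 1000 * 10**9))  # 71.0
```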
Known limitations / FAQs
The score is 68 but every dashboard I check looks fine. Why?
Open the component breakdown. The composite weights signals that are not always visible on a generic dashboard, replication lag and parts backlog in particular. A cluster can serve reads perfectly while a follower replica drifts behind or a table accumulates parts; both are real problems that will surface later. The score is doing its job: surfacing the slow-bleed signal before it becomes an outage.
Can a single component drag the whole score below 70?
Yes, by design. Replication lag, error rate, and disk usage each carry enough weight (15 to 20%) that a severe breach of one zeroes its sub-score and can pull the composite under 70 on its own. This is intentional: one badly broken thing is an alert-worthy state even if everything else is green.
Why is there no native ClickHouse command that returns this number?
ClickHouse exposes raw metrics and system tables, not opinionated composites. The Health Score is Vortex IQ’s roll-up of those raw signals into a single triage figure. To verify it, reconcile each component against its system.* table as shown above.
How do I change which signals count and how much?
The default weights live in the profile’s Sensitivity tab. A team that runs single-node (no replication) should drop the replication weight to zero and redistribute it; a team with cheap, auto-expanding disk may lower the disk weight. Tune the composite to your topology rather than accepting the generic default.
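Before changing weights in the Sensitivity tab, it can help to work out what a redistribution would look like. The sketch below assumes a proportional redistribution, which is only one reasonable choice; the page does not say how freed weight should be spread. `drop_and_redistribute` and the key names are hypothetical.

```python
# Default weights (percent) from the Calculation table; key names are hypothetical.
DEFAULT_WEIGHTS = {
    "query_error_rate": 20,
    "query_latency": 15,
    "disk_usage": 15,
    "memory_usage": 10,
    "replication_lag": 15,
    "parts_backlog": 10,
    "fatal_errors": 15,
}


def drop_and_redistribute(weights, component):
    """Zero one component's weight and spread it over the rest in proportion."""
    if component not in weights:
        raise KeyError(component)
    remaining = {k: v for k, v in weights.items() if k != component}
    remaining_total = sum(remaining.values())
    if remaining_total <= 0:
        raise ValueError("no other weighted component to redistribute to")
    original_total = sum(weights.values())
    # Scale every remaining weight so the overall total is unchanged.
    scaled = {k: v * original_total / remaining_total for k, v in remaining.items()}
    scaled[component] = 0.0
    return scaled


single_node = drop_and_redistribute(DEFAULT_WEIGHTS, "replication_lag")
for name, weight in single_node.items():
    print(f"{name:<18} {weight:.1f}%")
```

Proportional scaling preserves the relative priority of the remaining signals; a team that would rather push the freed 15% entirely onto capacity or error signals can set those weights by hand instead.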
Does the score account for a planned restart or maintenance window?
Not automatically. A restart resets uptime and briefly perturbs latency and connection counts, which can dip the score for a few minutes. Cross-check Instance Uptime; a dip that coincides with a low uptime value is a restart artefact, not a fault. Maintenance-window suppression is configured per profile.
The score recovered on its own. Was the problem fixed?
Not necessarily. Many ClickHouse stresses are self-limiting: a heavy backfill finishes, merges catch up, replication closes the gap. The score recovers because the transient load passed, not because the underlying capacity issue was resolved. If the same dip recurs at the same time each day, treat it as a capacity-planning signal, not a one-off.
We run a single node with no replicas. Does the replication weight hurt our score?
On a single node, system.replicas is empty, so the replication sub-score defaults to healthy (100) and contributes its full weight as a positive. If you prefer the weight redistributed to capacity and error signals, set the replication weight to zero in the Sensitivity tab.