At a glance
Queries per Second (live) is the rate at which the ClickHouse instance is accepting and executing queries, sampled in real time. For a platform team this is the single pulse that says “how busy is the database right now?” It is the denominator behind almost every other ratio on the board: error rate, slow-query rate, and latency percentiles all read differently at 50 QPS than at 5,000 QPS. A sudden QPS swing, up or down, is usually the first sign that something changed upstream: a deploy, a dashboard storm, a bot, or a stalled ingest pipeline.
| What it tracks | The number of queries the server starts per second, computed live from the Query event delta in system.events divided by the sampling interval. |
| Data source | Queries per Second (live) for the selected period, derived from the Query counter in system.events (a monotonic event counter) sampled at short intervals and differenced to produce a rate. |
| Metric basis | Query starts, not query completions. A long-running query counts once at start; it does not inflate QPS while it runs. This keeps QPS a clean arrival-rate signal. |
| Aggregation window | Real-time gauge (RT). The headline shows the latest sampled rate; the sparkline shows the recent trend. |
| Time window | RT (real-time) |
| Alert trigger | None. QPS is a context metric, not an alarm. Read it alongside the error, latency, and saturation cards, which carry their own thresholds. |
| What counts | All query starts the server records: SELECTs, INSERTs, DDL, and system queries, across the native, HTTP, and MySQL/PostgreSQL-compatible wire-protocol interfaces. |
| What does NOT count | Queries rejected before execution (for example, refused at the connection layer) and purely internal background operations (merges, mutations) that do not register as a Query event. |
| Roles | owner, engineering, operations |
Calculation
ClickHouse exposes a monotonic Query counter in system.events that increments every time a query starts. QPS is the delta of that counter over the sampling interval:

QPS = (Query count at current sample - Query count at previous sample) / seconds elapsed between the two samples

The engine reads the Query event at each sample, subtracts the previous sample, and divides by the elapsed seconds to produce the live rate. Because the counter is cumulative over the server's lifetime, a single reading is meaningless on its own; the rate only emerges from differencing two readings. On a multi-node cluster the card sums per-node rates to give cluster-wide QPS. See the At a glance summary for what the metric tracks and the worked example below for a typical reading.
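The differencing step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the engine's actual code: the names CounterSample and live_qps are hypothetical, and the sample values are made up. It assumes each sample pairs the cumulative Query value with the time it was read.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CounterSample:
    value: int       # cumulative Query event count read from system.events
    taken_at: float  # time of the reading, in seconds


def live_qps(prev: CounterSample, curr: CounterSample) -> Optional[float]:
    """Return queries per second between two samples, or None if unusable."""
    elapsed = curr.taken_at - prev.taken_at
    if elapsed <= 0:
        return None  # identical or out-of-order timestamps: no rate to derive
    if curr.value < prev.value:
        return None  # counter went backwards (server restart): discard interval
    return (curr.value - prev.value) / elapsed


# Made-up readings taken 5 seconds apart: 4,000 new queries in 5 s.
print(live_qps(CounterSample(1_000_000, 0.0), CounterSample(1_004_000, 5.0)))  # 800.0
```

Returning None for a backwards counter mirrors the behaviour described for restarts: the interval is dropped rather than reported as a negative rate.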
Worked example
A platform team runs ClickHouse behind a real-time product-analytics dashboard plus an event-ingest pipeline. Normal weekday QPS sits around 800. Snapshot sequence taken on 14 Apr 26 across the morning:

| Time (BST) | QPS (live) | What was happening |
|---|---|---|
| 08:30 | 780 | Baseline, overnight batch finished |
| 09:00 | 1,240 | Analysts arrive, dashboards refresh |
| 09:05 | 4,900 | Sudden spike |
| 09:20 | 1,180 | Back to morning normal |
- QPS alone never tells you if a spike is good or bad. Six times baseline could be a genuine traffic surge (great), a runaway dashboard with auto-refresh set to 1 second (wasteful), or a bot hammering an unauthenticated endpoint (a problem). To classify it, pair this card with ClickHouse QPS Spike vs Ecom Order Rate. If orders spiked too, it is real demand; if orders are flat, it is a dashboard storm or a bot.
- QPS reframes every ratio on the board. At 09:05 the Query Error Rate % card showed 0.8%. At baseline 800 QPS that is roughly 6 failed queries per second; at the spike’s 4,900 QPS the same 0.8% is roughly 39 failed queries per second. The percentage looked stable but the absolute failure volume jumped 6x. Always read error and slow-query percentages against the QPS denominator.
- A QPS collapse is as informative as a spike. If QPS suddenly drops toward zero while the application is plainly still serving users, the database is likely refusing connections (check Connection Pool Saturation %) or the ingest pipeline has stalled (check Inserts per Second (live)). A flat-line at zero during business hours is an outage signal, not a quiet period.
- If the spike turns out to be noise, trace it to its source in system.processes or system.query_log (group by initial_user or http_user_agent) and rate-limit it rather than scaling the cluster to serve waste.
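The error-rate arithmetic in the worked example can be checked with a short sketch. The function name failures_per_second is hypothetical; the inputs are the figures quoted above (0.8% error rate at 800 QPS and at 4,900 QPS).

```python
def failures_per_second(qps: float, error_rate_pct: float) -> float:
    """Convert an error percentage into absolute failed queries per second."""
    return qps * error_rate_pct / 100.0


baseline = failures_per_second(800, 0.8)   # about 6.4 failures per second
spike = failures_per_second(4_900, 0.8)    # about 39.2 failures per second

# Same percentage, roughly six times the absolute failure volume.
print(round(baseline, 1), round(spike, 1), round(spike / baseline, 1))
```

The percentage is identical in both calls; only the QPS denominator changes, which is why the card is read alongside every ratio on the board.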
Sibling cards platform teams should reference together
| Card | Why pair it with Queries per Second | What the combination tells you |
|---|---|---|
| Query Error Rate % | QPS is the denominator; error rate is the ratio. | A stable error percentage at rising QPS still means more absolute failures per second. |
| Query Latency p95 (ms) | Latency typically climbs as QPS approaches capacity. | Rising p95 with rising QPS equals queuing; rising p95 with flat QPS equals a slow query, not load. |
| Connection Pool Saturation % | A QPS collapse often means refused connections. | QPS dropping while saturation is at 100% confirms the pool is the bottleneck. |
| Inserts per Second (live) | The write-side slice of QPS, which counts reads and writes together. | QPS roughly steady but inserts at zero means the ingest pipeline stalled while reads carry on. |
| Slow-Query Rate % | High QPS plus high slow-query rate compounds load. | Tells you whether the spike is cheap point queries or expensive scans. |
| ClickHouse Health Score | The executive composite that contextualises QPS. | Confirms whether a busy instance is also a healthy one. |
| ClickHouse QPS Spike vs Ecom Order Rate | Classifies a spike as real demand or noise. | QPS up with orders up equals real; QPS up with orders flat equals dashboard storm or bot. |
Reconciling against the source
Where to look in ClickHouse’s own tooling:

- system.events for the cumulative Query counter: SELECT value FROM system.events WHERE event = 'Query'. Take two readings a few seconds apart and divide the difference by the elapsed time to reproduce the live rate.
- system.metrics for the instantaneous Query gauge (queries currently running), which is a different thing: running queries, not arrival rate.
- system.query_log for a historical, exact count: SELECT count() FROM system.query_log WHERE type = 'QueryStart' AND event_time >= now() - 60 gives queries started in the last minute; divide by 60 for QPS.
- ClickHouse Cloud console (managed service): the Metrics tab plots query rate per service over time.

Why our number may legitimately differ from a direct query:

| Reason | Direction | Why |
|---|---|---|
| Sampling vs exact log | Variable | The live card differences system.events over a short interval; system.query_log gives an exact retrospective count. Short-interval sampling can read slightly above or below the per-minute average during bursts. |
| Per-node vs cluster | Our number higher | The card sums per-node QPS for cluster-wide rate; a single-node query reflects one node only. |
| Counter reset on restart | One-off | system.events resets when the server restarts; a sample spanning a restart is discarded by the engine but a manual differenced reading would show a negative or nonsensical value. |
| Query type inclusion | Variable | The Query event counts all query types (SELECT, INSERT, DDL, system). A manual count filtered to SELECTs only will read lower. |
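
The "Sampling vs exact log" row is easiest to see with numbers. The sketch below uses a hypothetical minute of traffic (55 seconds at 800 queries/s followed by a 5-second burst at 4,900 queries/s) and an assumed 5-second sampling interval; the card's real interval may differ. The helper name window_rates is illustrative only.

```python
def window_rates(arrivals_per_second: list[int], window: int) -> list[float]:
    """Average rate over consecutive, non-overlapping windows of `window` seconds."""
    return [
        sum(arrivals_per_second[i:i + window]) / window
        for i in range(0, len(arrivals_per_second) - window + 1, window)
    ]


# Hypothetical minute: steady load, then a short burst at the end.
minute = [800] * 55 + [4_900] * 5

per_minute_qps = sum(minute) / 60          # what a query_log count / 60 would give
five_second_qps = window_rates(minute, 5)  # what short-interval differencing sees

print(round(per_minute_qps, 1))                    # 1141.7
print(min(five_second_qps), max(five_second_qps))  # 800.0 4900.0
```

Both views are correct for the same traffic: the per-minute average smooths the burst away, while the short windows read below it before the burst and well above it during the burst.
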
How this card should track related cards:

| Card | Expected relationship | What causes divergence |
|---|---|---|
| ClickHouse QPS Spike vs Ecom Order Rate | QPS should rise and fall roughly in step with storefront order/click rate. | QPS up with orders flat equals a non-shopper-driven spike (dashboard storm, bot, retry loop). |
| Storefront traffic cards | Genuine demand moves both. | A divergence is the signal that the QPS change is internal, not customer-driven. |
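
The comparison against order rate can be expressed as a simple rule. This is a sketch only: the function classify_qps_spike and both thresholds are assumptions chosen for illustration, not values the cards use, and each ratio is the current value divided by its own baseline.

```python
def classify_qps_spike(
    qps_ratio: float,
    order_ratio: float,
    spike_threshold: float = 2.0,   # assumed: QPS at 2x baseline counts as a spike
    follow_threshold: float = 1.5,  # assumed: orders at 1.5x baseline count as "up"
) -> str:
    """Label a QPS change using order rate as the reference for real demand."""
    if qps_ratio < spike_threshold:
        return "no spike"
    if order_ratio >= follow_threshold:
        return "real demand"
    return "internal noise (dashboard storm, bot, or retry loop)"


# The 09:05 reading from the worked example, with two hypothetical order ratios.
print(classify_qps_spike(4_900 / 800, order_ratio=1.0))
print(classify_qps_spike(4_900 / 800, order_ratio=5.5))
```

In practice the thresholds would need tuning per workload, since the usual ratio between queries and orders varies from one deployment to another.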
Known limitations / FAQs
Why is there no alert threshold on QPS?
QPS has no inherently good or bad value. 5,000 QPS is healthy for one cluster and a crisis for another. Alerting lives on the consequence metrics (error rate, latency, saturation), which carry their own thresholds. QPS is the context you read those alarms against, not an alarm itself.
What is the difference between the Query event and the Query metric?
The Query event in system.events is a cumulative counter of queries started since server boot; differencing it gives arrival rate (this card). The Query metric in system.metrics is an instantaneous gauge of queries running right now. High concurrency (gauge) and high arrival rate (this card) are related but distinct: a few long queries can make the gauge high while QPS stays low.
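One way to relate the two numbers is Little's law, which says that in a steady state the average number of queries in flight equals the arrival rate multiplied by the mean query duration. The sketch below applies it to two hypothetical workloads; the function name and figures are illustrative, and real traffic is rarely perfectly steady.

```python
def expected_running_queries(qps: float, mean_duration_s: float) -> float:
    """Little's law: average queries in flight = arrival rate x mean duration."""
    return qps * mean_duration_s


# Hypothetical workloads, not measurements.
point_lookups = expected_running_queries(qps=2_000, mean_duration_s=0.01)  # about 20 running
long_scans = expected_running_queries(qps=2, mean_duration_s=30.0)         # about 60 running

print(point_lookups, long_scans)
```

The second workload arrives a thousand times more slowly yet holds about three times as many queries open, which is the case where the gauge is high while QPS stays low.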
My manual count from system.query_log does not match the card. Why?
Two reasons. First, the card samples system.events over a short live interval while the log gives an exact retrospective count, so they differ during bursts. Second, system.query_log only records queries if logging is enabled and not sampled down; if log_queries is off or log_queries_probability is below 1, the log undercounts. The live card reads the event counter directly and is unaffected by log sampling.
Does QPS include INSERT queries?
Yes. The Query event counts every query start regardless of type. If you want read-only QPS, filter system.query_log by query_kind = 'Select'. For most capacity decisions the combined figure is what matters, because INSERTs compete for the same threads and connections as reads.
QPS dropped to near zero during business hours. Is the card broken?
Almost certainly not, and that drop is a serious signal. The usual causes are: the connection pool is full and refusing new connections (check Connection Pool Saturation %), the server is overloaded and queries are queuing rather than starting, or an upstream component stopped sending traffic. A genuine zero during business hours is an outage, not a quiet period.
How does QPS behave on a multi-node cluster?
The card sums per-node arrival rates into one cluster-wide QPS. If load is unbalanced (one node taking most queries), the cluster total can look healthy while one node is saturated. For per-node detail, query system.events on each node directly or use the Cloud console’s per-node view.
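A short sketch shows how a healthy-looking total can hide an unbalanced node. The function cluster_summary and the node names are hypothetical; the per-node rates are assumed to have been computed separately on each node.

```python
def cluster_summary(per_node_qps: dict[str, float]) -> dict[str, object]:
    """Sum per-node rates and report which node carries the largest share."""
    if not per_node_qps:
        raise ValueError("at least one node is required")
    total = sum(per_node_qps.values())
    busiest = max(per_node_qps, key=per_node_qps.get)
    share = per_node_qps[busiest] / total if total > 0 else 0.0
    return {"cluster_qps": total, "busiest_node": busiest, "busiest_share": share}


# Hypothetical three-node cluster: an ordinary total, with one node taking 80%.
summary = cluster_summary({"node-a": 960.0, "node-b": 120.0, "node-c": 120.0})
print(summary)
```

The cluster total alone would pass for normal load here; only the share figure reveals that node-a is doing most of the work.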
Does a server restart affect the reading?
The Query event counter resets to zero on restart. The engine detects the reset (a sample lower than the previous) and discards that interval rather than reporting a negative rate, so the live card stays clean across restarts. A manual differenced reading spanning a restart would show a misleading negative value.