Queries per Second (live), ClickHouse

Card class: Hero • Category: Executive Overview

At a glance

Queries per Second (live) is the rate at which the ClickHouse instance is accepting and executing queries, sampled in real time. For a platform team this is the single pulse that says “how busy is the database right now?” It is the denominator behind almost every other ratio on the board: error rate, slow-query rate, and latency percentiles all read differently at 50 QPS than at 5,000 QPS. A sudden QPS swing, up or down, is usually the first sign that something changed upstream: a deploy, a dashboard storm, a bot, or a stalled ingest pipeline.


What it tracks	The number of queries the server starts per second, computed live from the `Query` event delta in `system.events` divided by the sampling interval.
Data source	The `Query` counter in `system.events` (a monotonic, server-lifetime event counter), sampled at short intervals and differenced to produce a rate.
Metric basis	Query starts, not query completions. A long-running query counts once at start; it does not inflate QPS while it runs. This keeps QPS a clean arrival-rate signal.
Aggregation window	Real-time gauge (`RT`). The headline shows the latest sampled rate; the sparkline shows the recent trend.
Time window	`RT` (real-time)
Alert trigger	None. QPS is a context metric, not an alarm. Read it alongside the error, latency, and saturation cards which carry their own thresholds.
What counts	All query starts the server records: SELECTs, INSERTs, DDL, and system queries, across native, HTTP, and wire-protocol interfaces.
What does NOT count	Queries rejected before execution (for example, refused at the connection layer) and purely internal background operations (merges, mutations) that do not register as a `Query` event.
Roles	owner, engineering, operations

Calculation

ClickHouse exposes a monotonic Query counter in system.events that increments every time a query starts. QPS is the delta of that counter over the sampling interval:

-- Two samples a few seconds apart, differenced into a rate.
-- Conceptually:
--   qps = (Query_now - Query_prev) / (t_now - t_prev)
SELECT value AS query_count_now
FROM system.events
WHERE event = 'Query';

The engine reads the Query event at each sample, subtracts the previous sample, and divides by the elapsed seconds to produce the live rate. Because system.events is a server-lifetime cumulative counter, a single reading is meaningless on its own; the rate only emerges from differencing two readings. On a multi-node cluster the card sums per-node rates to give cluster-wide QPS. See the At a glance summary for what the metric tracks and the worked example below for a typical reading.
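The differencing step can be sketched in a few lines of Python. This is a minimal illustration, not the engine's actual code: the `Sample` type, the function names, and the example counter values are all hypothetical, and the choice to skip a node whose interval was discarded when summing the cluster rate is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Sample:
    """One reading of the cumulative Query counter (hypothetical type)."""
    count: int        # value of the Query event in system.events
    taken_at: float   # sample time in seconds, e.g. a Unix timestamp


def qps(prev: Sample, now: Sample) -> Optional[float]:
    """Queries per second between two samples, or None if the pair is unusable."""
    elapsed = now.taken_at - prev.taken_at
    if elapsed <= 0:
        return None  # duplicate or out-of-order sample
    if now.count < prev.count:
        return None  # counter went backwards: server restarted, discard interval
    return (now.count - prev.count) / elapsed


def cluster_qps(per_node_rates: Iterable[Optional[float]]) -> float:
    """Sum per-node rates; nodes with a discarded interval are skipped (assumption)."""
    return sum(rate for rate in per_node_rates if rate is not None)


# Illustrative readings taken five seconds apart on three nodes.
node_a = qps(Sample(1_000_000, 0.0), Sample(1_002_500, 5.0))
node_b = qps(Sample(2_000_000, 0.0), Sample(2_001_500, 5.0))
restarted = qps(Sample(3_000_000, 0.0), Sample(120, 5.0))  # counter reset
total = cluster_qps([node_a, node_b, restarted])
```

A single `Sample` never yields a rate on its own; only a pair does, which is why a lone reading of the counter is meaningless.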

Worked example

A platform team runs ClickHouse behind a real-time product-analytics dashboard plus an event-ingest pipeline. Normal weekday QPS sits around 800. Snapshot sequence taken on 14 Apr 26 across the morning:

Time (BST)	QPS (live)	What was happening
08:30	780	Baseline, overnight batch finished
09:00	1,240	Analysts arrive, dashboards refresh
09:05	4,900	Sudden spike
09:20	1,180	Back to morning normal

The 09:05 spike to 4,900 QPS is six times baseline. Three readings the team should take from this card:

QPS alone never tells you if a spike is good or bad. Six times baseline could be a genuine traffic surge (great), a runaway dashboard with auto-refresh set to 1 second (wasteful), or a bot hammering an unauthenticated endpoint (a problem). To classify it, pair this card with ClickHouse QPS Spike vs Ecom Order Rate. If orders spiked too, it is real demand; if orders are flat, it is a dashboard storm or a bot.
QPS reframes every ratio on the board. At 09:05 the Query Error Rate % card showed 0.8%. At baseline 800 QPS that is roughly 6 failed queries per second; at the spike’s 4,900 QPS the same 0.8% is roughly 39 failed queries per second. The percentage looked stable but the absolute failure volume jumped 6x. Always read error and slow-query percentages against the QPS denominator.
A QPS collapse is as informative as a spike. If QPS suddenly drops toward zero while the application is plainly still serving users, the database is likely refusing connections (check Connection Pool Saturation %) or the ingest pipeline has stalled (check Inserts per Second (live)). A flat-line at zero during business hours is an outage signal, not a quiet period.

Sizing the 09:05 spike:
  - Baseline QPS:        800
  - Spike QPS:           4,900  (6.1x baseline)
  - Error rate held at:  0.8%
  - Failed queries/sec:  baseline ~6  ->  spike ~39
  - p95 latency moved:   42ms -> 118ms (queuing under load)
  - Verdict: classify against order rate before scaling.
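The failed-queries arithmetic in the sizing block can be reproduced directly. The sketch below uses only the figures from the worked example; the function name is illustrative.

```python
def failed_per_second(qps: float, error_rate_pct: float) -> float:
    """Absolute failures per second implied by a QPS and an error percentage."""
    return qps * error_rate_pct / 100.0


baseline_failures = failed_per_second(800, 0.8)    # about 6 per second
spike_failures = failed_per_second(4900, 0.8)      # about 39 per second
growth = spike_failures / baseline_failures        # same ratio as the QPS growth

print(f"baseline ~{baseline_failures:.1f}/s, spike ~{spike_failures:.1f}/s")
```

Because the error percentage is unchanged, the failure volume grows by exactly the same factor as QPS, which is the point of reading the percentage against its denominator.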

The correct response to a QPS spike is to classify before reacting. If ClickHouse QPS Spike vs Ecom Order Rate shows orders rising in step, scale capacity. If orders are flat, find the noisy client in system.processes or system.query_log (group by initial_user or http_user_agent) and rate-limit it rather than scaling the cluster to serve waste.
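The classify-before-reacting rule might be expressed as a small decision function. Everything numeric here is an assumption made for illustration: the `spike_factor` and `in_step_tolerance` thresholds are not defined anywhere in this card, and the order counts in the usage lines are hypothetical.

```python
def classify_spike(qps_now: float, qps_baseline: float,
                   orders_now: float, orders_baseline: float,
                   spike_factor: float = 2.0,
                   in_step_tolerance: float = 0.5) -> str:
    """Classify a QPS change as real demand or noise (thresholds are assumptions)."""
    qps_ratio = qps_now / qps_baseline
    if qps_ratio < spike_factor:
        return "no spike"
    order_ratio = orders_now / orders_baseline
    # Orders count as "in step" if they grew by at least a set fraction
    # of the QPS growth.
    if order_ratio >= 1 + in_step_tolerance * (qps_ratio - 1):
        return "real demand: scale capacity"
    return "noise: find and rate-limit the noisy client"


# Hypothetical order rates per minute alongside the 09:05 reading.
orders_rising = classify_spike(4900, 800, orders_now=610, orders_baseline=100)
orders_flat = classify_spike(4900, 800, orders_now=105, orders_baseline=100)
morning_ramp = classify_spike(1240, 800, orders_now=150, orders_baseline=100)
```

With orders rising in step the function recommends scaling; with flat orders it points at the noisy client instead.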

Sibling cards platform teams should reference together

Card	Why pair it with Queries per Second	What the combination tells you
Query Error Rate %	QPS is the denominator; error rate is the ratio.	A stable error percentage at rising QPS still means more absolute failures per second.
Query Latency p95 (ms)	Latency typically climbs as QPS approaches capacity.	Rising p95 with rising QPS equals queuing; rising p95 with flat QPS equals a slow query, not load.
Connection Pool Saturation %	A QPS collapse often means refused connections.	QPS dropping while saturation is at 100% confirms the pool is the bottleneck.
Inserts per Second (live)	The write-side slice of QPS, which counts reads and writes together.	QPS largely steady but inserts at zero means the ingest pipeline stalled while reads carry on.
Slow-Query Rate %	High QPS plus high slow-query rate compounds load.	Tells you whether the spike is cheap point queries or expensive scans.
ClickHouse Health Score	The executive composite that contextualises QPS.	Confirms whether a busy instance is also a healthy one.
ClickHouse QPS Spike vs Ecom Order Rate	Classifies a spike as real demand or noise.	QPS up with orders up equals real; QPS up with orders flat equals dashboard storm or bot.

Reconciling against the source

Where to look in ClickHouse’s own tooling:

system.events for the cumulative Query counter: SELECT value FROM system.events WHERE event = 'Query'. Take two readings a few seconds apart and divide the difference by the elapsed time to reproduce the live rate.
system.metrics for the instantaneous Query gauge (queries currently running). This is a different quantity: running queries, not arrival rate.
system.query_log for a historical, exact count: SELECT count() FROM system.query_log WHERE type = 'QueryStart' AND event_time >= now() - 60 gives queries started in the last minute; divide by 60 for QPS.
ClickHouse Cloud console (managed service): the Metrics tab plots query rate per service over time.

Why our number may legitimately differ from a direct query:

Reason	Direction	Why
Sampling vs exact log	Variable	The live card differences `system.events` over a short interval; `system.query_log` gives an exact retrospective count. Short-interval sampling can read slightly above or below the per-minute average during bursts.
Per-node vs cluster	Our number higher	The card sums per-node QPS for cluster-wide rate; a single-node query reflects one node only.
Counter reset on restart	One-off	`system.events` resets when the server restarts; a sample spanning a restart is discarded by the engine but a manual differenced reading would show a negative or nonsensical value.
Query type inclusion	Variable	The `Query` event counts all query types (SELECT, INSERT, DDL, system). A manual count filtered to SELECTs only will read lower.
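When reconciling by hand, it can help to turn a `system.query_log` count into a rate and express the gap against the card as a percentage. The sketch below is illustrative only. The function names and input figures are hypothetical, and scaling the count by `1 / log_queries_probability` is a statistical estimate assumed here, not something the card does.

```python
def qps_from_log(query_start_rows: int, window_seconds: float = 60.0,
                 log_probability: float = 1.0) -> float:
    """Estimate QPS from a count of QueryStart rows over a window.

    log_probability mirrors the log_queries_probability setting; dividing by it
    is an assumed correction for sampled logging, not an exact recovery.
    """
    if not 0 < log_probability <= 1:
        raise ValueError("log_probability must be in (0, 1]")
    return query_start_rows / log_probability / window_seconds


def relative_gap(card_qps: float, log_qps: float) -> float:
    """Signed gap of the card reading relative to the log-derived rate."""
    return (card_qps - log_qps) / log_qps


full_logging = qps_from_log(48_000)                       # every query logged
half_sampled = qps_from_log(24_000, log_probability=0.5)  # half of queries logged
gap = relative_gap(card_qps=820.0, log_qps=full_logging)
```

A small positive or negative gap during a burst is consistent with the sampling difference described in the table; a large persistent gap suggests checking node scope or query-type filters first.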

Cross-connector reconciliation:

Card	Expected relationship	What causes divergence
ClickHouse QPS Spike vs Ecom Order Rate	QPS should rise and fall roughly in step with storefront order/click rate.	QPS up with orders flat equals a non-shopper-driven spike (dashboard storm, bot, retry loop).
Storefront traffic cards	Genuine demand moves both.	A divergence is the signal that the QPS change is internal, not customer-driven.

Known limitations / FAQs

Why is there no alert threshold on QPS?
QPS has no inherently good or bad value. 5,000 QPS is healthy for one cluster and a crisis for another. Alerting lives on the consequence metrics (error rate, latency, saturation), which carry their own thresholds. QPS is the context you read those alarms against, not an alarm itself.

What is the difference between the Query event and the Query metric?
The Query event in system.events is a cumulative counter of queries started since server boot; differencing it gives arrival rate (this card). The Query metric in system.metrics is an instantaneous gauge of queries running right now. High concurrency (gauge) and high arrival rate (this card) are related but distinct: a few long queries can make the gauge high while QPS stays low.

My manual count from system.query_log does not match the card. Why?
Two reasons. First, the card samples system.events over a short live interval while the log gives an exact retrospective count, so they differ during bursts. Second, system.query_log only records queries if logging is enabled and not sampled down; if log_queries is off or log_queries_probability is below 1, the log undercounts. The live card reads the event counter directly and is unaffected by log sampling.

Does QPS include INSERT queries?
Yes. The Query event counts every query start regardless of type. If you want read-only QPS, filter system.query_log by query_kind = 'Select'. For most capacity decisions the combined figure is what matters, because INSERTs compete for the same threads and connections as reads.

QPS dropped to near zero during business hours. Is the card broken?
Almost certainly not, and that drop is a serious signal. The usual causes are: the connection pool is full and refusing new connections (check Connection Pool Saturation %), the server is overloaded and queries are queuing rather than starting, or an upstream component stopped sending traffic. A genuine zero during business hours is an outage, not a quiet period.

How does QPS behave on a multi-node cluster?
The card sums per-node arrival rates into one cluster-wide QPS. If load is unbalanced (one node taking most queries), the cluster total can look healthy while one node is saturated. For per-node detail, query system.events on each node directly or use the Cloud console’s per-node view.

Does a server restart affect the reading?
The Query event counter resets to zero on restart. The engine detects the reset (a sample lower than the previous) and discards that interval rather than reporting a negative rate, so the live card stays clean across restarts. A manual differenced reading spanning a restart would show a misleading negative value.
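For the multi-node case, a quick way to spot an unbalanced cluster is to look at each node's share of the summed rate. This is a sketch under stated assumptions: the node names, the per-node rates, and the 50% `max_share` cut-off are all hypothetical and would need tuning per cluster.

```python
from typing import Dict, List


def node_shares(per_node_qps: Dict[str, float]) -> Dict[str, float]:
    """Fraction of cluster-wide QPS handled by each node."""
    total = sum(per_node_qps.values())
    if total == 0:
        return {node: 0.0 for node in per_node_qps}
    return {node: rate / total for node, rate in per_node_qps.items()}


def hot_nodes(per_node_qps: Dict[str, float], max_share: float = 0.5) -> List[str]:
    """Nodes carrying more than max_share of total QPS (threshold is an assumption)."""
    shares = node_shares(per_node_qps)
    return sorted(node for node, share in shares.items() if share > max_share)


# The same 4,900 QPS cluster total, with very different per-node pictures.
skewed = hot_nodes({"node-1": 3900.0, "node-2": 500.0, "node-3": 500.0})
balanced = hot_nodes({"node-1": 1700.0, "node-2": 1600.0, "node-3": 1600.0})
```

Both inputs sum to the same cluster total, yet only the first flags a node, which is the situation the cluster-wide figure hides.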

Tracked live in Vortex IQ Nerve Centre

Queries per Second (live) is one of hundreds of KPI pulses Vortex IQ tracks across ClickHouse and 70+ other ecommerce connectors. Nerve Centre runs the detection layer; Vortex Mind investigates the cause when something moves; Ask Viq lets you interrogate any number in plain English. Start for free or book a demo to see this metric running on your own data.