Too Many Parts Error or Parts Count >1000 on Any Table, ClickHouse

Card class: Hero • Category: Nerve Centre

At a glance

The single most ClickHouse-distinctive alert in the set. It fires on either of two conditions: a TOO_MANY_PARTS error (code 252) has been thrown in the last 24 hours, or any MergeTree table is sitting above 1,000 active parts right now. Both point at the same failure mode: inserts have outrun background merges, the parts have piled up, and the parts cap is being hit. When that cap is reached, ClickHouse stops accepting inserts on that table entirely, so ingest halts. There is no equivalent in a traditional row store; it is unique to ClickHouse’s part-and-merge storage model, and it is the one failure a ClickHouse DBA must never miss.


Data source	Two signals OR’d together: (1) `system.query_log` / `system.errors` for code 252 (`TOO_MANY_PARTS`) raised in the last 24h, and (2) `system.parts WHERE active` grouped by table for any table above 1,000 active parts.
What it tracks	The combined “ingest is at risk OR already broken” state. The leading indicator (part count climbing) and the lagging indicator (the error actually firing) share one alert so you cannot miss either.
Why it matters	Code 252 is not a soft failure. The hard cap `parts_to_throw_insert` rejects the insert outright, so the affected table stops ingesting. In an analytics pipeline that means the data stops flowing while the storefront keeps generating events: a silent, growing data gap.
Time window	`RT` (real-time; part counts are live state, the error condition looks back 24h).
Alert trigger	`too_many_parts_24h > 0 OR active_parts > 1000`. Either an error in the last day or a table over the part threshold fires the card.
Roles	dba, platform, sre

Calculation

The detector evaluates two conditions and fires if either is true. Condition 1, the error counter (lagging):

SELECT count() AS too_many_parts_24h
FROM system.query_log
WHERE type = 'ExceptionWhileProcessing'
  AND exception_code = 252            -- TOO_MANY_PARTS
  AND event_time >= now() - INTERVAL 24 HOUR

Condition 2, the part-count gauge (leading):

SELECT table, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY table
HAVING active_parts > 1000
ORDER BY active_parts DESC

The alert is too_many_parts_24h > 0 OR max(active_parts) > 1000. The OR is deliberate and important. Part count rising past 1,000 is the early warning: the backlog is building but inserts are still being accepted. Code 252 in the log is the confirmation: an insert has actually been rejected and ingest on that table has stalled. Wiring both into one alert means a DBA is paged when the danger first appears (part count) and is paged again if it actually breaks (error code), with no gap in coverage. The relevant caps live in system.merge_tree_settings: parts_to_delay_insert (soft throttle, default 150 per partition), parts_to_throw_insert (hard cap, default 300 per partition), and max_parts_in_total (table-wide ceiling, default 100,000).

Worked example

A platform team ingests storefront clickstream and order events into ClickHouse. An ingest config change shipped on the evening of 17 Apr 26 quietly reduced the batch size. Snapshot of the alert state on 18 Apr 26 at 02:15 BST.

Condition	Value	Status
`too_many_parts_24h` (code 252 count)	37	Breached
`events.clickstream_raw` active parts	1,420	Breached
`events.order_events` active parts	880	Approaching
`parts_to_throw_insert` (per partition)	300	Config

The Nerve Centre headline reads 37 TOO_MANY_PARTS errors in 24h, clickstream_raw at 1,420 parts, outlined red. The DBA reads the state:

Ingest on clickstream_raw is already broken. 37 code-252 errors means 37 insert batches were rejected outright in the last day. Every rejected batch is storefront event data that did not land. The data gap is growing in real time.
The cause is the batch-size change. The errors began climbing the same evening as the ingest config ship. Smaller batches mean more parts per unit of data, the part count crossed the per-partition parts_to_throw_insert cap, and inserts started bouncing.
order_events is next. At 880 parts and climbing, it is on the same trajectory. If the root cause is not fixed, it breaches within the hour and a second table stops ingesting.

Immediate response:
  1. Confirm the gap: how much data is missing?
       SELECT max(event_time) FROM events.clickstream_raw   -- last row that landed
     Compare against storefront's last-known event time.
  2. Stop the bleeding: revert the batch-size change so inserts batch large again.
  3. Drain the backlog: raise background_pool_size to merge faster, OR
       OPTIMIZE TABLE events.clickstream_raw PARTITION '2026-04-18' FINAL
       (heavy; target the hot partition only, off-peak).
  4. Backfill: replay the rejected batches from the ingest buffer / source once
     part count drops back below the throw threshold.

The recovery order matters: stop creating new parts first (revert the batch change), then drain the existing backlog (merges or a targeted OPTIMIZE), then backfill the lost data. Running OPTIMIZE FINAL while the over-inserting job is still hammering the table just fights the inflow and wastes I/O. Three takeaways:

Code 252 means data loss, not just an error. Unlike a slow query, a rejected insert drops data on the floor unless your pipeline buffers and retries. Treat every nonzero too_many_parts_24h as a potential data-gap incident and check what is missing.
The part-count condition is your early warning; use it. If you act while only the >1000 part condition is breached and the error count is still zero, you prevent the outage entirely. Once code 252 appears you are in recovery, not prevention.
The fix is almost always insert shape. Smaller batches, higher insert frequency, or a too-granular partition key. Batch larger, use async_insert, or coarsen the partition key. Tuning merges only buys time.

Sibling cards

Card	Why pair it with this alert	What the combination tells you
Active Parts (Top 10 Tables)	The per-table part-count detail behind condition 2.	Tells you exactly which table is over threshold and by how much.
Too Many Parts Errors (24h)	The raw code-252 counter behind condition 1.	The absolute count of rejected inserts: the size of the data gap.
Merges In Progress	The merge throughput draining the backlog.	Low merges plus high parts equals the scheduler is starved; raise `background_pool_size`.
Inserts per Second (live)	The inflow creating parts.	High inserts/sec at small batch size is the classic cause.
Partition Count (Top 10 Tables)	The partition-granularity view.	Many partitions equals a too-granular partition key spreading parts thin.
Query Error Rate Spike (>1% in 5m)	The broader error feed where code 252 also surfaces.	An error spike dominated by code 252 confirms ingest, not queries, is the fault.
ClickHouse Health Score	The composite that weights ingest health heavily.	Any active code 252 pulls the composite sharply down.

Reconciling against the source

Where to look in ClickHouse’s own tooling:

Check the error count and the live part count directly in clickhouse-client:
-- condition 1: errors in the last 24h
SELECT count() FROM system.query_log
WHERE exception_code = 252 AND event_time >= now() - INTERVAL 24 HOUR;

-- condition 2: tables over the part threshold
SELECT table, count() FROM system.parts WHERE active
GROUP BY table HAVING count() > 1000 ORDER BY count() DESC;
Read the live caps with SELECT name, value FROM system.merge_tree_settings WHERE name LIKE 'parts_to%'. Watch active merges draining the backlog in system.merges. On ClickHouse Cloud, all of these system tables are available in the SQL console; the managed monitoring view also flags part-count and ingest-error conditions.

Why our number may legitimately differ from a manual check:

Reason	Direction	Why
24h window edge	Either	The error count uses a rolling 24h window; a manual query a few minutes apart includes or drops events near the boundary.
active filter	Manual higher if omitted	A part-count query without `WHERE active` counts inactive merge leftovers and overstates the backlog. The card always filters to active parts.
Replica scope	Card may be higher	On a cluster the card aggregates the configured node; a single-replica query understates total parts. Use `clusterAllReplicas`.
log_queries sampling	Card lower	If `log_queries_probability < 1`, code-252 events are sampled out of `system.query_log` and the error count undercounts. Set it to 1.

Cross-connector reconciliation:

Card	Expected relationship	What causes divergence
ClickHouse Event Ingest vs Ecom Orders	When this alert fires, event ingest stalls while storefront orders keep flowing, opening a visible divergence.	The two lines splitting apart is the storefront-visible symptom of code 252: orders happening, events not landing.

Known limitations / FAQs

What actually happens when code 252 fires? Is data lost? ClickHouse rejects the insert outright with the TOO_MANY_PARTS exception. The insert does not complete and the rows are not stored. Whether that becomes permanent data loss depends on your pipeline: if the ingest layer buffers and retries (a Kafka consumer, a queue with at-least-once delivery), the batch is replayed once the table can accept inserts again and nothing is lost. If the producer fires and forgets, those rows are gone. Treat every code 252 as a potential data-gap incident and verify against the source. Why are there two conditions in one alert? Because one is prevention and the other is recovery. The part count crossing 1,000 is the leading indicator: the backlog is building but inserts still work, so you have time to fix it before any data is lost. Code 252 appearing is the lagging indicator: an insert has already been rejected. OR-ing them means you are paged early enough to prevent the outage and, if you miss that window, paged again when it actually breaks. Can a table hold more than 1,000 active parts without ever throwing code 252? Yes, because the hard cap parts_to_throw_insert is per partition, not per table. A table partitioned by day across many days can hold well over 1,000 active parts in total while no single partition exceeds 300. That is exactly why this alert OR’s the table-level part count with the error code: the part count is an aggregate early warning, and the error code is the precise per-partition failure. Read both together. The error count is zero but a table shows 1,400 parts. Why am I being paged? Because you are about to break. Zero errors plus a table over 1,000 parts means the backlog has built but the per-partition cap has not yet been hit on any single partition. This is the best possible time to be paged: act now (batch larger, raise merge concurrency) and you avoid code 252 entirely. If you wait, the next busy partition crosses the cap and ingest stops. How do I clear it once it has fired? In order: (1) stop creating new parts, revert whatever increased insert frequency or shrank batch size; (2) drain the backlog, raise background_pool_size so more merges run, or run OPTIMIZE TABLE ... PARTITION ... FINAL on the hot partition off-peak; (3) once part count drops below the throw threshold, replay any rejected batches to backfill the gap. Do not run OPTIMIZE while the over-inserting job is still active; fix the inflow first. Is this really unique to ClickHouse? Effectively yes. The part-and-merge storage model, where every insert writes a new immutable part and a background process merges them, is specific to the MergeTree family. Traditional row stores (Postgres, MySQL) and many other column stores do not have a per-table parts cap that rejects inserts, so they have no direct equivalent. This is why batching inserts is a first-class concern on ClickHouse in a way it is not elsewhere, and why this alert has no sibling on other connectors.

Tracked live in Vortex IQ Nerve Centre

Too Many Parts Error or Parts Count >1000 on Any Table is one of hundreds of KPI pulses Vortex IQ tracks across ClickHouse and 70+ other ecommerce connectors. Nerve Centre runs the detection layer; Vortex Mind investigates the cause when something moves; Ask Viq lets you interrogate any number in plain English. Start for free or book a demo to see this metric running on your own data.

​At a glance

​Calculation

​Worked example

​Sibling cards

​Reconciling against the source

​Known limitations / FAQs

​Tracked live in Vortex IQ Nerve Centre