At a glance
How unevenly data is spread across the shards of a sharded cluster, expressed as a percentage. In a healthy cluster the balancer keeps roughly the same number of chunks on every shard, so any one shard carries a fair share of the read and write load. This gauge measures the gap between the busiest shard and the quietest one: it takes the shard with the most chunks, subtracts the shard with the fewest, and divides by the total chunk count. A low value means data is evenly distributed; a high value means one shard is carrying far more than its share and has become a hot shard. The card turns red at >20%: at that point a single shard is doing disproportionate work, which shows up as uneven latency, one over-loaded mongod, and wasted capacity on the idle shards.

| Field | Detail |
|---|---|
| What it tracks | The distribution imbalance of chunks across all shards in a sharded cluster, as a live gauge from 0% (perfectly even) upward. Only relevant to sharded clusters; a single replica set has no shards to compare. |
| Data source | (max chunks per shard - min chunks per shard) / total chunks. Chunk counts are read from the config.chunks collection on the config servers, grouped by shard. max is the chunk count on the busiest shard, min is the count on the quietest, and total is the sum across all shards. |
| Time window | RT (real-time). The gauge reflects the current chunk distribution as last reported by the config servers; chunk metadata changes only when the balancer moves a chunk or a split occurs, so the value is stable between balancer rounds. |
| Alert trigger | >20%. A skew above 20% raises a sensitivity alert: one shard holds enough of an excess that it has effectively become a hot shard and the cluster is no longer balanced. |
| What counts | All shards listed in config.shards, including drained or draining shards until they leave the cluster. The figure is computed over every sharded collection’s chunks combined, so a per-shard total reflects all the data that shard owns. |
| What does NOT count | Unsharded collections (these live entirely on the primary shard and are not part of chunk balancing), jumbo chunks that cannot be split or moved (they inflate one shard but are flagged separately), and zones / tag ranges that intentionally pin data to specific shards. |
| Roles | owner, platform, sre, dba |
Calculation
The gauge is a single ratio computed from chunk counts held on the config servers:

skew = (max_chunks_per_shard - min_chunks_per_shard) / total_chunks

The source is config.chunks, the authoritative metadata that the config servers maintain for every chunk in the cluster. Vortex IQ groups that collection by shard, counts the chunks on each, then takes the highest count (max_chunks_per_shard), the lowest count (min_chunks_per_shard), and the sum (total_chunks).
A worked sense of the scale:
- Three shards holding 100 / 100 / 100 chunks: skew is (100 - 100) / 300 = 0%. Perfectly balanced.
- Three shards holding 140 / 90 / 70 chunks: skew is (140 - 70) / 300 = 23.3%. Above the 20% threshold; shard A is hot.
- Two shards holding 250 / 50 chunks: skew is (250 - 50) / 300 = 66.7%. Severely skewed; the cluster is barely sharding at all.
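
The three cases above can be reproduced with a few lines of JavaScript. This is a minimal sketch rather than Vortex IQ's implementation: the function name shardBalanceSkew and its input (a plain array with one chunk count per shard) are illustrative, and returning 0 for fewer than two shards or for an empty cluster is an assumption.

```javascript
// Sketch: skew = (max - min) / total, expressed as a percentage.
// `chunkCounts` is a hypothetical array holding one chunk count per shard.
function shardBalanceSkew(chunkCounts) {
  if (chunkCounts.length < 2) return 0; // assumption: nothing to compare
  const total = chunkCounts.reduce((sum, n) => sum + n, 0);
  if (total === 0) return 0; // assumption: no chunks yet, avoid dividing by zero
  const max = Math.max(...chunkCounts);
  const min = Math.min(...chunkCounts);
  return ((max - min) / total) * 100;
}

console.log(shardBalanceSkew([100, 100, 100]).toFixed(1)); // 0.0
console.log(shardBalanceSkew([140, 90, 70]).toFixed(1));   // 23.3
console.log(shardBalanceSkew([250, 50]).toFixed(1));       // 66.7
```

Note that the denominator is the total chunk count, not the mean per shard, so adding shards while keeping the same gap lowers the percentage.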
Worked example
A platform team runs a four-shard cluster behind a high-traffic catalogue and session service. Snapshot taken on 14 Apr 26 at 09:20 BST during the morning ramp.

| Shard | Chunks owned | Share of total | Notes |
|---|---|---|---|
| shard-rs-A | 612 | 41% | Primary shard, also holds the unsharded collections |
| shard-rs-B | 318 | 21% | Healthy |
| shard-rs-C | 296 | 20% | Healthy |
| shard-rs-D | 264 | 18% | Newest shard, still filling |
| Total | 1,490 | 100% | |
(612 - 264) / 1,490 = 23.4%. The gauge reads 23.4% and turns red, just over the 20% threshold.
The team reads three things from this:
- Shard A is hot. At 612 chunks it carries more than twice what shard D carries. Because shard A is also the primary shard (where unsharded collections live), it is absorbing both its fair share of sharded data and all the unsharded traffic. Its mongod CPU and WiredTiger cache pressure will be the highest of the four, and any query that fans out across shards will be gated by shard A’s response time.
- Shard D was added recently and is still filling. The balancer moves chunks gradually (one at a time per shard pair, by default), so a freshly added shard takes hours to days to reach parity depending on chunk size and the balancing window. Some of this skew is therefore expected and self-correcting: the question is whether it is trending down.
- The action depends on the trend, not the snapshot. A single 23% reading the day after adding a shard is normal. The same 23% reading that has not moved in a week means the balancer is stuck (see Chunks Pending Migration) or the balancing window is too narrow to catch up. The team checks whether the balancer is enabled and whether the migration queue is draining.
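
The "trend, not snapshot" rule in the last point can be sketched as a small classifier over a series of skew readings. Everything here is illustrative: the function name classifySkewTrend, the readings array (oldest first, in percent), and the FLAT_TOLERANCE value are assumptions, and only the 20% threshold comes from the card.

```javascript
// Sketch: classify a series of skew readings (oldest first, in percent).
const THRESHOLD = 20;       // the card's alert threshold
const FLAT_TOLERANCE = 0.5; // hypothetical: percentage points treated as "no movement"

function classifySkewTrend(readings) {
  if (readings.length === 0) return "no data";
  const first = readings[0];
  const last = readings[readings.length - 1];
  if (last <= THRESHOLD) return "within threshold";
  // Above threshold: the question is whether the value is coming down.
  if (first - last > FLAT_TOLERANCE) return "above threshold, falling";
  return "above threshold, not falling";
}

// A shard added recently: skew is high but dropping, so leave it alone.
console.log(classifySkewTrend([33.1, 29.4, 25.0, 23.4])); // above threshold, falling
// The same 23.4% that has not moved: check the balancer and the migration queue.
console.log(classifySkewTrend([23.4, 23.5, 23.4, 23.4])); // above threshold, not falling
```

Comparing only the first and last reading keeps the sketch short; a real check would likely look at the whole series and at the time span it covers.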
Sibling cards
| Card | Why pair it with Shard Balance Skew | What the combination tells you |
|---|---|---|
| Chunks Pending Migration | The balancer’s work queue. | High skew with a large pending queue equals “balancer is overloaded and catching up”. High skew with an empty queue equals “balancer thinks it is done; the skew is structural (shard key or jumbo chunks)”. |
| Operations per Second (live) | The actual load on the cluster. | Confirms whether the chunk imbalance is causing a real load imbalance or is just a cosmetic count gap on a lightly used cluster. |
| Replica Set Members (state) | Per-shard health, since each shard is itself a replica set. | A hot shard whose primary is also under election pressure is a compounding risk; check both. |
| Replica Lag (seconds) | Replication health on the busiest shard. | A hot shard’s secondaries are the first to fall behind under write pressure. |
| WiredTiger Cache Hit Rate % | Cache pressure on the over-loaded shard. | The hot shard’s cache hit rate drops first because it is serving more working set than its peers. |
| Query Latency p95 (ms) | Tail latency, which a hot shard inflates. | Scatter-gather queries are gated by the slowest shard, so skew shows up as p95 latency before it shows up in the average. |
| MongoDB Health Score | The composite that takes sharding balance as an input. | A sustained skew above 20% drags the composite health score down. |
Reconciling against the source
Where to look in MongoDB’s own tooling:
- sh.status() in mongosh against a mongos router prints the full sharding summary, including the per-shard chunk distribution for every sharded collection. This is the canonical view of what the gauge is computing.
- db.getSiblingDB("config").chunks.aggregate([{$group:{_id:"$shard",n:{$sum:1}}}]) against the config database gives you the exact per-shard chunk counts Vortex IQ uses, so you can reproduce the max, min, and total by hand.
- sh.getBalancerState() and sh.isBalancerRunning() confirm whether the balancer is enabled and currently moving chunks, which explains whether a high skew is being actively corrected.

On MongoDB Atlas, the cluster’s Metrics tab does not surface chunk skew as a single number, but the per-shard breakdown of operations, connections, and disk usage will show the same imbalance from the load side. Atlas also exposes the balancer state in the cluster configuration.

Why our number may legitimately differ from sh.status():

| Reason | Direction | Why |
|---|---|---|
| Balancer round in progress | Brief discrepancy | If a chunk migration commits between our read of config.chunks and your sh.status(), the counts shift by one chunk per migration. |
| Per-collection vs cluster-wide | Variable | sh.status() prints skew per collection; our gauge aggregates all sharded collections into one cluster-wide figure. A single noisy collection can dominate. |
| Jumbo chunks | Our value may read higher | Jumbo chunks cannot be moved, so they pin count to one shard. They count toward the skew but the balancer will never resolve them; sh.status() flags them with a jumbo marker. |
| Draining shards | Transient spike | A shard being removed shows a falling count as its chunks evacuate; the skew temporarily widens then collapses to the remaining shards. |
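
The per-collection versus cluster-wide difference is easiest to see with numbers. The sketch below uses made-up rows, one per (collection, shard) pair; the collection names, the row shape, and the helper names are all hypothetical, and a shard that holds zero chunks of a collection would have to be listed with chunks: 0 for the minimum to be correct.

```javascript
// Hypothetical input: one row per (collection, shard) pair with a chunk count.
const rows = [
  { collection: "shop.catalogue", shard: "shard-rs-A", chunks: 300 },
  { collection: "shop.catalogue", shard: "shard-rs-B", chunks: 100 },
  { collection: "shop.sessions", shard: "shard-rs-A", chunks: 20 },
  { collection: "shop.sessions", shard: "shard-rs-B", chunks: 20 },
];

// (max - min) / total as a percentage; 0 when there is nothing to compare.
function skewPercent(counts) {
  const total = counts.reduce((sum, n) => sum + n, 0);
  if (counts.length < 2 || total === 0) return 0;
  return ((Math.max(...counts) - Math.min(...counts)) / total) * 100;
}

// Sum chunk counts per shard and return the per-shard totals as an array.
function chunksPerShard(someRows) {
  const totals = new Map();
  for (const row of someRows) {
    totals.set(row.shard, (totals.get(row.shard) || 0) + row.chunks);
  }
  return [...totals.values()];
}

// Cluster-wide figure: all sharded collections combined, as the gauge does.
const clusterWide = skewPercent(chunksPerShard(rows));

// Per-collection figures: closer to how sh.status() lays the data out.
const perCollection = {};
for (const name of new Set(rows.map((r) => r.collection))) {
  perCollection[name] = skewPercent(
    chunksPerShard(rows.filter((r) => r.collection === name))
  );
}

console.log(clusterWide.toFixed(1)); // 45.5
console.log(perCollection);          // catalogue at 50, sessions at 0
```

Here one large, lopsided collection accounts for almost the whole cluster-wide figure while the other collection is perfectly even, which is the "single noisy collection can dominate" case from the table.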
Known limitations / FAQs
My cluster is a single replica set, not sharded. What does this card show?
Nothing meaningful. Shard balance skew only applies to sharded clusters with two or more shards. On a single replica set there is one data-bearing topology and no chunks to compare, so the card reports zero or is hidden. If you expect this card to populate, confirm the connector is pointed at a mongos router and that the cluster actually has more than one shard via sh.status().
The skew jumped to 25% right after I added a new shard. Is something wrong?
No, that is expected and self-correcting. A newly added shard starts with zero chunks, so the gap between the fullest and emptiest shard is at its widest the moment the shard joins. The balancer then migrates chunks to it gradually, and the skew falls over the following hours. Watch the trend: if it is dropping, leave it alone. Only worry if it has not moved for a day or more.
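
A toy simulation shows why the reading starts high and then falls. This is a deliberately simplified model, not the real balancer: it assumes four existing shards with 100 chunks each, a new shard with none, and one chunk moved per round from the fullest shard to the emptiest. The function names are illustrative.

```javascript
// (max - min) / total as a percentage; 0 when there is nothing to compare.
function skewPercent(counts) {
  const total = counts.reduce((sum, n) => sum + n, 0);
  if (counts.length < 2 || total === 0) return 0;
  return ((Math.max(...counts) - Math.min(...counts)) / total) * 100;
}

// Toy model (an assumption): each round moves one chunk from the fullest
// shard to the emptiest shard, recording the skew after every move.
function simulateRebalance(counts, rounds) {
  const shards = [...counts];
  const history = [skewPercent(shards)];
  for (let i = 0; i < rounds; i++) {
    const from = shards.indexOf(Math.max(...shards));
    const to = shards.indexOf(Math.min(...shards));
    if (shards[from] - shards[to] <= 1) break; // as even as it can get
    shards[from] -= 1;
    shards[to] += 1;
    history.push(skewPercent(shards));
  }
  return history;
}

const history = simulateRebalance([100, 100, 100, 100, 0], 80);
console.log(history[0].toFixed(1));                  // 25.0, the moment the shard joins
console.log(history[40].toFixed(1));                 // 12.5, halfway through
console.log(history[history.length - 1].toFixed(1)); // 0.0, once every shard holds 80
```

How long each round takes in practice depends on chunk size, load, and the balancing window, so the model says nothing about wall-clock time; it only illustrates the direction of travel.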
Skew is high but Chunks Pending Migration is zero. Why is the balancer not fixing it?
This is the signature of a structural problem rather than balancer lag. An empty queue means the balancer sees nothing it is able or allowed to move, even though the chunk counts are still uneven. The usual causes are jumbo chunks (chunks too large to split or move), a poor shard key that concentrates data into a few ranges, or zones / tag ranges that intentionally pin data. Check for jumbo chunks in sh.status() and review the shard key cardinality.
Does the skew account for how busy each shard is, or only how much data it holds?
Only how much data, measured in chunk count. A shard can be perfectly balanced by chunk count yet hot by traffic if a small number of chunks receive most of the operations (a hot range). That is why this card is paired with Operations per Second (live) and the latency cards: chunk skew is the structural signal, operation rate is the load signal, and you need both to diagnose a hot shard fully.
Can I change the 20% alert threshold?
Yes. The 20% trigger is the generic default for a hot-shard warning. Sensitivity thresholds are configurable per profile in the Sensitivity tab. Clusters that run intentionally lopsided (for example zone-pinned data for regional compliance) may want a higher threshold so the alert does not fire on a deliberate imbalance.
Is the balancing window affecting my skew?
It can. If you have configured a balancing window (a time range during which the balancer is allowed to run), the cluster only rebalances inside that window. A high skew during business hours that only corrects overnight is the expected behaviour of a narrow window. If skew routinely breaches 20% during the day and only resolves at night, either widen the window or accept the daytime imbalance as a known trade-off.
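
To reason about whether a given reading was taken inside or outside the window, a small helper is enough. The sketch below assumes the window is given as two HH:MM strings, mirroring the activeWindow start and stop values MongoDB keeps in the balancer settings document; the function names are illustrative, and treating the start as inclusive and the stop as exclusive is an assumption.

```javascript
// Convert an "HH:MM" string to minutes since midnight.
function toMinutes(hhmm) {
  const [hours, minutes] = hhmm.split(":").map(Number);
  return hours * 60 + minutes;
}

// Sketch: is `now` inside the balancing window [start, stop)?
// Assumption: start is inclusive, stop is exclusive.
function isInsideWindow(now, start, stop) {
  const t = toMinutes(now);
  const s = toMinutes(start);
  const e = toMinutes(stop);
  if (s <= e) return t >= s && t < e; // same-day window, e.g. 01:00 to 05:00
  return t >= s || t < e;             // window wraps midnight, e.g. 23:00 to 06:00
}

// With an overnight window, a 09:20 snapshot falls outside it,
// so a high daytime skew is not being corrected at that moment.
console.log(isInsideWindow("09:20", "23:00", "06:00")); // false
console.log(isInsideWindow("02:00", "23:00", "06:00")); // true
```

The wrap-around branch matters because overnight windows, where the start time is later than the stop time, are the common case for clusters that keep migrations away from business hours.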