Database Disk Usage %, gauge - Vortex IQ Help Centre

Card class: Hero • Category: Executive Overview

At a glance

The percentage of provisioned data volume currently consumed by the PostgreSQL instance: table data, indexes, WAL, temporary files, and the catalogue. For a platform team this is the single most unforgiving capacity number on the board. PostgreSQL does not gracefully degrade when the data disk fills: once the volume hits 100%, the database refuses new writes, autovacuum cannot reclaim space, and on many managed services the instance is forced into a read-only or recovery state. This card is the early-warning gauge that keeps you ahead of that wall.


What it tracks	Used bytes divided by total provisioned bytes on the volume that holds the PostgreSQL data directory (`PGDATA`), expressed as a percentage. Includes heap, indexes, the WAL directory (`pg_wal`), temporary files, and catalogue bloat.
Data source	“Database Disk Usage % for the selected period.” On a self-managed host the engine reads filesystem stats for the `PGDATA` mount and cross-checks against `pg_database_size()` summed across databases plus `pg_wal` size. On Amazon RDS / Aurora it reads the CloudWatch `FreeStorageSpace` metric against allocated storage. On Cloud SQL it reads `database/disk/bytes_used` against `database/disk/quota`.
Time window	`RT` (real-time, refreshed on the live polling cycle, typically every 60 seconds).
Alert trigger	`> 90%`. Crossing 90% pages the on-call DBA. This is deliberately aggressive: the gap between 90% and a write-stopping 100% can be minutes on a busy write-heavy instance or during a runaway `pg_wal` build-up.
Threshold basis	Percentage of provisioned volume, not of any soft quota. The gauge turns amber approaching the threshold and red at breach.
What does NOT count	Storage used by sibling instances, read replicas on separate volumes, backups stored off-volume (S3, GCS), and snapshot storage. Those are billed and tracked elsewhere; this gauge is the live data volume only.
Roles	owner, engineering, operations

Calculation

The gauge is used_bytes / total_provisioned_bytes * 100, sampled on the real-time cycle. On a self-managed instance the engine derives used_bytes two ways and reconciles them:

Filesystem-level: statvfs on the PGDATA mount point gives total and available; used = total - available. This is authoritative because it captures everything on the volume, including temp files and any non-PostgreSQL data sharing the mount.
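PostgreSQL-level: the sum of pg_database_size(datname) across all databases, plus the size of pg_wal, plus temporary file usage from pg_stat_database.temp_bytes (a cumulative counter since the last statistics reset, so it indicates how much has spilled over time rather than the bytes on disk at this instant). This is what PostgreSQL itself believes it is using.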
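
The filesystem-level arithmetic can be sketched in a few lines of Python. This is a minimal illustration of used = total - available, not the engine's actual implementation; the function names are hypothetical, and the path you pass in is assumed to sit on the PGDATA mount.

```python
import os


def usage_percent(total_bytes: int, available_bytes: int) -> float:
    """Return used space as a percentage of the provisioned total."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    used_bytes = total_bytes - available_bytes
    return used_bytes / total_bytes * 100


def mount_usage_percent(path: str) -> float:
    """Read statvfs for the filesystem holding `path` and compute usage.

    f_bavail is the space available to unprivileged processes, so blocks
    reserved for root are counted as used by this calculation.
    """
    st = os.statvfs(path)
    total_bytes = st.f_blocks * st.f_frsize
    available_bytes = st.f_bavail * st.f_frsize
    return usage_percent(total_bytes, available_bytes)
```

Calling mount_usage_percent with the PGDATA directory would return the headline percentage for that volume, under the assumptions above.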

The headline gauge uses the filesystem figure because that is what actually fills. The PostgreSQL-level figure is retained so the drill-down can attribute growth to heap, indexes, WAL, or temp files. On managed services there is no filesystem access, so the engine uses the provider’s own storage metric:

RDS / Aurora: 100 - (FreeStorageSpace / AllocatedStorage * 100). Aurora auto-scales storage, so the gauge there reflects used against the current allocated ceiling, which is itself elastic.
Cloud SQL: database/disk/bytes_used / database/disk/quota * 100.
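
The two managed-service formulas can be written out as below. This is a hedged sketch: it assumes CloudWatch reports FreeStorageSpace in bytes while RDS allocated storage is expressed in GiB, so the two must be brought to the same unit before dividing. The function and parameter names are illustrative, not part of any provider SDK.

```python
GIB = 1024 ** 3  # bytes per GiB


def rds_usage_percent(free_storage_bytes: float, allocated_storage_gib: float) -> float:
    """100 - (FreeStorageSpace / AllocatedStorage * 100), with units aligned.

    Assumption: free space arrives in bytes, allocated storage in GiB.
    """
    allocated_bytes = allocated_storage_gib * GIB
    return 100 - (free_storage_bytes / allocated_bytes * 100)


def cloud_sql_usage_percent(bytes_used: float, quota_bytes: float) -> float:
    """database/disk/bytes_used / database/disk/quota * 100."""
    return bytes_used / quota_bytes * 100
```

Skipping the unit conversion is an easy mistake to make: dividing bytes by GiB produces a ratio that is off by a factor of roughly a billion.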

WAL deserves special attention. A stuck replication slot, a failing archive_command, or a long-running base backup can cause pg_wal to grow without bound while the rest of the database is quiet. Because WAL lives on the data volume by default, a WAL blow-up shows up here first and can be the difference between 70% and 100% within the hour.
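
To put a number on how much WAL a slot is pinning, compare its restart_lsn with the current WAL position. The sketch below does that arithmetic in Python on LSN strings (the same subtraction PostgreSQL performs with pg_wal_lsn_diff). The helper names and the example LSN values in the usage note are hypothetical.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position.

    An LSN is two hexadecimal numbers: the high 32 bits and the low 32 bits
    of a 64-bit position in the WAL stream.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_wal_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL between a slot's restart_lsn and the current position."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)
```

For instance, retained_wal_bytes("C/0", "0/0") is 12 * 2**32 bytes, or 48 GiB: a slot that far behind would be holding that much WAL on the data volume.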

Worked example

A platform team runs a self-managed PostgreSQL 15 primary on a 500 GB gp3 volume backing an order-management service. Snapshot taken on 14 Apr 26 at 02:10 BST during the overnight batch window.

Component	Size	Share of volume
Heap (table data)	268 GB	53.6%
Indexes	121 GB	24.2%
`pg_wal`	64 GB	12.8%
Temp files (active sort spill)	19 GB	3.8%
Catalogue + misc	6 GB	1.2%
Used total	478 GB	95.6%
Free	22 GB	4.4%
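
The shares in the table are each component's size divided by the 500 GB volume. A short Python check of that arithmetic, using only the figures from the table (variable names are illustrative):

```python
VOLUME_GB = 500

# Component sizes in GB, copied from the table above.
components_gb = {
    "Heap (table data)": 268,
    "Indexes": 121,
    "pg_wal": 64,
    "Temp files": 19,
    "Catalogue + misc": 6,
}

used_gb = sum(components_gb.values())
free_gb = VOLUME_GB - used_gb

# Each component's share of the provisioned volume, to one decimal place.
shares = {
    name: round(size / VOLUME_GB * 100, 1)
    for name, size in components_gb.items()
}
used_pct = round(used_gb / VOLUME_GB * 100, 1)

print(f"used {used_gb} GB ({used_pct}%), free {free_gb} GB")
```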

The gauge reads 95.6% in red and the alert has already paged the on-call DBA, because 90% was crossed at 01:54. Reading the drill-down, two things stand out:

pg_wal at 64 GB is roughly 4x its steady-state size of around 16 GB. A quick check of pg_replication_slots shows a slot named analytics_cdc with a restart_lsn far behind the current WAL position: the downstream change-data-capture consumer has been down since 22:30, so PostgreSQL is retaining every WAL segment since then.
Temp files at 19 GB come from an overnight reporting query spilling a large sort to disk because work_mem is too small for it.

Triage decision tree the DBA follows:
  1. Is the volume about to hit 100%?  Yes (95.6%, growing ~13 GB/hour from WAL retention: ~48 GB retained since 22:30).
  2. Fastest safe reclaim?  Restore the analytics_cdc consumer OR, if it cannot
     be recovered quickly, drop the orphaned replication slot:
        SELECT pg_drop_replication_slot('analytics_cdc');
     This releases ~48 GB of retained WAL within one checkpoint cycle.
  3. Second reclaim: kill the overnight report spilling temp files, OR raise
     work_mem for that session so it sorts in memory.
  4. Medium term: expand the gp3 volume from 500 GB to 750 GB (online, no
     downtime) to restore headroom while the heap keeps growing.
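
Steps 1 and 2 of the tree come down to two small calculations: how long until the volume fills at the current growth rate, and where the gauge lands after a reclaim. A minimal sketch, with hypothetical function names and with the growth rate left as an input you measure yourself:

```python
def hours_until_full(free_gb: float, growth_gb_per_hour: float) -> float:
    """Hours of headroom left at a constant growth rate."""
    if growth_gb_per_hour <= 0:
        return float("inf")  # not growing, so no deadline
    return free_gb / growth_gb_per_hour


def usage_after_reclaim(used_gb: float, total_gb: float, reclaimed_gb: float) -> float:
    """Projected usage percentage once `reclaimed_gb` has been released."""
    return (used_gb - reclaimed_gb) / total_gb * 100
```

With 22 GB free, hours_until_full gives the deadline for any measured rate, and usage_after_reclaim(478, 500, 48) projects the effect of releasing the retained WAL before anyone drops the slot.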

After dropping the orphaned slot and the next checkpoint, pg_wal returns to 17 GB and the gauge falls to 86.2% (431 GB of 500 GB), back under the alert threshold. The platform team then files a follow-up to add a separate monitor on replication-slot lag so an idle consumer never silently fills the data disk again. Three lessons platform teams should carry from this:

Disk usage is not just table growth. The scary, fast movements come from WAL retention and temp-file spill, not from the heap creeping up. When this gauge jumps suddenly, check pg_wal and temp files before assuming you simply need more storage.
90% is a real deadline, not a vanity threshold. A write-heavy primary can close the last 10% in well under an hour. The aggressive alert exists so you have time to act, not to nag.
Reclaim before you resize. Expanding a volume is the right medium-term fix, but the immediate move is almost always to release retained WAL (orphaned slots, broken archiving) or kill a runaway temp spill. Those reclaim space in minutes; a resize plus rebalance can take longer.

Sibling cards

Card	Why pair it with Database Disk Usage %	What the combination tells you
PostgreSQL Health Score	The composite that folds disk pressure into one executive number.	A red disk gauge is one of the fastest ways to drag the composite below 70.
WAL Lag Bytes (primary to standby)	WAL build-up is the most common cause of sudden disk growth.	Rising WAL lag plus rising disk usage equals a stuck slot or broken archiving retaining segments.
Last Successful Backup (hours ago)	A long-running or hung base backup can pin WAL and inflate disk.	Stale backup plus rising disk equals investigate the backup pipeline first.
Oldest Autovacuum Age (hours)	Vacuum reclaims dead-tuple space; starved vacuum bloats the heap.	High vacuum age plus high disk equals bloat is part of the problem, not just live data.
Top Tables by Dead Tuples	Pinpoints which tables are wasting volume to bloat.	The worst offenders here are where a `VACUUM FULL` or repack will reclaim the most space.
Memory Usage %	Low `work_mem` forces sorts to spill to disk as temp files.	Memory pressure plus disk pressure equals temp-file spill is consuming the volume.
Replication Lag (seconds)	A lagging standby holds back WAL recycling on the primary.	Lag plus disk growth confirms the standby is the reason WAL is not being cleared.

Reconciling against the source

Where to look in PostgreSQL and the host:

Filesystem truth (self-managed): df -h $PGDATA on the host shows the volume the gauge tracks. This is the number that matters when the disk is about to fill.
PostgreSQL’s own view: SELECT pg_size_pretty(sum(pg_database_size(datname))) FROM pg_database; for total logical size, and SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir(); for the WAL directory.
Per-table attribution: SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) FROM pg_stat_user_tables ORDER BY pg_total_relation_size(relid) DESC LIMIT 20;
Managed services: the RDS / Aurora console CloudWatch tab shows FreeStorageSpace; the Cloud SQL console shows storage usage under the instance overview.

Why our number may legitimately differ from the native tooling:

Reason	Direction	Why
Filesystem vs logical size	Vortex IQ may read higher	`df` includes WAL, temp files, filesystem reserved blocks, and any non-PostgreSQL data on the mount; `sum(pg_database_size())` does not. We headline the filesystem figure because that is what fills.
Reserved blocks	Vortex IQ slightly higher	ext4 reserves around 5% of the volume for root by default; that space counts as used against the provisioned total but is invisible to PostgreSQL.
CloudWatch sampling lag	Brief lag on RDS	`FreeStorageSpace` is published at one-minute granularity; during a fast WAL build-up the console may trail the live host by up to a minute.
Aurora elastic storage	Different baseline	Aurora grows storage automatically in 10 GB increments, so the denominator (allocated) moves; the gauge reflects used against the current ceiling, which is not fixed.
Temp file transience	Vortex IQ can spike then fall	A large sort spill inflates used bytes then releases when the query finishes; a snapshot mid-query reads higher than one taken seconds later.
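
The reserved-blocks row can be checked on a self-managed host. statvfs exposes both the free blocks (f_bfree) and the blocks available to unprivileged processes (f_bavail); the difference is the root reserve. The sketch below is illustrative only, with hypothetical function names, and assumes the path passed in sits on the PGDATA mount.

```python
import os


def reserved_share_percent(total_blocks: int, free_blocks: int, available_blocks: int) -> float:
    """Share of the volume held back for root, as a percentage of total blocks."""
    return (free_blocks - available_blocks) / total_blocks * 100


def mount_reserved_percent(path: str) -> float:
    """Reserved share for the filesystem that holds `path`."""
    st = os.statvfs(path)
    return reserved_share_percent(st.f_blocks, st.f_bfree, st.f_bavail)
```

On an ext4 volume with default settings this should come out near 5%; filesystems without a root reserve report zero.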

Cross-source reconciliation:

Source	Expected relationship	What causes divergence
`df -h $PGDATA` (filesystem)	Should match the gauge within a percent	Reserved blocks and rounding; the filesystem is the authority during an incident.
`pg_database_size()` sum + `pg_wal` size	Will read lower than `df`	Temp files, filesystem reserved blocks, and any non-PostgreSQL data on the mount are not in this sum; WAL is missing too if the `pg_wal` term is left out.
RDS `FreeStorageSpace`	`100 - (free / allocated * 100)` should match	One-minute publish lag; Aurora elastic allocation changes the denominator.

Known limitations / FAQs

The gauge says 95% but sum(pg_database_size()) only accounts for 70% of the volume. Where is the rest?
Almost always WAL and temp files, neither of which is counted in pg_database_size(). Run SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir(); to size pg_wal, and check pg_ls_tmpdir() for temp files currently on disk (pg_stat_database.temp_bytes is a cumulative counter, so it shows how much has spilled over time rather than live usage). Filesystem reserved blocks (around 5% on ext4) and any non-PostgreSQL files on the mount make up the remainder. The gauge tracks the filesystem because that is what actually fills.

Why is the alert at 90% and not 95%? It feels early.
Because the last 10% can vanish faster than you can respond. A runaway replication slot or broken archive_command retains WAL on the data volume, and on a busy primary that can add several GB per hour with no warning. The 90% page buys you the time to reclaim space before the database stops accepting writes at 100%.

What actually happens when PostgreSQL hits 100% disk?
The instance can no longer write WAL, so it refuses new write transactions and may panic-shutdown to protect data integrity. Autovacuum cannot run (it needs to write), so you cannot reclaim space the easy way. On RDS the instance enters a storage-full state and may become unavailable. Recovery usually means expanding the volume or, on self-managed hosts, manually freeing space (dropping orphaned slots, clearing temp files) before PostgreSQL will restart cleanly. Avoiding 100% is far cheaper than recovering from it.

On Aurora the gauge looks low even though my workload is huge. Is it broken?
No. Aurora separates compute from a distributed, auto-scaling storage layer that grows in 10 GB increments up to a large ceiling. The gauge shows used against the current allocated amount, which keeps expanding, so it rarely approaches 100% the way a fixed gp3 volume does. On Aurora, watch the absolute storage cost and growth rate rather than the percentage.

Does dropping a large table immediately free disk?
DROP TABLE and TRUNCATE release space back to the filesystem promptly. DELETE does not: it only marks rows dead, and the space is reused by future inserts only after autovacuum processes the table. To return space to the operating system after large deletes you need VACUUM FULL (which takes an exclusive lock and rewrites the table) or an online repack tool such as pg_repack. This is why a table can stay large on disk long after its logical row count drops.

Can I move WAL off the data volume to protect against this?
Yes. Mounting pg_wal on its own volume (via a symlink or the --waldir option at initdb) isolates WAL growth from data growth, so a stuck slot fills the WAL volume rather than stopping all writes on the data volume. Many production deployments do this. The headline gauge then tracks the data volume; the WAL volume is surfaced separately. It is a sound mitigation but does not remove the need to monitor replication slots.

The number jumps up and down by a few percent within minutes. Why so noisy?
Temporary files. Large sorts, hash joins, and index builds spill to disk and release when they finish, so a snapshot taken during a heavy query reads higher than one taken a moment later. If the noise is large, raise work_mem for those workloads so they sort in memory, or schedule heavy reporting off the primary. Persistent growth (not transient spikes) is the signal to act on.

Tracked live in Vortex IQ Nerve Centre

Database Disk Usage % is one of hundreds of KPI pulses Vortex IQ tracks across PostgreSQL and 70+ other ecommerce connectors. Nerve Centre runs the detection layer; Vortex Mind investigates the cause when something moves; Ask Viq lets you interrogate any number in plain English. Start for free or book a demo to see this metric running on your own data.
