Idle-in-Transaction Backends, kpi - Vortex IQ Help Centre

Card class: Sensitivity • Category: Capacity

At a glance

The number of backend connections sitting in the idle in transaction state: a client opened a transaction with BEGIN, did some work, and then stopped, holding the transaction open without committing or rolling back. This is one of PostgreSQL’s most distinctive and dangerous failure modes. An idle-in-transaction backend keeps locks held, pins the transaction horizon so autovacuum cannot reclaim dead tuples, and occupies a connection slot. A handful is normal churn; a sustained pile of them is a leaking application connection pool slowly strangling the database.


What it tracks	The count of backends where `pg_stat_activity.state = 'idle in transaction'` (and the related `idle in transaction (aborted)`). “Stuck transactions hold locks and bloat the pool. PostgreSQL-distinctive risk.”
Data source	`pg_stat_activity`, filtered to `state = 'idle in transaction'`. The drill-down also reports `xact_start` (when the open transaction began), `state_change` (when it went idle), `wait_event`, and the `application_name` so you can attribute the leak to a specific service.
Time window	`RT` (real-time, evaluated on the live polling cycle).
Alert trigger	`> 10 sustained 5m`. More than ten idle-in-transaction backends held continuously for five minutes pages the on-call DBA. The “sustained” qualifier matters: brief spikes during normal commit churn are ignored; a plateau is the signal.
Why it matters in PostgreSQL specifically	Because of MVCC, an open transaction pins the oldest visible transaction ID. Autovacuum cannot remove any dead tuple newer than that horizon, so a single long-idle transaction can block vacuum across the entire cluster, causing table bloat and, in the extreme, transaction-ID wraparound risk.
What does NOT count	Plain `idle` backends (connected but with no open transaction, which are harmless and expected from a pool), `active` backends (running a query right now), and `idle in transaction (aborted)` is counted but flagged separately because it cannot do further work until rolled back.
Roles	owner, engineering, operations

Calculation

The headline is a straight count from pg_stat_activity:

SELECT count(*)
FROM pg_stat_activity
WHERE state IN ('idle in transaction', 'idle in transaction (aborted)');

The alert applies a sustain filter on top: the count must stay above 10 across the full five-minute window before paging, so the card does not fire on the momentary blips that happen as transactions commit and the pool recycles. The drill-down enriches each backend with the metrics that tell you how dangerous it is:

now() - xact_start is the age of the open transaction. A backend idle in transaction for 200 milliseconds is fine; one idle for 45 minutes is holding the vacuum horizon back by 45 minutes.
now() - state_change is how long it has been idle in this transaction specifically (since the last statement finished).
application_name and client_addr attribute the leak to a service or host, which is how you find the offending code path.
pid is what you feed to pg_terminate_backend(pid) if you need to kill it.

The single most important derived figure is the oldest xact_start among all idle-in-transaction backends, because that is the one pinning the cluster-wide vacuum horizon. The engine surfaces it prominently: a high count of short idle transactions is annoying, but one ancient idle transaction is what actually blocks vacuum and bloats your tables.

Worked example

A platform team runs the primary database behind an order-processing API. The API uses an ORM with an application-side connection pool of 40 connections. Snapshot taken on 03 Jun 26 at 14:20 BST after the card paged at 14:13.

`application_name`	Count idle-in-tx	Oldest `xact_start`	Idle for
order-api	23	13:51	29 min
order-api	(same pool)	14:02	18 min
reporting-worker	2	14:18	90 s

The card reads 25 idle-in-transaction backends, well over the threshold of 10, and has been above it for seven minutes. The oldest open transaction started at 13:51, so autovacuum has not been able to advance its horizon for 29 minutes. A check of pg_stat_user_tables confirms n_dead_tup on the busy orders table is climbing steadily because vacuum cannot reclaim anything committed after 13:51. Reading the detail, the pattern is unmistakable: 23 of the 25 belong to a single service, order-api. The team pulls the most recent deploy and finds the culprit: a code change wrapped an external payment-gateway call inside an open database transaction.

The anti-pattern the deploy introduced (simplified):

  BEGIN;
  UPDATE orders SET status = 'paying' WHERE id = $1;
  -- transaction left OPEN while the app awaits the payment gateway
  result = call_payment_gateway(...)   <-- 2 to 30 seconds, sometimes hangs
  UPDATE orders SET status = 'paid' WHERE id = $1;
  COMMIT;

Why it piles up:
  - The gateway call is slow and occasionally stalls.
  - Every in-flight payment holds a transaction open for the call's duration.
  - At peak checkout volume, dozens of payments are in flight simultaneously,
    so dozens of transactions sit idle-in-transaction waiting on the network.
  - The slow payments pin the vacuum horizon and exhaust the pool.

The immediate mitigation and the durable fix:

Stop the bleeding now. Set a guard so these cannot accumulate forever: ALTER SYSTEM SET idle_in_transaction_session_timeout = '30s'; SELECT pg_reload_conf();. Any transaction idle longer than 30 seconds is terminated by PostgreSQL automatically. This protects the database while the code is fixed.
Fix the code. Move the external gateway call outside the transaction. Open a transaction only to write the result: read the order, call the gateway with no open transaction, then BEGIN; UPDATE ...; COMMIT; to record the outcome. Network calls must never sit inside an open database transaction.

After the timeout is applied, the count falls to 3 within a minute as the stalled transactions are reaped, the vacuum horizon advances, and n_dead_tup on orders starts dropping at the next autovacuum pass. Three lessons platform teams should carry:

Idle-in-transaction is almost always an application bug, not a database problem. The fix lives in the code: a transaction left open across a slow operation (a network call, a user think-time, a forgotten commit). The database is the victim, not the cause.
The oldest transaction matters more than the count. Twenty backends idle for two seconds each are a minor pool nuisance. One backend idle for 40 minutes blocks vacuum cluster-wide and is the real emergency. Read the oldest xact_start, not just the headline number.
idle_in_transaction_session_timeout is your seatbelt. Setting a sane timeout (often 30 to 60 seconds for OLTP) means a leaking code path cannot pin the vacuum horizon indefinitely. It is a safety net, not a substitute for fixing the code, but every production primary should have one.

Sibling cards

Card	Why pair it with Idle-in-Transaction Backends	What the combination tells you
Connection Pool Saturation %	Idle-in-tx backends occupy pool slots without doing work.	Rising idle-in-tx plus rising saturation equals the leak is eating your connection budget.
Connections In Use	The raw connection count that idle-in-tx inflates.	Connections high but QPS flat is a classic idle-in-tx fingerprint.
Oldest Autovacuum Age (hours)	A long idle transaction pins the vacuum horizon.	High vacuum age co-occurring with idle-in-tx confirms the transaction is blocking vacuum.
Top Tables by Dead Tuples	Blocked vacuum lets dead tuples accumulate.	Dead tuples climbing while idle-in-tx is high equals bloat caused by the stuck transaction.
Deadlocks (last 5m)	Idle-in-tx backends hold locks, raising contention.	Idle-in-tx plus deadlocks equals long-held locks are colliding with active work.
Query Error Rate %	Aborted idle transactions and lock waits surface as errors.	Error rate rising alongside idle-in-tx points to lock-wait timeouts.
Connection Pool at >90% Saturation	The alert that fires when leaked connections fill the pool.	An entry here often traces straight back to idle-in-transaction leakage.

Reconciling against the source

Where to look in PostgreSQL:

The live count: SELECT count(*) FROM pg_stat_activity WHERE state = 'idle in transaction'; The offenders, oldest first: SELECT pid, application_name, client_addr, state, now() - xact_start AS tx_age, now() - state_change AS idle_for, wait_event FROM pg_stat_activity WHERE state IN ('idle in transaction', 'idle in transaction (aborted)') ORDER BY xact_start; The vacuum horizon they pin: compare the oldest xact_start against SELECT datname, age(datfrozenxid) FROM pg_database; to see the wraparound pressure they are causing. Kill one safely: SELECT pg_terminate_backend(<pid>); rolls back the open transaction and frees the slot. Managed services: RDS / Aurora and Cloud SQL all expose pg_stat_activity directly via a normal client connection; there is no provider abstraction for this metric.

Why our number may legitimately differ from a manual query:

Reason	Direction	Why
Sampling moment	Either way	`pg_stat_activity` is a live snapshot; a count taken at a different instant catches a different set of in-flight transactions, especially under heavy commit churn.
Sustain filter	We may read lower for alerting	The headline count is instantaneous, but the alert only fires when the count stays above 10 for the full five minutes; a manual query catching a transient spike will not match the alert state.
Aborted state inclusion	We may read higher	We count `idle in transaction (aborted)` alongside `idle in transaction`; a query filtering only the former misses the aborted ones.
PgBouncer pooling	Possible mismatch	In transaction-pooling mode PgBouncer multiplexes client transactions onto fewer server connections; what the app sees and what the server reports can differ.

Cross-source reconciliation:

Source	Expected relationship	What causes divergence
`pg_stat_activity` count	Should match the headline at the same instant	Snapshot timing under churn.
Application pool metrics	Should roughly correlate	The app counts checked-out connections; not all checked-out connections are idle-in-transaction, only the leaking ones.
PgBouncer `SHOW POOLS`	Server-side connection view	Transaction pooling decouples client and server connection counts.

Known limitations / FAQs

What is the difference between idle and idle in transaction? A plain idle backend is connected but has no transaction open: it has committed or rolled back its last work and is waiting for the next statement. That is completely normal and expected from a connection pool. An idle in transaction backend has an open transaction (it ran BEGIN or an implicit transaction and then stopped) and is holding locks and pinning the vacuum horizon while doing nothing. The first is harmless; the second is the one this card warns about. Why does one idle transaction cause so much damage? Because of MVCC. PostgreSQL keeps old row versions visible to any transaction that might still need them, determined by the oldest open transaction’s snapshot. An idle-in-transaction backend keeps that horizon frozen, so autovacuum cannot remove any dead tuple created after it started, anywhere in the cluster. One long-idle transaction can therefore bloat every busy table and, left for days, push the database toward transaction-ID wraparound. The damage is wildly out of proportion to the single stuck connection. Should I just kill them with pg_terminate_backend? Killing is a valid emergency action: it rolls back the open transaction and frees the slot. But it treats the symptom, not the cause. If a code path is leaking transactions, the backends will pile right back up. The durable fix is in the application (do not hold a transaction open across slow operations) plus a safety net in the database (idle_in_transaction_session_timeout). Reach for pg_terminate_backend to stop an active emergency, then fix the code. What is a sensible value for idle_in_transaction_session_timeout? For OLTP workloads, 30 to 60 seconds is a common starting point: long enough not to interrupt legitimate transactions, short enough to reap genuine leaks before they pin vacuum for long. Batch or ETL workloads that legitimately hold long transactions may need a higher per-session value set explicitly for those connections. Setting it cluster-wide to a sane default is one of the highest-value protective settings on a production primary. The count spikes briefly to 15 then falls back. Is that a problem? Probably not. Brief spikes during peak commit churn are normal, which is exactly why the alert requires the count to stay above 10 for a sustained five minutes. A momentary blip that clears on its own is the pool doing its job. A plateau that does not clear is the leak you need to act on. Watch for sustained elevation and a climbing oldest xact_start, not transient peaks. Does PgBouncer in transaction-pooling mode hide this problem? It can change where you see it. In transaction-pooling mode, PgBouncer returns a server connection to the pool at transaction end, so an application that leaks a transaction holds a PgBouncer client slot and, behind it, a server connection that stays idle-in-transaction. The server-side count this card reads still rises, but you may also need PgBouncer’s SHOW POOLS and SHOW CLIENTS to attribute it. Transaction pooling does not prevent idle-in-transaction; it just adds a layer to trace through. Can read replicas have idle-in-transaction backends? Yes. Read-only transactions on a standby (for reporting or read scaling) can be left open just as easily, and on a standby a long-running query or open transaction can conflict with WAL replay, causing the standby to either delay applying changes or cancel the query (hot_standby_feedback and max_standby_streaming_delay govern this). So idle-in-transaction on a replica has its own distinct risk: it can stall replication apply. The card tracks each instance separately.

Tracked live in Vortex IQ Nerve Centre

Idle-in-Transaction Backends is one of hundreds of KPI pulses Vortex IQ tracks across PostgreSQL and 70+ other ecommerce connectors. Nerve Centre runs the detection layer; Vortex Mind investigates the cause when something moves; Ask Viq lets you interrogate any number in plain English. Start for free or book a demo to see this metric running on your own data.

​At a glance

​Calculation

​Worked example

​Sibling cards

​Reconciling against the source

​Known limitations / FAQs

​Tracked live in Vortex IQ Nerve Centre