At a glance
The number of backend connections sitting in theidle in transactionstate: a client opened a transaction withBEGIN, did some work, and then stopped, holding the transaction open without committing or rolling back. This is one of PostgreSQL’s most distinctive and dangerous failure modes. An idle-in-transaction backend keeps locks held, pins the transaction horizon so autovacuum cannot reclaim dead tuples, and occupies a connection slot. A handful is normal churn; a sustained pile of them is a leaking application connection pool slowly strangling the database.
| What it tracks | The count of backends where pg_stat_activity.state = 'idle in transaction' (and the related idle in transaction (aborted)). “Stuck transactions hold locks and bloat the pool. PostgreSQL-distinctive risk.” |
| Data source | pg_stat_activity, filtered to state = 'idle in transaction'. The drill-down also reports xact_start (when the open transaction began), state_change (when it went idle), wait_event, and the application_name so you can attribute the leak to a specific service. |
| Time window | RT (real-time, evaluated on the live polling cycle). |
| Alert trigger | > 10 sustained 5m. More than ten idle-in-transaction backends held continuously for five minutes pages the on-call DBA. The “sustained” qualifier matters: brief spikes during normal commit churn are ignored; a plateau is the signal. |
| Why it matters in PostgreSQL specifically | Because of MVCC, an open transaction pins the oldest visible transaction ID. Autovacuum cannot remove any dead tuple newer than that horizon, so a single long-idle transaction can block vacuum across the entire cluster, causing table bloat and, in the extreme, transaction-ID wraparound risk. |
| What does NOT count | Plain idle backends (connected but with no open transaction, which are harmless and expected from a pool), active backends (running a query right now), and idle in transaction (aborted) is counted but flagged separately because it cannot do further work until rolled back. |
| Roles | owner, engineering, operations |
Calculation
The headline is a straight count frompg_stat_activity:
now() - xact_startis the age of the open transaction. A backend idle in transaction for 200 milliseconds is fine; one idle for 45 minutes is holding the vacuum horizon back by 45 minutes.now() - state_changeis how long it has been idle in this transaction specifically (since the last statement finished).application_nameandclient_addrattribute the leak to a service or host, which is how you find the offending code path.pidis what you feed topg_terminate_backend(pid)if you need to kill it.
xact_start among all idle-in-transaction backends, because that is the one pinning the cluster-wide vacuum horizon. The engine surfaces it prominently: a high count of short idle transactions is annoying, but one ancient idle transaction is what actually blocks vacuum and bloats your tables.
Worked example
A platform team runs the primary database behind an order-processing API. The API uses an ORM with an application-side connection pool of 40 connections. Snapshot taken on 03 Jun 26 at 14:20 BST after the card paged at 14:13.application_name | Count idle-in-tx | Oldest xact_start | Idle for |
|---|---|---|---|
| order-api | 23 | 13:51 | 29 min |
| order-api | (same pool) | 14:02 | 18 min |
| reporting-worker | 2 | 14:18 | 90 s |
pg_stat_user_tables confirms n_dead_tup on the busy orders table is climbing steadily because vacuum cannot reclaim anything committed after 13:51.
Reading the detail, the pattern is unmistakable: 23 of the 25 belong to a single service, order-api. The team pulls the most recent deploy and finds the culprit: a code change wrapped an external payment-gateway call inside an open database transaction.
- Stop the bleeding now. Set a guard so these cannot accumulate forever:
ALTER SYSTEM SET idle_in_transaction_session_timeout = '30s'; SELECT pg_reload_conf();. Any transaction idle longer than 30 seconds is terminated by PostgreSQL automatically. This protects the database while the code is fixed. - Fix the code. Move the external gateway call outside the transaction. Open a transaction only to write the result: read the order, call the gateway with no open transaction, then
BEGIN; UPDATE ...; COMMIT;to record the outcome. Network calls must never sit inside an open database transaction.
n_dead_tup on orders starts dropping at the next autovacuum pass.
Three lessons platform teams should carry:
- Idle-in-transaction is almost always an application bug, not a database problem. The fix lives in the code: a transaction left open across a slow operation (a network call, a user think-time, a forgotten commit). The database is the victim, not the cause.
- The oldest transaction matters more than the count. Twenty backends idle for two seconds each are a minor pool nuisance. One backend idle for 40 minutes blocks vacuum cluster-wide and is the real emergency. Read the oldest
xact_start, not just the headline number. idle_in_transaction_session_timeoutis your seatbelt. Setting a sane timeout (often 30 to 60 seconds for OLTP) means a leaking code path cannot pin the vacuum horizon indefinitely. It is a safety net, not a substitute for fixing the code, but every production primary should have one.
Sibling cards
| Card | Why pair it with Idle-in-Transaction Backends | What the combination tells you |
|---|---|---|
| Connection Pool Saturation % | Idle-in-tx backends occupy pool slots without doing work. | Rising idle-in-tx plus rising saturation equals the leak is eating your connection budget. |
| Connections In Use | The raw connection count that idle-in-tx inflates. | Connections high but QPS flat is a classic idle-in-tx fingerprint. |
| Oldest Autovacuum Age (hours) | A long idle transaction pins the vacuum horizon. | High vacuum age co-occurring with idle-in-tx confirms the transaction is blocking vacuum. |
| Top Tables by Dead Tuples | Blocked vacuum lets dead tuples accumulate. | Dead tuples climbing while idle-in-tx is high equals bloat caused by the stuck transaction. |
| Deadlocks (last 5m) | Idle-in-tx backends hold locks, raising contention. | Idle-in-tx plus deadlocks equals long-held locks are colliding with active work. |
| Query Error Rate % | Aborted idle transactions and lock waits surface as errors. | Error rate rising alongside idle-in-tx points to lock-wait timeouts. |
| Connection Pool at >90% Saturation | The alert that fires when leaked connections fill the pool. | An entry here often traces straight back to idle-in-transaction leakage. |
Reconciling against the source
Where to look in PostgreSQL:The live count:Why our number may legitimately differ from a manual query:SELECT count(*) FROM pg_stat_activity WHERE state = 'idle in transaction';The offenders, oldest first:SELECT pid, application_name, client_addr, state, now() - xact_start AS tx_age, now() - state_change AS idle_for, wait_event FROM pg_stat_activity WHERE state IN ('idle in transaction', 'idle in transaction (aborted)') ORDER BY xact_start;The vacuum horizon they pin: compare the oldestxact_startagainstSELECT datname, age(datfrozenxid) FROM pg_database;to see the wraparound pressure they are causing. Kill one safely:SELECT pg_terminate_backend(<pid>);rolls back the open transaction and frees the slot. Managed services: RDS / Aurora and Cloud SQL all exposepg_stat_activitydirectly via a normal client connection; there is no provider abstraction for this metric.
| Reason | Direction | Why |
|---|---|---|
| Sampling moment | Either way | pg_stat_activity is a live snapshot; a count taken at a different instant catches a different set of in-flight transactions, especially under heavy commit churn. |
| Sustain filter | We may read lower for alerting | The headline count is instantaneous, but the alert only fires when the count stays above 10 for the full five minutes; a manual query catching a transient spike will not match the alert state. |
| Aborted state inclusion | We may read higher | We count idle in transaction (aborted) alongside idle in transaction; a query filtering only the former misses the aborted ones. |
| PgBouncer pooling | Possible mismatch | In transaction-pooling mode PgBouncer multiplexes client transactions onto fewer server connections; what the app sees and what the server reports can differ. |
| Source | Expected relationship | What causes divergence |
|---|---|---|
pg_stat_activity count | Should match the headline at the same instant | Snapshot timing under churn. |
| Application pool metrics | Should roughly correlate | The app counts checked-out connections; not all checked-out connections are idle-in-transaction, only the leaking ones. |
PgBouncer SHOW POOLS | Server-side connection view | Transaction pooling decouples client and server connection counts. |
Known limitations / FAQs
What is the difference betweenidle and idle in transaction?
A plain idle backend is connected but has no transaction open: it has committed or rolled back its last work and is waiting for the next statement. That is completely normal and expected from a connection pool. An idle in transaction backend has an open transaction (it ran BEGIN or an implicit transaction and then stopped) and is holding locks and pinning the vacuum horizon while doing nothing. The first is harmless; the second is the one this card warns about.
Why does one idle transaction cause so much damage?
Because of MVCC. PostgreSQL keeps old row versions visible to any transaction that might still need them, determined by the oldest open transaction’s snapshot. An idle-in-transaction backend keeps that horizon frozen, so autovacuum cannot remove any dead tuple created after it started, anywhere in the cluster. One long-idle transaction can therefore bloat every busy table and, left for days, push the database toward transaction-ID wraparound. The damage is wildly out of proportion to the single stuck connection.
Should I just kill them with pg_terminate_backend?
Killing is a valid emergency action: it rolls back the open transaction and frees the slot. But it treats the symptom, not the cause. If a code path is leaking transactions, the backends will pile right back up. The durable fix is in the application (do not hold a transaction open across slow operations) plus a safety net in the database (idle_in_transaction_session_timeout). Reach for pg_terminate_backend to stop an active emergency, then fix the code.
What is a sensible value for idle_in_transaction_session_timeout?
For OLTP workloads, 30 to 60 seconds is a common starting point: long enough not to interrupt legitimate transactions, short enough to reap genuine leaks before they pin vacuum for long. Batch or ETL workloads that legitimately hold long transactions may need a higher per-session value set explicitly for those connections. Setting it cluster-wide to a sane default is one of the highest-value protective settings on a production primary.
The count spikes briefly to 15 then falls back. Is that a problem?
Probably not. Brief spikes during peak commit churn are normal, which is exactly why the alert requires the count to stay above 10 for a sustained five minutes. A momentary blip that clears on its own is the pool doing its job. A plateau that does not clear is the leak you need to act on. Watch for sustained elevation and a climbing oldest xact_start, not transient peaks.
Does PgBouncer in transaction-pooling mode hide this problem?
It can change where you see it. In transaction-pooling mode, PgBouncer returns a server connection to the pool at transaction end, so an application that leaks a transaction holds a PgBouncer client slot and, behind it, a server connection that stays idle-in-transaction. The server-side count this card reads still rises, but you may also need PgBouncer’s SHOW POOLS and SHOW CLIENTS to attribute it. Transaction pooling does not prevent idle-in-transaction; it just adds a layer to trace through.
Can read replicas have idle-in-transaction backends?
Yes. Read-only transactions on a standby (for reporting or read scaling) can be left open just as easily, and on a standby a long-running query or open transaction can conflict with WAL replay, causing the standby to either delay applying changes or cancel the query (hot_standby_feedback and max_standby_streaming_delay govern this). So idle-in-transaction on a replica has its own distinct risk: it can stall replication apply. The card tracks each instance separately.