Part 4: Debugging Limitations in RDS and Cloud Environments
Series Summary: This is Part 4 of a multi-part series on PostgreSQL autovacuum failures.
Even if you understand the root cause, fixing it isn’t always straightforward, especially in managed environments like Amazon RDS. In this post, we show how limited visibility complicates recovery and how we eventually resolved the issue.
Tracing the Root Cause
After confirming that autovacuum had stalled, we set out to trace the root cause with cluster-wide diagnostics. The previously shared query eventually led us to a lingering temp table whose XIDs were more than 100 million transactions old, a clear sign that something was wrong.
Finding that table, and then identifying the session that owned it, turned out to be the real challenge.
After encountering this issue, we turned to Google and the PostgreSQL mailing list archives. One relevant message from 2008 on the pgsql-hackers list said this behavior is "normal."
Is it technically normal? Yes. Is it ideal for production environments? Definitely not.
Digging further into the PostgreSQL source code and discussions on pgsql-hackers and pgsql-general, we found this behavior is by design:
Autovacuum allocates all available workers to the database with the oldest XID.
This is explicitly noted in the source: Autovacuum Source Code (line 1188)
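You can see which database that will be from any connection in the cluster. The query below is a minimal sketch (not the exact diagnostic shared earlier in this series) that ranks databases by the age of datfrozenxid; the one at the top is where autovacuum will concentrate its workers:

-- Rank databases by transaction ID age; autovacuum prioritizes the one
-- closest to wraparound.
SELECT datname,
       age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;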
A Lightbulb Moment
That’s when the lightbulb went off: this cluster contains multiple databases, so we needed to check all of them.
Sure enough, running the diagnostic query against every database in the cluster turned up the lingering temp table with XIDs more than 100 million transactions old. That was the smoking gun.
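For reference, a query along these lines, connected to each database in turn, is enough to surface temp tables with dangerously old XIDs. It is a simplified sketch rather than the exact query from the earlier post:

-- Run inside each database: list temporary tables by the age of their
-- oldest unfrozen XID.
SELECT n.nspname AS temp_schema,
       c.relname  AS table_name,
       age(c.relfrozenxid) AS xid_age
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relpersistence = 't'   -- temporary relations only
  AND c.relkind = 'r'          -- ordinary tables
ORDER BY xid_age DESC;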
When You’re in Amazon RDS...
The next challenge was identifying which session owned the temp table. This proved far more difficult than expected, primarily because we were working in Amazon RDS for PostgreSQL, which restricts OS-level access. Without shell access, we could not:
- Inspect the pg_temp directory
- Match temp file names to backend PIDs
- Identify long-lived sessions by open files
That meant we had to take an educated guess.
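What we could still do was work from inside the database itself. The sketch below (column choices are illustrative) lists the longest-lived client backends, on the theory that the session that has been connected the longest is the most likely owner of a stale temp table:

-- Oldest connections first: a long-lived session is the most likely
-- owner of a lingering temp table.
SELECT pid,
       usename,
       datname,
       backend_start,
       now() - backend_start AS connected_for,
       state
FROM pg_stat_activity
WHERE backend_type = 'client backend'   -- PostgreSQL 10+
ORDER BY backend_start
LIMIT 10;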
The Fix
We contacted the client and requested permission to terminate the sessions that were likely holding the temp table open. Once approved, we ran the following for each suspect session, substituting its actual PID for pid:
SELECT pg_terminate_backend(pid);
Lo and behold! Just like that, autovacuum sprang back to life. Victory!
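If you want to verify that kind of recovery yourself, autovacuum workers are visible from SQL. A small check like this (assuming PostgreSQL 10 or later, where backend_type is available) shows them at work:

-- Autovacuum workers appear as their own backend type; seeing rows here
-- means autovacuum is making progress again.
SELECT pid, datname, query
FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker';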
The Culprit: A Persistent Temp Table
The session holding the temp table open never cleaned up after itself. It repeatedly executed the following commands from within a PL/pgSQL function:
CREATE TEMP TABLE IF NOT EXISTS temp_table (...);  -- column list omitted
TRUNCATE temp_table;
Because the session remained open indefinitely, the temp table was never dropped. In PostgreSQL, temp tables live for the entire duration of a session. If that session never ends, the table never disappears.
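There are a couple of ways the application side can avoid this. The sketch below is illustrative only, not what the client ultimately changed: either drop the temp table explicitly when the work is done, or create it with ON COMMIT DROP so PostgreSQL removes it when the enclosing transaction commits (inside a PL/pgSQL function, that means the calling transaction):

-- Option 1: drop the temp table explicitly once the work is done,
-- instead of letting it live for the rest of the session.
CREATE TEMP TABLE IF NOT EXISTS temp_table (id int);  -- illustrative columns
TRUNCATE temp_table;
-- ... do the work ...
DROP TABLE IF EXISTS temp_table;

-- Option 2: let PostgreSQL drop it automatically when the enclosing
-- transaction commits.
CREATE TEMP TABLE temp_table (id int) ON COMMIT DROP;  -- illustrative columns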
Monitoring Would Have Prevented This
Once we understood why autovacuum wasn’t progressing, it was clear how much time could have been saved if the client had allowed us to implement our custom-built monitoring solution from the start.
Our system continuously scans all databases and tables within a cluster and flags any that are in a critical vacuum state.
We’ve seen this scenario multiple times — clients unknowingly leave sessions open for weeks or months, with temp tables accumulating unmaintained XIDs.
Our platform would have raised an alert such as:
ALERT XXX temp_table has not been vacuumed in a long time.
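The platform itself is more involved than a single query, but the core check behind an alert like that one is straightforward. Here is a minimal sketch, run against each database in the cluster; the 200 million threshold is illustrative and simply mirrors the default autovacuum_freeze_max_age:

-- Run against each database: flag any relation whose XID age has crossed
-- the warning threshold (200,000,000 is the default autovacuum_freeze_max_age).
SELECT c.oid::regclass AS relation,
       age(c.relfrozenxid) AS xid_age
FROM pg_class c
WHERE c.relkind IN ('r', 't', 'm')          -- tables, TOAST tables, matviews
  AND age(c.relfrozenxid) > 200000000
ORDER BY xid_age DESC;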
⬅️ Back: Part 3 – Why Autovacuum Stops
➡️ Next: Part 5 – Temp Tables and XID Wraparound in Single-DB Clusters (coming July 8th)
Need Help?
Command Prompt is the world’s oldest dedicated Postgres services and consulting company, offering expert support for performance optimization and troubleshooting. Contact us today.
Can't wait for the full series? Receive a link to download the full series as a white paper by submitting your contact information below.
This whitepaper examines PostgreSQL autovacuum failures, focusing on temp tables and multi-database edge cases that cause bloat, slowdowns, or XID wraparound risks in complex environments.
Schedule a call with our Founder: https://calendar.app.google/wXgXkHoiFxHwW7KA6