Part 4: Debugging Limitations in RDS and Cloud Environments
Series Summary: This is Part 4 of a multi-part series on PostgreSQL autovacuum failures.
Even if you understand the root cause, fixing it isn’t always straightforward, especially in managed environments like Amazon RDS. In this post, we show how limited visibility complicates recovery and how we eventually resolved the issue.
Tracing the Root Cause
After confirming that autovacuum had stalled, we set out to trace the root cause with cluster-wide diagnostics. The previously shared query eventually led us to a lingering temp table whose XIDs were more than 100 million transactions old, a clear sign that something was wrong.
Finding that table, and then identifying the session that owned it, turned out to be the real challenge.
After encountering this issue, we turned to Google and the PostgreSQL mailing list archives. One relevant message from 2008 on the pgsql-hackers list said this behavior is "normal."
Is it technically normal? Yes. Is it ideal for production environments? Definitely not.
Digging further into the PostgreSQL source code and discussions on pgsql-hackers and pgsql-general, we found this behavior is by design:
Autovacuum allocates all available workers to the database with the oldest XID.
This is explicitly noted in the source: Autovacuum Source Code (line 1188)
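You can see which database that will be from any connection in the cluster. The query below is a minimal sketch (not the exact diagnostic shared earlier in this series) that ranks databases by the age of datfrozenxid; the one at the top is where autovacuum will concentrate its workers:

-- Rank databases by transaction ID age; autovacuum prioritizes the one
-- closest to wraparound.
SELECT datname,
       age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;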
A Lightbulb Moment
That’s when the lightbulb went off: this cluster contains multiple databases, so we needed to check all of them.
Sure enough, running the diagnostic query against every database in the cluster turned up the lingering temp table with XIDs more than 100 million transactions old. That was the smoking gun.
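For reference, a query along these lines, connected to each database in turn, is enough to surface temp tables with dangerously old XIDs. It is a simplified sketch rather than the exact query from the earlier post:

-- Run inside each database: list temporary tables by the age of their
-- oldest unfrozen XID.
SELECT n.nspname AS temp_schema,
       c.relname  AS table_name,
       age(c.relfrozenxid) AS xid_age
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relpersistence = 't'   -- temporary relations only
  AND c.relkind = 'r'          -- ordinary tables
ORDER BY xid_age DESC;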
When You’re in Amazon RDS...
The next challenge was identifying which session owned the temp table. This proved far more difficult than expected, primarily because we were working in Amazon RDS for PostgreSQL, which restricts OS-level access. Without shell access, we could not:
- Inspect the pg_temp directory
- Match temp file names to backend PIDs
- Identify long-lived sessions by open files
That meant we had to take an educated guess.
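What we could still do was work from inside the database itself. The sketch below (column choices are illustrative) lists the longest-lived client backends, on the theory that the session that has been connected the longest is the most likely owner of a stale temp table:

-- Oldest connections first: a long-lived session is the most likely
-- owner of a lingering temp table.
SELECT pid,
       usename,
       datname,
       backend_start,
       now() - backend_start AS connected_for,
       state
FROM pg_stat_activity
WHERE backend_type = 'client backend'   -- PostgreSQL 10+
ORDER BY backend_start
LIMIT 10;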
The Fix
We contacted the client and requested permission to terminate the sessions that were likely holding the temp table open. Once approved, we ran the following for each suspect session, substituting its actual PID for pid:
SELECT pg_terminate_backend(pid);
Lo and behold! Just like that, autovacuum sprang back to life. Victory!
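If you want to verify that kind of recovery yourself, autovacuum workers are visible from SQL. A small check like this (assuming PostgreSQL 10 or later, where backend_type is available) shows them at work:

-- Autovacuum workers appear as their own backend type; seeing rows here
-- means autovacuum is making progress again.
SELECT pid, datname, query
FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker';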
The Culprit: A Persistent Temp Table
The session holding the temp table open never cleaned up after itself. It repeatedly executed the following commands from within a PL/pgSQL function:
CREATE TEMP TABLE IF NOT EXISTS temp_table (...);  -- column list omitted
TRUNCATE temp_table;
Because the session remained open indefinitely, the temp table was never dropped. In PostgreSQL, temp tables live for the entire duration of a session. If that session never ends, the table never disappears.
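There are a couple of ways the application side can avoid this. The sketch below is illustrative only, not what the client ultimately changed: either drop the temp table explicitly when the work is done, or create it with ON COMMIT DROP so PostgreSQL removes it when the enclosing transaction commits (inside a PL/pgSQL function, that means the calling transaction):

-- Option 1: drop the temp table explicitly once the work is done,
-- instead of letting it live for the rest of the session.
CREATE TEMP TABLE IF NOT EXISTS temp_table (id int);  -- illustrative columns
TRUNCATE temp_table;
-- ... do the work ...
DROP TABLE IF EXISTS temp_table;

-- Option 2: let PostgreSQL drop it automatically when the enclosing
-- transaction commits.
CREATE TEMP TABLE temp_table (id int) ON COMMIT DROP;  -- illustrative columns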
Monitoring Would Have Prevented This
Once we understood why autovacuum wasn’t progressing, it was clear how much time could have been saved if the client had allowed us to implement our custom-built monitoring solution from the start.
Our system continuously scans all databases and tables within a cluster and flags any that are in a critical vacuum state.
We’ve seen this scenario multiple times — clients unknowingly leave sessions open for weeks or months, with temp tables accumulating unmaintained XIDs.
Our platform would have raised an alert such as:
ALERT XXX temp_table has not been vacuumed in a long time.
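The platform itself is more involved than a single query, but the core check behind an alert like that one is straightforward. Here is a minimal sketch, run against each database in the cluster; the 200 million threshold is illustrative and simply mirrors the default autovacuum_freeze_max_age:

-- Run against each database: flag any relation whose XID age has crossed
-- the warning threshold (200,000,000 is the default autovacuum_freeze_max_age).
SELECT c.oid::regclass AS relation,
       age(c.relfrozenxid) AS xid_age
FROM pg_class c
WHERE c.relkind IN ('r', 't', 'm')          -- tables, TOAST tables, matviews
  AND age(c.relfrozenxid) > 200000000
ORDER BY xid_age DESC;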
⬅️ Back: Part 3 – Why Autovacuum Stops
➡️ Next: Part 5 – Temp Tables and XID Wraparound in Single-DB Clusters (coming July 8th)
Need Help?
Command Prompt is the world’s oldest dedicated Postgres services and consulting company, offering expert support for performance optimization and troubleshooting. Contact us today.
Can't wait for the full series? Receive a link to download the full series as a white paper by submitting your contact information below.
This whitepaper examines PostgreSQL autovacuum failures, focusing on temp tables and multi-database edge cases that cause bloat, slowdowns, or XID wraparound risks in complex environments.
Schedule a call with our Founder: https://calendar.app.google/wXgXkHoiFxHwW7KA6