When an 18TB restore just stopped

Sometimes the fix for production exists only in git.

A customer asked us to restore an 18TB PostgreSQL database to a specific point in time. Routine work. Tested runbook, pgBackRest doing its job, backups and WAL right where they should be.

Then WAL replay during recovery just stopped.

No error. Nothing in the logs, even at debug5. No I/O. No CPU. The startup process sitting there doing nothing at all.
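One way to confirm "no I/O, no CPU" rather than eyeball it is to snapshot the process counters twice and compare. The following is a minimal Python sketch, assuming a Linux procfs; the function names and the `proc_root` parameter are illustrative, not part of any PostgreSQL or pgBackRest tooling. Reading another process's `io` file typically requires running as the same OS user or as root.

```python
from pathlib import Path


def activity_snapshot(pid, proc_root="/proc"):
    """Collect CPU ticks and I/O counters for one process from procfs."""
    base = Path(proc_root) / str(pid)

    # The comm field may contain spaces or parentheses, so split after the last ')'.
    fields = (base / "stat").read_text().rsplit(")", 1)[1].split()
    # After comm, index 0 is the state (field 3 in proc(5)), which puts
    # utime and stime (fields 14 and 15) at indexes 11 and 12.
    snap = {"cpu_ticks": int(fields[11]) + int(fields[12])}

    # /proc/<pid>/io is a list of "name: value" lines (rchar, wchar, read_bytes, ...).
    for line in (base / "io").read_text().splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(":")
        snap[key.strip()] = int(value)
    return snap


def is_idle(before, after):
    """True if no CPU time was consumed and no I/O counter moved between snapshots."""
    return before == after
```

Taking one snapshot of the startup process, sleeping for several seconds, and taking another gives a simple yes/no answer: if `is_idle` returns True over a long enough window during recovery, the process is not slow, it is stopped.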

So we went where the logs could not. /proc/<pid>/wchan on the startup process showed it parked on a futex, in futex_wait_queue. The database had deadlocked against itself while replaying its own WAL.
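The same check can be scripted. This is a minimal Python sketch, assuming a Linux procfs; `read_wait_info`, `looks_parked_on_futex`, and the `proc_root` parameter are hypothetical helpers written for illustration. On most systems it needs to run as the postgres OS user or root, otherwise the kernel may report the wait channel as 0.

```python
from pathlib import Path


def read_wait_info(pid, proc_root="/proc"):
    """Return (state, wchan) for a process, read from procfs."""
    base = Path(proc_root) / str(pid)

    # /proc/<pid>/stat looks like "1234 (comm) S ...". The comm field may
    # contain spaces or parentheses, so split after the last ')'.
    stat = (base / "stat").read_text()
    state = stat.rsplit(")", 1)[1].split()[0]

    # wchan names the kernel function the task is sleeping in,
    # or "0" when the task is not blocked in the kernel.
    wchan = (base / "wchan").read_text().strip()
    return state, wchan


def looks_parked_on_futex(state, wchan):
    """True for a sleeping task whose wait channel is a futex function."""
    return state == "S" and wchan.startswith("futex")
```

A process that stays in state S on a futex wait channel, with no CPU or I/O activity, is waiting on a lock that nobody is going to release. That is a hint about where to look, not proof of a deadlock; the source still has to confirm it.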

The trail led to MultiXact handling. The bug affects PostgreSQL 14, 15, and 16. It was introduced in 14.23, 15.18, and 16.14 by the change titled "Fix multixact backwards-compatibility with CHECKPOINT race condition." Closing one race opened a self-deadlock when replaying WAL generated by an older minor version.
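As a rough pre-flight check before a restore, the version condition can be written down in a few lines. This is only a Python sketch built from the version numbers quoted above; the function names are made up for illustration, it is a coarse heuristic rather than a diagnosis, and it has no upper bound because the fixed release is not named here. Once the fix ships, the check would need one.

```python
# First minor release per major version that carries the change,
# taken from the versions quoted in the text above.
INTRODUCED_IN = {14: 23, 15: 18, 16: 14}


def parse_version(text):
    """Turn a string such as '15.18' into the tuple (15, 18)."""
    major, minor = text.strip().split(".")[:2]
    return int(major), int(minor)


def replay_at_risk(recovery_version, wal_version):
    """True if the recovering binaries carry the change while the WAL
    was generated by an older minor release of the same major version."""
    r_major, r_minor = parse_version(recovery_version)
    w_major, w_minor = parse_version(wal_version)

    # Physical WAL replay only happens within one major version.
    if r_major != w_major or r_major not in INTRODUCED_IN:
        return False

    first_affected = INTRODUCED_IN[r_major]
    return r_minor >= first_affected and w_minor < first_affected
```

For example, `replay_at_risk("15.18", "15.12")` is True, while restoring with the same minor release that wrote the WAL returns False. Whether a given restore actually hangs depends on what is in the WAL, so a True result means "read the thread before proceeding," nothing more.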

The same class of problem can affect replication between minor versions. The upstream fix moves the lock acquisition later in SimpleLruWriteAll and is slated for the next point release. Thread: https://www.postgresql.org/message-id/flat/4016ab8b-84c0-4606-8a77-67ad5b157bfc%40iki.fi

For the record, pgBackRest was flawless. This was not a backup problem. It was Postgres deadlocking on itself during recovery, and that does not show up in a log file. You find it by reading the process state and the source.

Three takeaways:

  • Stay on top of minor releases and actually read the notes.
  • The patch that closes one race can open another. When the logs go quiet, the answer is still there. /proc, wchan, and the source do not lie.
  • Three decades into PostgreSQL and it still hands us a new one.