Another day, another recovery

This is something I have seen many times now: a customer calls us because they lost some data, and they want help recovering it.

Now you must be wondering: surely if they lost data they can just recover from their last backup, right? Right: they had that. However, we know that pg_dump takes a while to run and is stressful on the server, so it's normally run just once a day or so. What happens if almost a full work day has passed since your last backup? That's a lot of data to lose.

And in turn you must be wondering: surely they could easily run a standby server with some form of WAL replication (streaming or continuous archive recovery) that receives the data from the master in a more timely fashion. Yes, they had that too. Or at least, they thought they had ... and this is where things start to go awry.

So why, you continue to wonder, did they think they had a working replica when in fact they didn't? That's easy to answer: they had proper monitoring in place. At least, they thought it was proper; in fact it wasn't, because the monitoring script contained some sort of error and was not working at all. It reported that the replica was healthy when it was not.
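
As an aside, a replica health check doesn't need to be elaborate. A minimal sketch, assuming a standby that accepts read-only connections and using an invented host name, is to ask the standby itself how far behind it is:

    # run against the standby; both functions exist in 9.1 and later
    psql -h standby.example.com -At -c \
        "SELECT pg_is_in_recovery(),
                now() - pg_last_xact_replay_timestamp() AS replay_lag"

The crucial part is not the query but the failure mode: if the connection or the query fails, the check must scream, not report success.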

The reality was that their replica had broken forty-five days earlier and they hadn't noticed. Yes, really: 45 days.

But the master was still happily working all that time, and they had no problems at all ... except that their disk was slowly filling up. They had this little cleanup script that removed xlog segments that were older than 30 days. This script was run by cron regularly.
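
A hypothetical reconstruction of such a cron job (invented path, not their actual script) might look like this; note that it filters by age only, trusting the caller to point it at the right directory:

    # hypothetical reconstruction; the archive path is invented
    # (run daily from cron)
    find /var/lib/pgsql/wal_archive -type f -mtime +30 -delete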

Now you have to understand that this is a pretty efficient script. You tell it to free space, and it frees a lot of space. So they ran it manually, and it freed space; so much space, in fact, that the database stopped working and wouldn't start. Not precisely the most comforting of situations, regardless of the amount of free space you now have.

It turns out that instead of running it on the xlog archive directory, they had run it on the Postgres base directory.

To recapitulate, by now I'm counting five mistakes:

  1. Broken replica monitoring that wrongly reports a broken replica as being in good health
  2. Breaking of a replica server
  3. Ill-conceived xlog archival design that consumes arbitrary space on the master
  4. A data removal script that does not carefully filter files to delete
  5. A DBA running said script under duress, who is likely to make a mistake, making exactly that mistake

And that's where we, Command Prompt Inc., enter the picture.

If you're a Postgres hacker familiar with the underlying storage layout, recovering from this type of breakage is relatively easy. If you're a user, it's impossible. They're lucky that they had that 45-day-old replica. Without it, recovery would have been a lot harder (it is possible to reconstruct enough catalog state by hand to let pg_dump work on such a database, but it's a lot more laborious).

What I did was copy enough of the files corresponding to the heaps of the system catalogs from the replica into the master, until a standalone backend was able to start and run a REINDEX SYSTEM. I couldn't just copy all the files, because that would have overwritten catalogs that had been touched sometime in the last 30 days, which I didn't want: only the files deleted by the script needed replacing. Fortunately, no catalog had been modified in the 15-day window between the replica breaking and the 30-day deletion cutoff, so the replica's copies of those deleted files were still current.
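
In very rough terms, and with invented paths, database name, database OID, and relfilenode numbers, that step could have looked something like this (a sketch of the idea, not the exact commands used):

    # copy the heap file of a damaged catalog from the replica; 1259 is pg_class's
    # relfilenode in a freshly created database, 16384 an invented database OID
    scp replica:/var/lib/pgsql/data/base/16384/1259 /var/lib/pgsql/data/base/16384/
    # ... repeat for whatever other catalogs the backend complains about ...

    # then start a standalone backend, ignoring the (possibly broken) system
    # indexes, and rebuild them
    postgres --single -P -D /var/lib/pgsql/data customerdb <<'EOF'
    REINDEX SYSTEM customerdb
    EOF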

When that was done, it was possible to start the postmaster and examine the status of the tables. This turned up some more errors, this time related to missing pg_clog files, and one related to a truncated pg_multixact file. I dealt with those by copying pg_clog files from the slave (I had to extend one using dd, because some blocks weren't present on the slave either) and by creating a dummy pg_multixact file. Extending the pg_clog file was the one piece of guessing I did: I examined the other pg_clog files to see how frequent transaction aborts were compared to commits. Commits were twenty times more frequent than aborts, so I decided to simply fill the missing pages with the 0x55 pattern, which corresponds to "all transactions committed."
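
For the curious, extending a clog file this way is just a matter of appending bytes: each transaction's status takes two bits, and 01 means "committed", so a byte of 0x55 marks four consecutive transactions as committed. A sketch with an invented path and segment name, padding the file out to the full 256 kB segment size (32 pages of 8 kB):

    # invented path and segment name; pad the file with 0x55 ("committed") bytes
    # up to the full 256 kB segment size
    SEG=/var/lib/pgsql/data/pg_clog/00A3
    CUR=$(stat -c %s "$SEG")
    dd if=/dev/zero bs=1 count=$((262144 - CUR)) | tr '\000' '\125' >> "$SEG"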

With this, some tables would raise errors about missing underlying files when queried. Other tables would work fine. The next step was obvious: produce one dump for each table, discard those that failed, keep the good ones. The theory of operation was that older files, those that were erased by the script, corresponded to tables that had not been touched in a long time. So they could be recovered from the nightly pg_dump without loss of data. All other tables had recent activity, and so the new dumps would take precedence over the nightly dump.
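
The per-table dumping is easily scripted; a rough sketch, with an invented database name and output directory, keeping only the dumps whose pg_dump run succeeded:

    mkdir -p /tmp/table_dumps
    psql -At -c "SELECT quote_ident(schemaname) || '.' || quote_ident(tablename)
                   FROM pg_tables
                  WHERE schemaname NOT IN ('pg_catalog', 'information_schema')" customerdb |
    while read -r tab; do
        if pg_dump -t "$tab" -f "/tmp/table_dumps/$tab.sql" customerdb; then
            echo "good: $tab"
        else
            echo "bad:  $tab"
            rm -f "/tmp/table_dumps/$tab.sql"
        fi
    done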

(This would have failed if some tables had more than one segment and some of their segments had been deleted while others had been kept. I examined the table sizes recorded in pg_class and concluded that this hadn't happened.)
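
Checking for that boils down to looking for tables larger than 1 GB, which is where the heap is split into additional segment files. Since the files themselves may be gone, relpages in pg_class (an estimate maintained by VACUUM and ANALYZE) is what's left to look at; 131072 pages of 8 kB is 1 GB. A sketch, with an invented database name:

    # tables whose last known size exceeded one 1 GB segment (estimate only)
    psql -c "SELECT oid::regclass AS table_name, relpages
               FROM pg_class
              WHERE relkind = 'r' AND relpages > 131072" customerdb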

From there it's easy to proceed: just restore each table selectively, from either the nightly dump or the separate dumps I provided. This is tedious but relatively easy, and they took over at this point. They were happy to tell me, the next day, that they had gotten all their data back. (If you attempt this at home, make sure you pg_dump the resulting database and then restore that. Doing so ensures that the system catalogs are consistent and that the data you restore passes all the declared constraints. I hope they did not skip this step.)
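
In concrete terms, that sanity pass can be as simple as dumping the recovered database into a freshly created one and stopping at the first error (database names invented):

    createdb customerdb_verify
    pg_dump customerdb | psql --set=ON_ERROR_STOP=1 customerdb_verify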

I guess there must be lessons for them somewhere in this whole episode; I have left those for them to learn. The lesson I learned was this: we need a utility that helps us dump the pg_filenode.map file.
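
Until such a utility exists, one can at least peek at the file by hand: it is a small fixed-size structure (see src/backend/utils/cache/relmapper.c) holding, roughly, a magic number, a mapping count, and pairs of catalog OID and relfilenode. Dumping it as raw 32-bit integers is crude, but enough to eyeball in an emergency (database OID invented):

    od -A d -t u4 /var/lib/pgsql/data/base/16384/pg_filenode.map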