Recovering a lost-and-found database

Last week, a company's only PostgreSQL database server suffered an UPS failure. When they found they couldn't connect to it afterwards, they called Command Prompt to investigate.

The most important lesson to be learned here is: you need to make very sure you have backups. Even if the strategy is too simple, or even if they are taken only once a fortnight, they are going to save your neck someday just by being there. In time you will find the way to improve your backups: make them more selective, more frequent, less intrusive, whatever. Not having any backup at all means that if you lose your database, you may be out of business.

This could have very well been the case here, but we worked very hard to ensure this didn't happen to them. Here's the story.

What we were told

The customer received the following error message when trying to connect to the broken database

psql: FATAL:  could not open relation 1663/17409944/1259: No such file or directory

Anyone familiar with the Postgres' system catalogs will immediately realize that something is very, very wrong here: 1259 is the OID and file name of the pg_class catalog. And if that file is missing, who knows what else might be missing as well?

Upon logging in into the machine and listing the database directory, it becomes quite clear that things are going very wrong indeed. Not only is pg_class missing — all the other catalogs, and their indexes, are missing as well.

What we found

Since the server had had an unclean shutdown, the obvious place to look for the missing files was lost+found. This is the directory where the fsck utility stores lost (and found) files that it cannot find a proper directory entry for; one peculiarity of lost+found entries is that since the directory entry is gone, they have lost their file names. fsck restores them to funny-looking names such as #1801237981 where the number is its inode number, and leaves to you the task of figuring out what's the correct name to attach to it (and on which directory to place it.)

The way Postgres stores table data is by using plain files. Each file is just a bunch of data pages. If you can't read binary stuff, it's just a bunch of gibberish. There's nothing on the file itself that tells you what table it belongs to. If the file name is lost, such as in this case, there's no direct way to figure it out again. You're pretty much on your own.

To make the problem slightly worse, we found a whopping total of 480 files in lost+found. Figuring out all of those was going to take quite a while!

What we thought

We're all told at school: the first thing to do when confronted with a huge problem is to break it up into smaller pieces. Then, instead of having to solve a single huge, impossible problem, you have to solve a large bunch of not-so-huge, hopefully tractable problems. And what better way to partition this huge problem than classifying the files.

Now, as I said earlier, Postgres data files just look like binary gibberish to an outsider. To someone in the know, however, they open all their secrets: for example, RedHat's immensely useful pg_filedump. pg_filedump can easily tell if a file is an index file, a sequence, or a table; moreover, it is possible to classify the table files according to the number of columns.

What we did (to sort out the files)

After creating a copy of lost+found that I could play with, I ran this:

mkdir btrees sequences
for file in *; do
	if pg_filedump-8.3 -R 0 $file | grep -q 'Sequence: '; then
		mv $file sequences
for file in *; do
	if pg_filedump-8.3 -R 0 $file | grep -q 'BTree Meta Data'; then
		mv $file btrees

This moved the index and sequence files out of the way. The indexes can be recreated with REINDEX; the sequences are impossible to tell apart from one another, so it's easier to deal with them by setting to the values they need to have according to the data in the tables. But that's for later.

The second thing to do is classify all the remaining table files according to the number of attributes that can be found in each one. This was a bit more involved:

for file in *; do
	attrs=$(~cmd/pg_filedump-8.3 -fi $file |
		grep ' Attributes: [0-9]\+' |
		sed -e 's/.*Attributes: \([0-9]\+\) \+Size.*/\1/' |
		sort -r |
		head -1)
	if [ ! -z "$attrs" ]; then
		mkdir -p $attrs
		mv $file $attrs

What we did (to restore the catalogs)

After this has been done, what you have is a bunch of subdirectories named after a number, which is the number of attributes that all the tables whose files ended up in there have. So if you're looking for pg_class file, you need to look in subdir 27 (our customer was running 8.2). More likely than not, there will be multiple files in that directory. The easiest way (that I know) to figure out which file it is is to look at all of them with pg_filedump again (this time with the -fi options), and look for well-known strings in the data area; for pg_class, the name pg_class itself suffices, though the name of any table would do. Once you have identified the file, all you have to do is move it in place (17409944 is the database OID and directory name, and 1259 is, remember, pg_class's file name):

mv /data/lost+found.copy/#175254092 /var/lib/pgsql/data/17409944/1259

Restoring most catalogs is not hard if you follow this technique. But you're not at a point where this database is usable yet: attempts to connect to it will still fail, because you haven't restored the indexes of those catalogs. Fortunately, doing this is simple: just start a standalone backend with indexes disabled and run a REINDEX SYSTEM of the database in question:

postgres --single -P -D /var/lib/pgsql/data the_affected_db
PostgreSQL stand-alone backend 8.2.20
backend>  reindex system the_affected_db;

NOTICE:  table "pg_class" was reindexed
NOTICE:  table "pg_authid" was reindexed
NOTICE:  table "pg_statistic" was reindexed

I didn't save the exact output here, but if I remember correctly, this failed initially and I had to reindex one catalog separately first (pg_type I think). Also, I think it emitted a warning message for each index that should have been there but wasn't, saying that file such-and-such could not be removed, but this is harmless.

After you've restored all the catalogs, the database can be started normally. You can't query the user tables yet, because they are gone — but at least you can query the catalogs which is enough to know what tables there should be, and particularly how many columns each one has.

What we did (to restore user tables)

This query proved very useful:
select relnamespace, relname, relhasoids, relfilenode,
       relpages, relpages * 8192 as size, reltuples,
       (select array_to_string(array(select attlen
				       from pg_attribute
				      where attrelid = pg_class.oid and
                                            attnum > 0),
                               ',')) as attrlens,
      pg_ls_dir as file
from pg_class left join
    (select *
       from pg_ls_dir('base/17409944')
      where pg_ls_dir ~ '^[0-9]*$') foo
              on (relfilenode = pg_ls_dir::oid)
where relkind in ('r', 't') and relnatts = :natts
order by relpages desc;
This query returns, for each table, not only the basic information in the pg_class catalog, but also whether it has OIDs and whether there is already a file on disk for that table; and lastly, the length of all of its attributes. For example, in a fresh database I see this with natts set to 5:

result of query above
relnamespace relname relhasoids relfilenode relpages size reltuples attrlens file
11 pg_amop f 2602 3 24576 445 4,4,2,1,4 2602
10621 sql_packages f 10752 1 8192 10 -1,-1,-1,-1,-1 10752
10621 sql_parts f 10757 1 8192 9 -1,-1,-1,-1,-1 10757
10621 sql_implementation_info f 10742 1 8192 12 -1,-1,4,-1,-1 10742
10621 sql_sizing_profiles f 10767 0 0 0 4,-1,-1,4,-1 10767

(The reason I also display the relhasoids column is that some tables had OIDs but not all. I guess the database was started in a Postgres release that had OIDs enabled by default, and later moved to one that had them disabled.)

The attrlens columns is interesting: for example, a table with three integer columns (which would be displayed as 4,4,4) should only have 12 bytes of data in each tuple (not counting header); so if I look into a file with pg_filedump and there's a tuple with 16 bytes, I know it must belong to some other table. If I see a boolean column in the last position, there is a most obvious trailing lone byte in each tuple. And so on. In most cases, this examination of tuple length in pg_filedump allowed me to determine the correct file for each table.

The tuple's “infomask” bits are also useful. A table with only fixed-size columns (i.e. no -1 in it, indicative of a “varlena” column such as text, varchar or numeric), for example, will not show up in pg_filedump as having the HAS_VARWIDTH bit set. Tables where all columns are NOT NULL cannot have tuples that have the HASNULL bit set. In a few cases, I had to resort to decoding the null bitmask in order to compare it with the nullable columns that each table is declared to have. And there's the OID business I mentioned above.

Another cue was the table size in pages and tuples as stored in pg_class.reltuples and pg_class.relpages. These numbers are not entirely trustworthy, because they are only updated by commands such as VACUUMM and ANALYZE (in some cases I found discrepancies up to 20% with the actual data in the table); however they should be the same ballpark.

I also resorted to examining string data and some common sense; for example, a table called “instruments” could very well hold a row that says “spectrograph” in it.

In a few cases, however, things were a bit more difficult: for example, there were several tables with three integer columns. In those cases, the only resource I had was to look into the data and try to figure out if it matched what the foreign keys were telling me. This was the most time-consuming part of the job. I assume that matching TOAST tables to their respective main tables would be complex. Fortunately they didn't have many TOAST tables either.

In all cases, after finding a relationship between table and file, all that remained was to move the file in place and run a REINDEX of that table. If the server was up I could ran a quick query to determine that the data was valid (I figured that if I messed up badly, the server would crash upon trying to read bad data. This didn't happen to me this time; maybe it was pure luck). And after a few tables were up, I could start running queries to determine that the FKs validated correctly.

What we conclude

As I said at the beginning, the most important conclusion is backups are really important; while you may believe yourself protected, such as having a UPS to prevent power failure problems, those can still fail.

However there is also something to be said for filesystems, ext3 in this case. I can excuse a filesystem from losing the name of a file I just created, or maybe a page or three of contents of an old file. I cannot find any reason for a filesystem to lose a large bunch of files that have existed for months. Something went really wrong with it. I was told on twitter that ext3 is known for losing files. Also, its journal does not even have a CRC: if it gets corrupt, the replay algorithm may be told to write any random garbage anywhere on the filesystem. This cannot be sane. I guess it's time to start looking elsewhere.

I'm also having second thoughts about this idea that Postgres data files are not self-identifying. It would have helped my job immensely to have a simple header that told me what table OID the file belonged to (it should also carry the segment number, though this wasn't important to me because all tables were under 1 GB.) This is something I shall propose to the community for the 9.2 release.

Thanks for reading this far

The procedure I described here is not exactly what I did; rather, its a summary of what I found was the most productive way to go about it, after fiddling with some other alternatives.

I hope this will be useful to someone trying to figure these things out. At the same time, I really hope no one sees him or herself dealing with such an unprotected breakdown of the database.