... and that was my last day

This has been cooking for a while now, and now it's time to open it up: July 31st, 2012 was my last day with Command Prompt, Inc.

I joined Command Prompt in October 2005. Back then I wasn't a very prolific blogger, it seems, because it took me six months to get this fact out. I haven't improved much since then, even though boss Josh Drake kept telling me to publish my thoughts and ideas on various PostgreSQL-related matters.

During my time with them I had the opportunity to work on many interesting things. I got a number of patches into PostgreSQL, some of them thanks to some Command Prompt customer sponsoring it; I worked on the ...

Read More

The Write–Ahead Log

WAL (acronym for Write–Ahead Log) is the mechanism that Postgres uses to implement durability of data changes in the face of abrupt system crashes.

The WAL system's central concept is that changes to data files must be written to such data files only after those changes are guaranteed to have been logged to WAL, that is, when WAL records have been flushed to permanent storage. (This flushing is accomplished by the fsync system call or equivalent mechanism.) If we follow this procedure, we need not flush data pages to disk on every transaction commit, because we know that in the event of a crash we will be able to recover the database using WAL — any changes that Postgres ...

Read More

Another day, another recovery

This is something I have seen many times now: a customer calls us because they lost some data and they want help recovering.

Now you must be wondering: surely if they lost data they can just recover from their last backup, right? Right — they had that. However we know that pg_dump takes a while to run and is stressful on the server, so it's normally run just once a day or so. What happens if you've been running almost a full work day since your last backup? It's a lot of data to lose.

And in turn you must be wondering: surely they can easily run a standby server with some form of WAL replication (streaming ...

Read More

Decoding infomasks

Come on, admit it: you've always wanted to display the infomask bits from a tuple header in a human-readable manner, but you've never gotten around to it and you still keep htup.h in display while you peek around tuples.

Fortunately, that time is now past! Here's a short and simple recipe to decode the bits for your reading pleasure. Gone is the htup.h cheat sheet. Here's what you need:

create type infomask_bit_desc as (mask varbit, symbol text);

create or replace function infomask(msk int, which int) returns text
language plpgsql as $$
        r infomask_bit_desc;
        str text = '';
        append_bar bool = false;
        for r in select * from infomask_bits(which) loop
                if (msk::bit(16) & r.mask)::int <> 0 then
                        if append_bar then
                                str = str || '|';
                        end ...

Read More

Fixing foreign key deadlocks, part three

As I have posted in previous articles ( Fixing foreign key deadlocks and Part 2), I am working on reducing the lock strength required by foreign key checks. I have a working patch that solves a lot of the problems already; however it doesn't solve the one problem that I initially set to fix. It turned out to require a bit more rejiggering than I initially considered.

Note: this article assumes that you know what I have already done in the patch I posted. If you want to follow through, I suggest you read the first two links above as an introduction.

The problem

The patch I proposed removed some of the reasons that create hangs on concurrent transactions, but ...

Read More

PGCon 2011 Developer's Meeting picture

For those that are curious (and which for some reason don't follow the #pgcon tag at twitter), here's a picture of the attendees of Developer's Meeting.

Point the missiles here

Back row, from left to right: Robert Haas, Selena Deckelmann, Marko Kreen, KaiGai Kohei, Stephen Frost, Magnus Hagander , Robert Treat, Tom Lane, Heikki Linnakangas, Mark Wong, Josh Berkus, Kevin Grittner, Dimitri Fontaine, Koichi Suzuki, Andrew Dunstan, Fujii Masao, Jeff Davis, Greg Smith, Tatsuo Ishii, Dave Page, Simon Riggs.

Front row: Greg Stark, David Wheeler, David Fetter, Bruce Momjian, Teodor Sigaev.

If you want to see a larger version, click on it.

Read More

An attempt at finding glaring btree problems

Some time ago, a customer came to us with a strange vacuuming problem. Their regular vacuum job would die with a message such as this one:

vacuumdb: vacuuming of database "clint_app" failed:
  ERROR: failed to re-find parent key in index "work_items_pkey" for deletion target page 6100

Eventually, it turned out that their storage firmware had some glitch that caused things to go wrong in random ways, and corruption in various places was detected.

However, before this was discovered, many other errors were found and reported. After a lot of back and forth, we decided to write a simple tool to verify the data contained in btree indexes. This tool would scan the index structure and traverse the tree, reporting places ...

Read More

This is what table bloat looks like

I got curious about a bloat problem on a customer's system today. The statistics as calculated by normal tools/queries say that one of the tables is 2x bloated. Another table is 6x bloated. And so on. For some reason I wanted to see what it looked like in a graphical way, so I threw together this simple query:

select s, coalesce(count, 0)
  from (select block, count(*)
          from (select split_part(tid, ',', 1)::int as block,
                       split_part(tid, ',', 2)::int as offset
                  from (select trim(both '()' from textin(tidout(ctid))) as tid
                          from flight_details
                       ) a
               ) b
      group by block) c
 right join (select s from generate_series(0, 28000) s) d on (c.block = d.a)
order by a ;

(This ...

Read More

Recovering a lost-and-found database

Last week, a company's only PostgreSQL database server suffered an UPS failure. When they found they couldn't connect to it afterwards, they called Command Prompt to investigate.

The most important lesson to be learned here is: you need to make very sure you have backups. Even if the strategy is too simple, or even if they are taken only once a fortnight, they are going to save your neck someday just by being there. In time you will find the way to improve your backups: make them more selective, more frequent, less intrusive, whatever. Not having any backup at all means that if you lose your database, you may be out of business.

This could have very well ...

Read More

Grant schema usage to 2500 users? No can do!

It all started with a help request from a someone on IRC: he has about 2500 roles, and all of them have been granted USAGE privileges on a schema. He went on to create a bunch more and grant the same privilege, but it failed:

CREATE ROLE some_user_2501;
GRANT USAGE on schema genetics to some_user_2501;

ERROR:  row is too big: size 8168, maximum size 8164


Now, let this be said: this is wrong design. It causes slowness, due to the way those privileges are stored. A much better way to go about this is to create a single role, grant the privileges to that role, and the grant that role to all other roles. So:

CREATE ROLE genetic_reader;

Read More

Fixing foreign key deadlocks submitted

A week ago, I submitted my patch to fix the foreign key lock problem. What I propose is almost exactly what was described in my previous blog posts, with a couple of differences.

The first is that I decided to only check unique indexes, not all indexes. This is good because you can have other indexes for performance and they will not affect locking for foreign keys. I noticed that it could be done easily without performance impact, and it had been requested by at least two independent people.

(For the hardcore hackers among you, the way I did this was by adding a second bitmapset to the relcache RelationData struct, which lists columns part of unique indexes. This second ...

Read More

Automatically pulling into a committer's Git repository

I was a bit unhappy because I couldn't keep my "bare" Git repository up-to-date unattended — I needed to be at my workstation to be able to do a git fetch, because it needs my SSH passphrase.

I didn't have this problem with CVS because I kept an rsync copy of the anonymous CVS repository, from which my regular trees where checked out. (My committer's checkouts were separate, which was annoying, but I considered that problem solved with the jump to Git.)

Yesterday I had an epiphany that this could be solved very easily: just add a new remote to the anonymous clone, which doesn't require any SSH key to be involved, and so can run ...

Read More

Fixing foreign key deadlocks, part 2 revisited

While trying to implement SELECT FOR KEY LOCK at the lock manager level, I stumbled across the problem that I need to represent the lock faithfully in the lock manager's terms. And since I previously mentioned that FOR KEY LOCK would conflict with FOR SHARE, I get in trouble — it's not easy to see which lock mode to use (if there is one, which I doubt).

So I revisited that decision: a FOR KEY LOCK does not conflict with FOR SHARE, and this allows them to use the same ShareLock mode.

This has two consequences:

  1. After the tuple is locked by two transactions or more in the two different modes, there's no way to figure out ...

Read More

Fixing foreign key deadlocks, part 2

In the previous article, I explained the problem with foreign key checks checks obtaining too strong of a lock, and promised that we would be attempting to fix it.

Here is my proposal:

  1. Create a new SELECT locking clause. For now, we're calling it SELECT FOR KEY LOCK
  2. This will acquire a new type of lock in the tuple, dubbed a "keylock".
  3. This lock will conflict with DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE.
  4. It also conflicts with UPDATE if the UPDATE modifies an indexed attribute.

That's the gist of it. The end effect is that you are allowed to UPDATE a tuple that's being used in a foreign key check, as long as you don ...

Read More

Fixing foreign key deadlocks

I've been commissioned to work on foreign keys. More specifically, on the problem that when foreign keys are verified, they sometimes obtain locks that are stronger than really needed. This causes some operations to block unnecessarily and, perhaps more aggravating, some other operations to deadlock.

This problem has been known for a very long time, and it affects many users to varying degrees. The most recent detailed discussion about this problem took place on August 2010 on pgsql-hackers.

To recapitulate on this problem a bit: in the aboriginal code, foreign key checks obtained FOR UPDATE locks on referenced tuples, meaning that they were exclusively locked for the duration of the transaction doing the check. This was so strong a ...

Read More

The old Berkeley Postgres code

Some days ago, I was reading some patch from the in-progress commitfest, and happened to notice this comment in src/include/tcopprot.h:

 *    This file was created so that other c files could get the two
 *    function prototypes without having to include tcop.h which single
 *    handedly includes the whole f*cking tree -- mer 5 Nov. 1991

The weird thing about this was that there's no tcop.h file on the tree. I thought that it must have been removed somewhere along the long history of code cleanups and rearrangements. I was curious to see what this file looked like, so I went to the very first commit in our CVS, which turns out to be this one in ...

Read More