Fixing foreign key deadlocks

I've been commissioned to work on foreign keys. More specifically, on the problem that when foreign keys are verified, they sometimes obtain locks that are stronger than really needed. This causes some operations to block unnecessarily and, perhaps more aggravating, some other operations to deadlock.

This problem has been known for a very long time, and it affects many users to varying degrees. The most recent detailed discussion about this problem took place on August 2010 on pgsql-hackers.

To recapitulate on this problem a bit: in the aboriginal code, foreign key checks obtained FOR UPDATE locks on referenced tuples, meaning that they were exclusively locked for the duration of the transaction doing the check. This was so strong a lock that it had a severe impact to the performance of applications that expected to concurrently access and modify tables with foreign key relationships. Consequently, many people used to drop their foreign keys just to get a reasonable concurrency level.

We partly fixed this by introducing SELECT FOR SHARE in 8.1, which allowed checks to be run concurrently. This had an enormous positive impact to concurrency, so people then began to use foreign keys more extensively.

But when you start raising the load level, at some point another problem becomes apparent: the locks taken are a stronger than strictly necessary, causing pauses and sometimes deadlocks.

Joel Jacobson of Glue Finance illustrated it with an example in the post referenced above, which can be seen in action in this screencast. His test case was this:

DROP TABLE B;
DROP TABLE A;

CREATE TABLE A (
AID integer not null,
Col1 integer,
PRIMARY KEY (AID)
);


CREATE TABLE B (
BID integer not null,
AID integer not null,
Col2 integer,
PRIMARY KEY (BID),
FOREIGN KEY (AID) REFERENCES A(AID)
);

INSERT INTO A (AID) VALUES (1);
INSERT INTO B (BID,AID) VALUES (2,1);

Process 1:                             Process 2:
BEGIN;
                                       BEGIN;
UPDATE A SET Col1 = 1 WHERE AID = 1;
                                       UPDATE B SET Col2 = 1 WHERE BID = 2;
UPDATE B SET Col2 = 1 WHERE BID = 2;
                                       UPDATE B SET Col2 = 1 WHERE BID = 2;

In Joel's example, he was getting an unexpected deadlock when session 2 updated the row the second time. Why, he asked, wasn't the process getting blocked the first time around? His initial explanation was incorrect, but the underlying reason for his problem was that that transaction was getting a shared lock on the referenced row (the one on table A), which it really didn't need except to ensure that the row didn't go away — that is, to make sure the foreign key constraint remained satisfied until it could commit.

Put simply, certain operations are blocked when there is no need for it. Consider this simple example:
Session 1:

CREATE TABLE foo (a int PRIMARY KEY, b text);
CREATE TABLE bar (a int NOT NULL REFERENCES foo);
INSERT INTO foo VALUES (42);

BEGIN;
INSERT INTO bar VALUES (42);
Session 2:
UPDATE foo SET b = 'Hello World' ;

Note that session 2 is now blocked. And the reason it's blocked is pretty simple: session 1 is holding a shared lock on the row in table foo, and the UPDATE in session 2 wants to acquire an exclusive lock to be able to modify it.

The pgsql-hackers discussion contained some very useful ideas on how to attack this problem, which is what I intend to do. If you've been affected by this problem and would like to discuss a solution, please let me know in a comment below. I'll be explaining my proposal in a forthcoming article.