Fixing foreign key deadlocks, part three

As I have described in previous articles (Fixing foreign key deadlocks and Part 2), I am working on reducing the lock strength required by foreign key checks. I have a working patch that already solves a lot of the problems; however, it doesn't solve the one problem I initially set out to fix. That turned out to require a bit more rejiggering than I had initially anticipated.

Note: this article assumes that you know what I have already done in the patch I posted. If you want to follow along, I suggest reading the two articles linked above as an introduction.

The problem

The patch I proposed removed some of the causes of hangs between concurrent transactions, but certain deadlocking patterns remain. Let's examine why this is so. The current isolation test framework has a test (shown below) that mirrors Joel Jacobson's test case; it easily illustrates the cases left unfixed by the patch I presented.

 
setup
{
  CREATE TABLE A (
        AID integer not null,
        Col1 integer,
        PRIMARY KEY (AID)
  );
 
  CREATE TABLE B (
        BID integer not null,
        AID integer not null,
        Col2 integer,
        PRIMARY KEY (BID),
        FOREIGN KEY (AID) REFERENCES A(AID)
  );
 
  INSERT INTO A (AID) VALUES (1);
  INSERT INTO B (BID,AID) VALUES (2,1);
}
 
teardown
{
  DROP TABLE a, b;
}
 
session "s1"
setup           { BEGIN; SET deadlock_timeout = '100ms'; }
step "s1u1"     { UPDATE A SET Col1 = 1 WHERE AID = 1; }
step "s1u2"     { UPDATE B SET Col2 = 1 WHERE BID = 2; }
step "s1c"      { COMMIT; }
 
session "s2"
setup           { BEGIN; SET deadlock_timeout = '10s'; }
step "s2u1"     { UPDATE B SET Col2 = 1 WHERE BID = 2; }
step "s2u2"     { UPDATE B SET Col2 = 1 WHERE BID = 2; }
step "s2c"      { COMMIT; }

The way these tests work is by trying all possible ways of interleaving the commands of the two sessions. One of the permutations that dies with a deadlock even with the patched code is shown below. The first line lists the steps taken by the two sessions, in the order in which they are executed.
 
starting permutation: s1u1 s2u1 s1u2 s2u2 s1c s2c
step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
step s1u2: <... completed>
ERROR:  deadlock detected
step s1c:  COMMIT; 
step s2c:  COMMIT; 
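
For reference, the isolation tester can also pin down a specific interleaving with an explicit permutation line in the spec file, so a problematic ordering like this one can be replayed on its own. A minimal sketch, assuming your copy of the framework already accepts permutation lines:

permutation "s1u1" "s2u1" "s1u2" "s2u2" "s1c" "s2c"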

The locks taken with my patched code are:
 
s1u1: exclusive lock on table A
s2u1: exclusive lock on table B,
      KEY LOCK on table A (blocks due to the lock taken by s1u1)
s1u2: exclusive lock on table B (blocks due to lock taken in s2u1); 
      deadlock detected
s2u2: exclusive lock on table B, KEY LOCK on table A.

It's easy to see why it still deadlocks: when session 2 goes to acquire the KEY LOCK on the tuple in table A (step s2u1), it blocks because of the previous UPDATE. So with my new code, you can UPDATE a tuple that has a KEY LOCK on it (which you cannot do with the current code), but you can't KEY LOCK a tuple that's under UPDATE. This asymmetry is pretty weird, but it does make some sense: if the UPDATE is already in progress and were to touch the indexed fields, you would be screwed; whereas an UPDATE arriving at a KEY LOCKed tuple already knows whether it touches the indexed fields or not.
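
To make the asymmetry concrete, here is a sketch using the tables from the test above. The INSERT into B stands in for any statement whose RI check key-locks the referenced row in A; the outcomes are those claimed for the patch, not something you can reproduce with stock PostgreSQL:

-- Ordering 1: KEY LOCK first, UPDATE second.  Works with the patch.
-- session 1
BEGIN;
INSERT INTO B (BID, AID) VALUES (3, 1);  -- RI check takes a KEY LOCK on A's row (AID = 1)
-- session 2
BEGIN;
UPDATE A SET Col1 = 2 WHERE AID = 1;     -- proceeds: only the non-key column Col1 changes

-- Ordering 2: UPDATE first, KEY LOCK second.  Still blocks with the patch.
-- session 2
BEGIN;
UPDATE A SET Col1 = 2 WHERE AID = 1;     -- exclusive lock on A's row
-- session 1
BEGIN;
INSERT INTO B (BID, AID) VALUES (3, 1);  -- RI check waits: cannot KEY LOCK a tuple under UPDATE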

The solution

In discussion, Noah Misch came up with the following idea: create a new lock mode for tuples, FOR KEY UPDATE. With this new lock mode, together with FOR KEY SHARE (which replaces KEY LOCK, the single new mode my patch currently proposes), the lock conflict table for tuples would look like this:

  1. FOR KEY SHARE conflicts with FOR KEY UPDATE
  2. FOR SHARE conflicts with FOR KEY UPDATE, FOR UPDATE
  3. FOR UPDATE conflicts with FOR KEY UPDATE, FOR UPDATE, FOR SHARE
  4. FOR KEY UPDATE conflicts with FOR KEY UPDATE, FOR UPDATE, FOR SHARE, FOR KEY SHARE

In this conflict table, most things work as you would expect. FOR UPDATE is the lock mode acquired by SELECT FOR UPDATE; similarly for FOR SHARE. FOR KEY SHARE would be the lock mode acquired by the foreign key code: the lightest lock available. The new FOR KEY UPDATE mode would be acquired by UPDATE and DELETE:

  1. DELETE always takes a FOR KEY UPDATE lock on the tuple being deleted (so it behaves as it does currently: everyone else is blocked until the delete is completed)
  2. UPDATE takes a FOR KEY UPDATE lock if at least one column being updated is covered by a unique index, and a FOR UPDATE lock if not (see the sketch after this list)
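
Here is a sketch of what that buys us, again using the tables from the test above. This only illustrates the proposed behavior; neither the current code nor the patch I posted works this way yet:

-- session 1: updates a non-key column of A, so it takes only FOR UPDATE on the row
BEGIN;
UPDATE A SET Col1 = 1 WHERE AID = 1;

-- session 2: the FK check behind this statement takes FOR KEY SHARE on A's row,
-- which does not conflict with FOR UPDATE, so it no longer waits
BEGIN;
UPDATE B SET Col2 = 1 WHERE BID = 2;

-- Had session 1's UPDATE touched a column covered by a unique index, it would have
-- taken FOR KEY UPDATE instead, and session 2's FOR KEY SHARE request would have to wait.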

Why does this solve the problem?

With the new proposal, the sequence of locks acquired by the example above would look like this:

 
s1u1: FOR UPDATE in table A
s2u1: FOR UPDATE in table B,
      FOR KEY SHARE in table A
s1u2: FOR UPDATE in table B (blocks),
      FOR KEY SHARE in table A
s2u2: FOR UPDATE in table B (supposedly does not block because we already have it),
      FOR KEY SHARE in table A

So s2 completes, and then s1 is unblocked and commits.

With Noah's proposal, things get more symmetrical. This problem would be fixed by letting the FOR KEY SHARE lock be acquired even while an UPDATE is touching the tuple, as long as the UPDATE does not touch anything covered by a unique index. Obviously, if the UPDATE were to modify the AID column (the primary key), all bets would be off: there would be a deadlock. Presumably that's OK; there's nothing we can do about that case anyway.

Note that the FOR KEY UPDATE lock doesn't need to be exposed at the SQL level, because it is taken internally by DELETE and UPDATE. Only the new FOR KEY SHARE mode would be exposed, because it needs to be used by the RI triggers via the SPI interface. (If there were another way to send "row marks" down, we could avoid exposing yet another nonstandard lock mode clause.)
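
For illustration only: the check query that the RI trigger runs through SPI has roughly the shape below (schema qualification and operator details omitted), and the change amounts to swapping its locking clause:

-- today: the FK check takes a full row share lock on the referenced row
SELECT 1 FROM ONLY A x WHERE AID = $1 FOR SHARE OF x;

-- with this proposal: same query, lighter lock
SELECT 1 FROM ONLY A x WHERE AID = $1 FOR KEY SHARE OF x;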

The catch

The "small" problem with this idea is that a tuple might now be locked by multiple transactions in more than one of three modes, FOR KEY SHARE, FOR SHARE and FOR UPDATE. This presents a representational problem: how do we know which transaction holds which lock mode? There's certainly not room in the tuple header for all this info, so we need to be looking elsewhere.

To fix this, we would need to store more info in the locking MultiXact: in addition to the member Xids, we would need to store the lock type for each of them, two bits per member transaction (say 00 = FOR KEY SHARE, 01 = FOR SHARE, 10 = FOR UPDATE). This requires some thought: we could expand the current pg_multixact SLRU areas to make room for the extra info, though I think this would foreclose the possibility of using MultiXacts for other purposes (we don't currently use them for anything other than locking tuples, but who knows). The other possibility would be storing those bits in a new SLRU area, but that seems pretty painful.

Let's fix another problem in passing, too

This proposal also shows promise in fixing the problem with row locks being lost on aborted subtransactions; see the first Caution frame in the SELECT reference documentation:

Caution

Avoid locking a row and then modifying it within a later savepoint or PL/pgSQL exception block. A subsequent rollback would cause the lock to be lost. For example:

 
BEGIN;
SELECT * FROM mytable WHERE key = 1 FOR UPDATE;
SAVEPOINT s;
UPDATE mytable SET ... WHERE key = 1;
ROLLBACK TO s;

After the ROLLBACK, the row is effectively unlocked, rather than returned to its pre-savepoint state of being locked but not modified. [...]

Basically, the problem there is that the tuple locks are forgotten when a subtransaction updates the tuple; if the subtransaction aborts, the locks are not un-forgotten, because the field in the tuple header that kept track of them (the Xmax of the original tuple) has been overwritten with the Xid of the updating subtransaction.
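
You can watch this happen through the xmax system column. A sketch, assuming mytable has an integer column val to update:

BEGIN;
SELECT * FROM mytable WHERE key = 1 FOR UPDATE;
SELECT xmax FROM mytable WHERE key = 1;          -- our own Xid: that's the row lock
SAVEPOINT s;
UPDATE mytable SET val = val + 1 WHERE key = 1;  -- the subtransaction's Xid overwrites Xmax
ROLLBACK TO s;
SELECT xmax FROM mytable WHERE key = 1;          -- the aborted Xid; the original lock is gone
COMMIT;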

What we could do is create a MultiXact when this is detected: an updater wouldn't just stash its own Xid in the Xmax field, but would instead create a new MultiXact containing itself and whatever was previously in the Xmax. In order to do this, we would need to use the fourth bit pattern (11) to represent members that grab FOR KEY UPDATE on the tuple. While obviously no one else could touch that tuple in the meantime, remembering the previous state would be useful in case the updating subtransaction aborted.

Conclusion

I will be working to find the best way to implement this. The changes I describe require patching stuff in a lot more places than I initially intended. It seems to me that we would end up with a much better, more concurrent implementation if we follow this path. Stay tuned.