Command Prompt Blog Feed: Alvaro Herrera's Blog
http://www.commandprompt.com/blogs/alvaro_herrera/

... and that was my last day
Thu, 02 Aug 2012 09:12:42 -0700
http://www.commandprompt.com/blogs/alvaro_herrera/2012/08/_and_that_was_my_last_day/

<p>This has been cooking for a while now, and it's time to open it up: July 31st, 2012 was my last day with Command Prompt, Inc.</p>

<p>I joined Command Prompt in October 2005. Back then I wasn't a very prolific blogger, it seems, because it took me six months to <a href="http://advogato.org/person/alvherre/diary/6.html">get this fact out</a>. I haven't improved much since then, even though boss Josh Drake kept telling me to publish my thoughts and ideas on various PostgreSQL-related matters.</p>

<p>During my time with them I had the opportunity to work on many interesting things. I got a number of patches into PostgreSQL, some of them thanks to Command Prompt customers sponsoring the work; I worked on the Command Prompt proprietary PostgreSQL fork, Mammoth Replicator, and on PL/php; and I got to talk to very smart people, learn a lot, and generally have tons of fun.</p>

<p>I enjoyed my time with Command Prompt very much; the colleagues I leave behind are very capable and knowledgeable. I particularly have to thank Alexey Klyukin, with whom I shared thousands of hours of work on all kinds of projects.</p>

<p>But I decided I needed a change, so I'm leaving Command Prompt. I had some job offers at PGCon, and I have already decided what I'm going to do; I look forward to many more years of PostgreSQL hacking and bug-hunting.</p>

<p>Command Prompt has been growing in business and number of employees lately; I hope that trend continues and that they have great success. I heartily thank Command Prompt for the great opportunity they gave me all these years.</p>

<p>See you in pgsql-hackers.</p>

Business PostgreSQL

The Write&ndash;Ahead Log
Thu, 19 Jul 2012 21:38:53 -0700
http://www.commandprompt.com/blogs/alvaro_herrera/2012/07/the_writendashahead_log/

<p>WAL (an acronym for Write&ndash;Ahead Log) is the mechanism that Postgres uses to implement durability of data changes in the face of abrupt system crashes.</p>

<p>The WAL system's central concept is that changes to data files must be written to those files only after the changes are guaranteed to have been logged to WAL, that is, when WAL records have been flushed to permanent storage. (This flushing is accomplished by the fsync system call or an equivalent mechanism.) If we follow this procedure, we need not flush data pages to disk on every transaction commit, because we know that in the event of a crash we will be able to recover the database using WAL &mdash; any changes that Postgres determines not to have been written will be redone from the WAL records.</p>
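<p>(A small aside, not part of the original post: you can watch the WAL insert position advance as changes get logged. A minimal sketch, using the function name as it existed in the 8.3&ndash;9.1 releases current at the time, and a throwaway table:)</p>

<pre>
CREATE TABLE wal_demo (id int);
SELECT pg_current_xlog_insert_location();  -- note the position
INSERT INTO wal_demo VALUES (1);
SELECT pg_current_xlog_insert_location();  -- the position has advanced
</pre>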
<p>One of the major benefits of using WAL is a significantly reduced number of disk flushes, since only the log file needs to be flushed to disk at transaction commit; in multiuser environments, commits of many transactions may be accomplished with a single flush of the transaction log file. Furthermore, the log file is written sequentially, so the cost of flushing it is much lower than the cost of flushing the data pages, especially when many small transactions are working on different parts of the database.</p>

<h1>Checkpoints</h1>

<p>Put simply, a checkpoint is a point in the transaction sequence at which all data files (where tables and indexes reside) are guaranteed to have been written and flushed to disk, reflecting the data modified in memory by regular operation. All those changes have previously been written and flushed to WAL.</p>

<p>There are two configuration parameters that determine the interval between checkpoints. The checkpoint_segments parameter determines the number of transaction log segments (16 MB apiece) that are allowed to fill up before a checkpoint is forced. Similarly, at most checkpoint_timeout seconds are allowed to elapse between one checkpoint and the next, regardless of the number of segments in use.</p>

<p>The downside of having too much time/data between checkpoints is that in case of a crash, the time it takes to recover from the transaction logs is proportional to the number of segments to replay. Also, the space used by the pg_xlog/ subdirectory grows.</p>

<h1>Spread checkpoints</h1>

<p>Starting in Postgres 8.3, checkpoints are executed slowly: instead of trying to write and flush pages as quickly as possible, an attempt is made to spread the writes out so that the checkpoint itself does not consume all the available I/O bandwidth. Regular operation can, of course, continue while the checkpoint is in progress. Users are happy with this behavior, because query latency no longer spikes during checkpoints; the checkpoint leaves more I/O bandwidth available for regular queries. Such latency spikes were quite noticeable in previous releases.</p>

<p>How much to spread checkpoints is configurable with the checkpoint_completion_target parameter; this value is a fraction of the checkpoint interval (in either number of segments or the timeout setting, whichever comes first). For example, the default value of 0.5 means that the writes take half of the checkpoint interval.</p>
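<p>(An aside not in the original post: the checkpoint-related settings currently in effect can be inspected from SQL:)</p>

<pre>
SELECT name, setting, unit
  FROM pg_settings
 WHERE name LIKE 'checkpoint%';
</pre>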
<h1>Full page writes</h1>

<p>Another consideration is that of &ldquo;full page writes.&rdquo; For the first change to a data file page (each 8 kB unit) after a checkpoint, a full image of the page must be written to WAL. So the first change to a page after a checkpoint should be considered very expensive in terms of WAL space, while subsequent changes before the next checkpoint occupy comparatively little space in the log file.</p>

<p>This has an important consequence for checkpoint configuration: too small an interval between checkpoints can mean that full page writes consume almost all the available WAL space before the next checkpoint, forcing a new checkpoint and thus another round of full page images. Increasing the checkpoint interval can therefore significantly reduce WAL traffic.</p>

<p>Full page writes can be turned off in postgresql.conf, but this is strongly discouraged; the circumstances in which it is safe to do so are extremely limited, and not completely understood.</p>

<h1>References</h1>

<p>All of the information in this article can also be found in the official Postgres documentation: <a href="http://www.postgresql.org/docs/current/static/wal-configuration.html">http://www.postgresql.org/docs/current/static/wal-configuration.html</a> and links therein.</p>

PostgreSQL SQL

Another day, another recovery
Tue, 14 Feb 2012 15:01:20 -0700
http://www.commandprompt.com/blogs/alvaro_herrera/2012/02/another_day_another_recovery/

<p>This is something I have seen many times now: a customer calls us because they lost some data, and they want help recovering it.</p>

<p>Now you must be wondering: surely if they lost data they can just recover from their last backup, right? Right &mdash; they had one. However, we know that pg_dump takes a while to run and is stressful on the server, so it's normally run just once a day or so. What happens if you've been running almost a full work day since your last backup? That's a lot of data to lose.</p>

PostgreSQL SQL

Decoding infomasks
Wed, 30 Nov 2011 13:20:49 -0700
http://www.commandprompt.com/blogs/alvaro_herrera/2011/11/decoding_infomasks/

<p>Come on, admit it: you've always wanted to display the infomask bits from a tuple header in a human-readable manner, but you've never gotten around to it, and you still keep htup.h on display while you poke around tuples.</p>

<p>Fortunately, that time is now past! Here's a short and simple recipe to decode the bits for your reading pleasure. Gone is the htup.h cheat sheet. Here's what you need:</p>
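<p>(The feed truncates the recipe at this point. As an illustrative stand-in, not necessarily the recipe from the full post, here is a sketch using the pageinspect contrib module and a few infomask bits taken from htup.h; "some_table" is a placeholder:)</p>

<pre>
SELECT lp, t_xmin, t_xmax, t_infomask,
       (t_infomask & x'0100'::int) <> 0 AS xmin_committed,  -- HEAP_XMIN_COMMITTED
       (t_infomask & x'0200'::int) <> 0 AS xmin_invalid,    -- HEAP_XMIN_INVALID
       (t_infomask & x'0400'::int) <> 0 AS xmax_committed,  -- HEAP_XMAX_COMMITTED
       (t_infomask & x'0800'::int) <> 0 AS xmax_invalid     -- HEAP_XMAX_INVALID
  FROM heap_page_items(get_raw_page('some_table', 0));
</pre>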
PostgreSQL SQL

Fixing foreign key deadlocks, part three
Tue, 09 Aug 2011 09:00:36 -0700
http://www.commandprompt.com/blogs/alvaro_herrera/2011/08/fixing_foreign_key_deadlocks_part_three/

<p>As I have posted in previous articles (<a href="http://www.commandprompt.com/blogs/alvaro_herrera/2010/11/fixing_foreign_key_deadlocks/">Fixing foreign key deadlocks</a> and <a href="http://www.commandprompt.com/blogs/alvaro_herrera/2010/11/fixing_foreign_key_deadlocks_part_2/">Part 2</a>), I am working on reducing the lock strength required by foreign key checks. I have <a href="http://archives.postgresql.org/message-id/1311807810-sup-1055@alvh.no-ip.org">a working patch</a> that already solves a lot of the problems; however, it doesn't solve the one problem that I initially set out to fix. That turned out to require a bit more rejiggering than I had anticipated.</p>

<p>Note: this article assumes that you know what I have already done in the patch I posted. If you want to follow along, I suggest you read the two articles linked above as an introduction.</p>

PostgreSQL SQL

PGCon 2011 Developer's Meeting picture
Wed, 18 May 2011 22:05:10 -0700
http://www.commandprompt.com/blogs/alvaro_herrera/2011/05/pgcon_2011_developers_meeting_picture/

<p>For those who are curious (and who for some reason don't follow the #pgcon tag on Twitter), here's a picture of the attendees of the Developer's Meeting.</p>

<p align="center"><a href="http://www.flickr.com/photos/alvherre/5735413679/" title="Point the missiles here by alvherre, on Flickr"><img src="http://farm4.static.flickr.com/3444/5735413679_d0e805e244.jpg" width="500" height="143" alt="Point the missiles here"></a></p>

<p>Back row, from left to right: Robert Haas, Selena Deckelmann, Marko Kreen, KaiGai Kohei, Stephen Frost, Magnus Hagander, Robert Treat, Tom Lane, Heikki Linnakangas, Mark Wong, Josh Berkus, Kevin Grittner, Dimitri Fontaine, Koichi Suzuki, Andrew Dunstan, Fujii Masao, Jeff Davis, Greg Smith, Tatsuo Ishii, Dave Page, Simon Riggs.</p>

<p>Front row: Greg Stark, David Wheeler, David Fetter, Bruce Momjian, Teodor Sigaev.</p>

<p>If you want to see a larger version, click on the picture.</p>

An attempt at finding glaring btree problems
Fri, 29 Apr 2011 12:15:39 -0700
http://www.commandprompt.com/blogs/alvaro_herrera/2011/04/an_attempt_at_finding_glaring_btree_problems/

<p>Some time ago, a customer came to us with a strange vacuuming problem. Their regular vacuum job would die with a message such as this one:</p>

<pre>
vacuumdb: vacuuming of database "clint_app" failed: ERROR: failed to re-find parent key in index "work_items_pkey" for deletion target page 6100
</pre>

<p>Eventually, it turned out that their storage firmware had a glitch that caused things to go wrong in random ways, and corruption was detected in various places.</p>

<p>However, before this was discovered, many other errors were found and reported. After a lot of back and forth, we decided to write a simple tool to verify the data contained in btree indexes. This tool scans the index structure and traverses the tree, reporting places where the nodes and leaves don't match expectations.</p>

<p>I've mentioned this tool at various times on mailing lists and such, and given it to a few customers. It has proven useful for determining whether a given problem is some sort of hardware fault causing widespread corruption, or something localized.</p>
<p>I've now published the code on <a href="https://github.com/alvherre/pgbtreecheck">GitHub</a>.</p>

<p>This is very rough around the edges, and there are more checks that could be written, given sufficient interest.</p>

<p>If you find it useful, please let me know in a comment.</p>

This is what table bloat looks like
Tue, 12 Apr 2011 13:01:49 -0700
http://www.commandprompt.com/blogs/alvaro_herrera/2011/04/this_is_what_table_bloat_looks_like/

<p>I got curious about a bloat problem on a customer's system today. The statistics as calculated by the usual tools/queries say that one of the tables is 2x bloated. Another table is 6x bloated. And so on. For some reason I wanted to see what this looked like graphically, so I threw together this simple query:</p>

<pre>
select s, coalesce(count, 0)
from (select block, count(*)
      from (select split_part(tid, ',', 1)::int as block,
                   split_part(tid, ',', 2)::int as offset
            from (select trim(both '()' from textin(tidout(ctid))) as tid
                  from flight_details
                 ) a
           ) b
      group by block) c
right join (select s from generate_series(0, 28000) s) d on (c.block = d.s)
order by s;
</pre>

<p>(This is an 8.2 system; on newer servers you can simplify the inner query a bit; see the sketch at the end of this post.)</p>

<p>After setting appropriate parameters in psql (<tt>\pset format unaligned</tt>, <tt>\pset fieldsep ' '</tt> and <tt>\o /tmp/population.data</tt>), I fed the output to Gnuplot using this simple script:</p>

<pre>
set terminal jpeg size 10000,600
set output "population.jpg"
plot "/tmp/population.data" using 1:2 with points
</pre>

<p>Zoomed out, it looks like this:</p>

<p><a href="http://www.flickr.com/photos/alvherre/5614332144/" title="table population"><img src="http://farm6.static.flickr.com/5103/5614332144_d8d9367ab7.jpg" width="500" height="148" alt="population"></a></p>

<p>This plot represents the number of tuples in each page. The plateau at the left is a very densely populated group of pages &mdash; this is optimal space usage. Then in the middle you can see a cloud that sits closer to the bottom. Finally, the flat line at the far right represents block numbers beyond the end of the table, which has fewer than the 28000 pages the query generates. (The query could be refined to avoid this tail.)</p>

<p>Ideally you should have a bit over 10% of dead space on each page on average, if autovacuum has default parameters. In this case, there is clearly a problem after the first sixth of the table: the dots are too low. This indicates bloat in those pages.</p>
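<p>(As mentioned above, a sketch of the simpler inner query: on 8.3 and later, the textin(tidout(...)) dance can become a plain cast, since explicit I/O-conversion casts are allowed there. flight_details is the customer's table from the query above:)</p>

<pre>
select split_part(tid, ',', 1)::int as block,
       split_part(tid, ',', 2)::int as offset
from (select trim(both '()' from ctid::text) as tid
      from flight_details) a;
</pre>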
PostgreSQL SQL

Recovering a lost-and-found database
Fri, 18 Mar 2011 11:20:44 -0700
http://www.commandprompt.com/blogs/alvaro_herrera/2011/03/recovering_a_lost-and-found_database/

<p>Last week, a company's only PostgreSQL database server suffered a UPS failure. When they found they couldn't connect to it afterwards, they called <a href="http://www.commandprompt.com/">Command Prompt</a> to investigate.</p>

<p>The most important lesson to be learned here is: <strong>you need to make very sure you <a href="http://www.commandprompt.com/blogs/joshua_drake/2010/07/a_better_backup_with_postgresql_using_pg_dump/">have backups</a></strong>. Even if the strategy is too simple, or even if they are taken only once a fortnight, they are going to save your neck someday just by being there. In time you will find ways to improve your backups: make them more selective, more frequent, less intrusive, whatever. Not having any backup <b>at all</b> means that if you lose your database, you may be out of business.</p>

<p>This could very well have been the case here, but we worked very hard to make sure it didn't happen to them. Here's the story.</p>

PostgreSQL

Grant schema usage to 2500 users? No can do!
Tue, 01 Mar 2011 12:11:02 -0700
http://www.commandprompt.com/blogs/alvaro_herrera/2011/03/grant_schema_usage_to_2500_users_no_can_do/

<p>It all started with a help request from someone on IRC: he has about 2500 roles, all of which have been granted USAGE privileges on a schema. He went on to create a bunch more and grant the same privilege, but it failed:</p>

<pre>
CREATE ROLE some_user_2501;
GRANT USAGE on schema genetics to some_user_2501;
ERROR: row is too big: size 8168, maximum size 8164
</pre>

<p>Oops.</p>

<p>Now, let this be said: this is bad design. It causes slowness, due to the way those privileges are stored. A much better approach is to create a single role, grant the privilege to that role, and then grant that role to all the other roles. So:</p>

<pre>
CREATE ROLE genetic_reader;
GRANT USAGE ON schema genetics TO genetic_reader;
GRANT genetic_reader TO some_user_0001, some_user_0002, ...;
</pre>

<p>That said, there are always people who want to do things their own way, and for them this answer isn't enough. They want to know how to make their original GRANT statements work. Here's how.</p>

<p>The reason it fails, as the message says, is that the row is too big. Supposedly, we have solved this problem in Postgres with TOAST tables: when attributes get too large, they are compressed and stored in chunks in the TOAST table. This works fine ... as long as there <b>is</b> a toast table to start with. It turns out that not all system catalogs have one.</p>
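<p>(An aside not in the original post: you can watch the offending row approach the 8 kB limit as grants pile up. pg_column_size has been available since 8.1, and "genetics" is the schema from the example above:)</p>

<pre>
SELECT nspname, pg_column_size(nspacl) AS acl_bytes
  FROM pg_namespace
 WHERE nspname = 'genetics';
</pre>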
<p>This is the list of system catalogs with ACL columns in them, together with the OID of each one's toast table. If the OID is zero, the catalog has no toast table, and it will therefore cause failures if someone tries to grant privileges to umpteen users.</p>

<pre>
select relname, reltoastrelid
from pg_class
where oid in (select attrelid
              from pg_attribute
              where (attname like '%acl' and atttypid = 'aclitem[]'::regtype)
                 or (attname like '%options' and atttypid = 'text[]'::regtype))
  and relkind = 'r';
</pre>

<table border="0">
<tr> <th align="center">relname</th> <th align="center">reltoastrelid</th> </tr>
<tr valign="top"> <td align="left">pg_attribute</td> <td align="right">0</td> </tr>
<tr valign="top"> <td align="left">pg_default_acl</td> <td align="right">0</td> </tr>
<tr valign="top"> <td align="left">pg_largeobject_metadata</td> <td align="right">0</td> </tr>
<tr valign="top"> <td align="left">pg_pltemplate</td> <td align="right">0</td> </tr>
<tr valign="top"> <td align="left">pg_database</td> <td align="right">2844</td> </tr>
<tr valign="top"> <td align="left">pg_tablespace</td> <td align="right">0</td> </tr>
<tr valign="top"> <td align="left">pg_class</td> <td align="right">0</td> </tr>
<tr valign="top"> <td align="left">pg_proc</td> <td align="right">2836</td> </tr>
<tr valign="top"> <td align="left">pg_foreign_data_wrapper</td> <td align="right">0</td> </tr>
<tr valign="top"> <td align="left">pg_namespace</td> <td align="right">0</td> </tr>
<tr valign="top"> <td align="left">pg_foreign_server</td> <td align="right">0</td> </tr>
<tr valign="top"> <td align="left">pg_user_mapping</td> <td align="right">0</td> </tr>
<tr valign="top"> <td align="left">pg_language</td> <td align="right">0</td> </tr>
</table>

<p>(13 rows)</p>

<p>(I have also included "options" columns, because those could also overflow; but the set of possible options is small, so this is unlikely to be an issue in practice.)</p>

<p>Notice that of these, only the <tt>pg_proc</tt> and <tt>pg_database</tt> catalogs have toast tables. Having one for <tt>pg_proc</tt> is understandable: that's the catalog where function source code is stored, and that tends to get large frequently. But <tt>pg_database</tt>? The only explanation is that someone got bitten by the limitation when granting CONNECT privileges to a large number of roles. It follows that all the remaining system catalogs should be modified in this way, too.</p>

<p>To work around this limitation, you can manually create a TOAST table for the system catalog. To do this, you need to start the server in a special mode that lets you modify the system catalogs:</p>

<pre>
postgres -O
</pre>

<p>Connect to it, and do something like this:</p>

<pre>
ALTER TABLE pg_namespace ADD COLUMN foo text;
ALTER TABLE pg_namespace DROP COLUMN foo;
</pre>

<p>Then stop the server and restart it normally.</p>
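<p>(A quick check, not in the original post, to confirm the trick worked once the server is back up:)</p>

<pre>
SELECT relname, reltoastrelid
  FROM pg_class
 WHERE relname = 'pg_namespace';
-- reltoastrelid should now be nonzero
</pre>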
<p>Now pg_namespace, the system catalog where schema permissions are stored, has a TOAST table, and you can issue all those thousands of <tt>GRANT .. ON SCHEMA</tt> statements you've always wished for (yeah, right).</p>

<p>While the above is not a recommended procedure, I wonder if we should go ahead and fix the problem properly by having the system create toast tables on those system catalogs automatically.</p>

<p>What do you think?</p>

PostgreSQL

Fixing foreign key deadlocks submitted
Thu, 20 Jan 2011 13:54:54 -0700
http://www.commandprompt.com/blogs/alvaro_herrera/2011/01/fixing_foreign_key_deadlocks__submitted/

<p>A week ago, <a href="http://archives.postgresql.org/message-id/1294953201-sup-2099@alvh.no-ip.org">I submitted my patch</a> to fix the foreign key lock problem. What I propose is almost exactly what was described in my previous blog posts, with a couple of differences.</p>

<p>The first is that I decided to check only unique indexes, not all indexes. This is good because you can have other indexes for performance, and they will not affect locking for foreign keys. I noticed that it could be done easily without a performance impact, and it had been requested by at least two people independently.</p>

<p>(For the hardcore hackers among you: the way I did this was by adding a second bitmapset to the relcache RelationData struct, listing the columns that are part of unique indexes. This second bitmapset is computed in parallel with the one used for HOT checks, and both are cached in the relcache entry at the same time; the uniqueness information was already available there, so no extra catalog lookups are needed.)</p>

<p>The other difference, as I commented in <a href="http://www.commandprompt.com/blogs/alvaro_herrera/2010/11/fixing_foreign_key_deadlocks_part_2__revisited/">a previous post</a>, is that <tt>FOR KEY LOCK</tt> does not conflict with <tt>FOR SHARE</tt>. This shouldn't be too problematic, although there is one catch: if a tuple is share-locked and then there's a constant stream of key-lockers, someone trying to update the tuple might be locked out for a long time, even after the share-locker goes away.</p>

<p>Now the patch is waiting for review in <a href="https://commitfest.postgresql.org/action/commitfest_view?id=9">the commitfest</a>. If you have a pet test case that could be affected by this patch, please try it out and let me know your thoughts and/or results.</p>

PostgreSQL SQL

Automatically pulling into a committer's Git repository
Fri, 26 Nov 2010 06:08:24 -0700
http://www.commandprompt.com/blogs/alvaro_herrera/2010/11/automatically_pulling_into_a_committers_git_repository/

<p>I was a bit unhappy because I couldn't keep my "bare" Git repository up to date unattended &mdash; I needed to be at my workstation to be able to do a <tt>git fetch</tt>, because it needs my SSH passphrase.</p>

<p>I didn't have this problem with CVS, because I kept an rsync copy of the anonymous CVS repository, from which my regular trees were checked out. (My committer's checkouts were separate, which was annoying, but I considered that problem solved with the jump to Git.)</p>

<p>Yesterday I had an epiphany: this could be solved very easily by adding a new remote pointing at the anonymous repository, which doesn't require any SSH key and so can be fetched unattended.</p>
<p>This sounds quite trivial, but I couldn't make it work at first, for reasons that appeared quite obscure; and indeed they were :-)</p>

<p>The full solution looks like this:</p>

<pre>
git remote add anon-origin git://git.postgresql.org/git/postgresql.git
</pre>

<p>This adds the new remote, from which you can "git fetch anon-origin"; but while that will pick up the objects, it won't update the branches. To make it update the branches, you have to fetch into each branch explicitly:</p>

<pre>
git fetch anon-origin \
    master:master REL9_0_STABLE:REL9_0_STABLE \
    REL8_4_STABLE:REL8_4_STABLE REL8_3_STABLE:REL8_3_STABLE \
    REL8_2_STABLE:REL8_2_STABLE
</pre>

<p>Now I can put this in my crontab and be confident that the repository will always be reasonably up to date.</p>

<p>Why this doesn't work without the trick is beyond me, but I don't really care all that much. I'm not into Git internals enough, it seems (and I don't think I want to be anyway).</p>

<p>Of course, non-committers don't have this problem, because they can always run "git fetch" or "git pull" without worrying about being asked for a passphrase.</p>

OpenSource PostgreSQL

Fixing foreign key deadlocks, part 2 revisited
Thu, 25 Nov 2010 12:02:59 -0700
http://www.commandprompt.com/blogs/alvaro_herrera/2010/11/fixing_foreign_key_deadlocks_part_2__revisited/

<p>While trying to implement <tt>SELECT FOR KEY LOCK</tt> at the lock manager level, I stumbled across the problem that I need to represent the lock faithfully in the lock manager's terms. And since I had previously said that <tt>FOR KEY LOCK</tt> would conflict with <tt>FOR SHARE</tt>, I got into trouble &mdash; it's not easy to see which lock mode to use (if there is one at all, which I doubt).</p>

<p>So I revisited that decision: <tt>FOR KEY LOCK</tt> now does not conflict with <tt>FOR SHARE</tt>, and this allows both to use the same ShareLock mode.</p>

<p>This has two consequences:</p>
<ol>
<li>Once a tuple is locked by two or more transactions in the two different modes, there's no way to figure out which one holds which lock.</li>
<li>The <tt>HEAP_XMAX_SHARED_LOCK</tt> infomask bit needs to be carried forward in an <tt>UPDATE</tt>, just as the <tt>HEAP_XMAX_KEY_LOCK</tt> bit was already going to be.</li>
</ol>

<p>Neither of these is a problem, as far as I can see. Just a little bit different. But the basic fact that <tt>FOR SHARE</tt> and <tt>FOR KEY LOCK</tt> do not conflict could take someone by surprise.</p>

PostgreSQL SQL

Fixing foreign key deadlocks, part 2
Thu, 18 Nov 2010 16:34:51 -0700
http://www.commandprompt.com/blogs/alvaro_herrera/2010/11/fixing_foreign_key_deadlocks_part_2/

<p>In the <a href="http://www.commandprompt.com/blogs/alvaro_herrera/2010/11/fixing_foreign_key_deadlocks/">previous article</a>, I explained the problem of foreign key checks obtaining too strong a lock, and promised that we would attempt to fix it.</p>

<p>Here is my proposal:</p>
<ol>
<li>Create a new <tt>SELECT</tt> locking clause. For now, we're calling it <tt>SELECT FOR KEY LOCK</tt>.</li>
<li>This will acquire a new type of lock on the tuple, dubbed a "keylock".</li>
<li>This lock will conflict with <tt>DELETE</tt>, <tt>SELECT FOR UPDATE</tt>, and <tt>SELECT FOR SHARE</tt>.</li>
<li>It also conflicts with <tt>UPDATE</tt> if the <tt>UPDATE</tt> modifies an indexed attribute.</li>
</ol>

<p>That's the gist of it. The end effect is that you are allowed to UPDATE a tuple that's being used in a foreign key check, as long as you don't change any indexed columns.</p>
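<p>(To make that concrete, here is a hypothetical two-session script showing the behavior the proposal aims for; this does not work on any released server, and the table names are made up:)</p>

<pre>
-- session 1
CREATE TABLE parent (id int PRIMARY KEY, name text);
CREATE TABLE child (id int PRIMARY KEY, parent_id int REFERENCES parent);
INSERT INTO parent VALUES (1, 'one');
BEGIN;
INSERT INTO child VALUES (10, 1);  -- the FK check key-locks parent row 1

-- session 2
UPDATE parent SET name = 'uno' WHERE id = 1;  -- touches no indexed column: proceeds
UPDATE parent SET id = 2 WHERE id = 1;        -- modifies an indexed column: blocks
</pre>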
<p>This idea was suggested by <a href="http://archives.postgresql.org/message-id/1282730008.3865.47.camel@ebony">Simon Riggs</a> in the pgsql-hackers thread referenced in my previous article, and further debugged and improved by the developers in the ensuing discussion, to a reasonably workable level &mdash; though it remains ticklish.</p>

<p>The interesting implementation details are:</p>
<ol>
<li>We need to use a new bit in t_infomask. <tt>0x0010</tt> is currently unused, so we will grab that.</li>
<li>Key-locking a tuple means setting the <tt>XMAX_KEY_LOCK</tt> bit and setting the Xmax to the locker. If the tuple is already key-locked, a MultiXactId needs to be created from the original locker(s) and the new transaction.</li>
<li>The original tuple needs to be marked with the Cmax of the locking command, to prevent it from being seen in the same transaction.</li>
<li>A non-conflicting update to the tuple must carry forward some fields from the original tuple into the updated copy. These include <tt>Xmax</tt>, <tt>XMAX_IS_MULTI</tt>, <tt>XMAX_KEY_LOCK</tt>, and the <tt>CommandId</tt> and <tt>COMBO_CID</tt> flag.</li>
</ol>

<p>If you're wondering why <tt>COMBO_CID</tt> needs to be carried forward: at first I thought it wasn't necessary, because the only transaction that might care about those bits is the one that created the tuple, so no other transaction could perform the UPDATE in question. However, if a transaction creates the tuple, modifies it in a subtransaction, aborts the subtransaction, key-locks the tuple, and finally updates it again, the last version of the tuple needs to have the correct <tt>CommandId</tt> information. This is a fairly extreme corner case, and I would be surprised to see it happen in reality. But that is no excuse for not supporting it.</p>

<p>(Offhand, I don't see any other fields that need to be carried forward, but I'm open to the possibility that I'm missing some.)</p>

<p>Note that the lock would be even more granular if, instead of checking for an attribute of any index, we checked only the particular <tt>UNIQUE</tt> index that implements the foreign key being verified. We chose not to do that for now, because it would bring excessive modularisation breakage and possibly extra locking considerations.</p>

PostgreSQL SQL

Fixing foreign key deadlocks
Thu, 11 Nov 2010 13:58:19 -0700
http://www.commandprompt.com/blogs/alvaro_herrera/2010/11/fixing_foreign_key_deadlocks/

<p>I've been commissioned to work on foreign keys. More specifically, on the problem that when foreign keys are verified, they sometimes obtain locks that are stronger than really needed. This causes some operations to block unnecessarily and, perhaps more aggravating, some other operations to deadlock.</p>

<p>This problem has been known for a very long time, and it affects many users to varying degrees.</p>
<p>The most recent detailed discussion about this problem took place in <a href="http://archives.postgresql.org/message-id/AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com">August 2010 on pgsql-hackers</a>.</p>

<p>To recapitulate a bit: in the original code, foreign key checks obtained <tt>FOR UPDATE</tt> locks on referenced tuples, meaning the tuples were exclusively locked for the duration of the transaction doing the check. This was so strong a lock that it had a severe impact on the performance of applications that expected to concurrently access and modify tables with foreign key relationships. Consequently, many people used to drop their foreign keys just to get a reasonable level of concurrency.</p>

<p>We partly fixed this by introducing <tt>SELECT FOR SHARE</tt> in 8.1, which allowed checks to run concurrently. This had an enormous positive impact on concurrency, and people then began to use foreign keys more extensively.</p>

<p>But as the load level rises, at some point another problem becomes apparent: the locks taken are stronger than strictly necessary, causing pauses and sometimes deadlocks.</p>

<p>Joel Jacobson of Glue Finance illustrated it with an example in the post referenced above, which can be seen in action in <a href="http://screencast.com/t/NTk2Y2VhMW">this screencast</a>. His test case was this:</p>

<pre>
DROP TABLE B;
DROP TABLE A;
CREATE TABLE A (
    AID integer not null,
    Col1 integer,
    PRIMARY KEY (AID)
);
CREATE TABLE B (
    BID integer not null,
    AID integer not null,
    Col2 integer,
    PRIMARY KEY (BID),
    FOREIGN KEY (AID) REFERENCES A(AID)
);
INSERT INTO A (AID) VALUES (1);
INSERT INTO B (BID,AID) VALUES (2,1);

Process 1:                              Process 2:
BEGIN;                                  BEGIN;
UPDATE A SET Col1 = 1 WHERE AID = 1;    UPDATE B SET Col2 = 1 WHERE BID = 2;
UPDATE B SET Col2 = 1 WHERE BID = 2;    UPDATE B SET Col2 = 1 WHERE BID = 2;
</pre>

<p>In Joel's example, he was getting an unexpected deadlock when session 2 updated the row the second time. Why, he asked, wasn't the process blocked the first time around? His initial explanation was incorrect, but the underlying reason for his problem was that the transaction was getting a shared lock on the referenced row (the one in table A), which it really didn't need except to ensure that the row didn't go away &mdash; that is, to make sure the foreign key constraint remained satisfied until it could commit.</p>

<p>Put simply, certain operations are blocked when there is no need for it. Consider this simple example:<br />
Session 1:</p>
<pre>
CREATE TABLE foo (a int PRIMARY KEY, b text);
CREATE TABLE bar (a int NOT NULL REFERENCES foo);
INSERT INTO foo VALUES (42);
BEGIN;
INSERT INTO bar VALUES (42);
</pre>
<p>Session 2:</p>
<pre>
UPDATE foo SET b = 'Hello World';
</pre>

<p>Note that session 2 is now blocked. And the reason it's blocked is pretty simple: session 1 is holding a shared lock on the row in table foo, and the <tt>UPDATE</tt> in session 2 wants to acquire an exclusive lock to be able to modify it.</p>
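<p>(An aside not in the original post: from a third session, you can see the blocked request in pg_locks; lock requests that are waiting show up with granted = false:)</p>

<pre>
SELECT locktype, relation::regclass, pid, mode, granted
  FROM pg_locks
 WHERE NOT granted;
</pre>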
<p>The <tt>pgsql-hackers</tt> discussion contained some very useful ideas on how to attack this problem, which is what I intend to do. If you've been affected by this problem and would like to discuss a solution, please let me know in a comment below. I'll be explaining my proposal in a forthcoming article.</p>

PostgreSQL SQL

The old Berkeley Postgres code
Tue, 09 Nov 2010 11:17:45 -0700
http://www.commandprompt.com/blogs/alvaro_herrera/2010/11/the_old_berkeley_postgres_code/

<p>Some days ago, I was reading a patch from the in-progress commitfest and happened to notice this comment in <tt>src/include/tcopprot.h</tt>:</p>

<pre>
 * This file was created so that other c files could get the two
 * function prototypes without having to include tcop.h which single
 * handedly includes the whole f*cking tree -- mer 5 Nov. 1991
</pre>

<p>The weird thing was that there is no tcop.h file in the tree. I figured it must have been removed somewhere along the long history of code cleanups and rearrangements. I was curious to see what the file looked like, so I went to the very first commit in our CVS, which corresponds to <a href="http://git.postgresql.org/gitweb?p=postgresql.git;a=tree;hb=d31084e9d1118b25fd16580d9d8c2924b5740dff">this one in Git</a>. However, it's not there either!</p>

<p>So I turned to a quick web search and found out that <a href="http://www.berkeley.edu">Berkeley</a> (or rather the University of California) still has <a href="http://db.cs.berkeley.edu/postgres.html">the old tarballs for anyone to grab</a>.</p>

<p>I eventually found the file in the <tt>postgres-v4r0</tt> tarball; and as foretold in the old comment above, it clearly does include the whole source tree. Now that the file is long gone, I think it's time to remove that comment.</p>

OpenSource PostgreSQL