The Write–Ahead Log

WAL (acronym for Write–Ahead Log) is the mechanism that Postgres uses to implement durability of data changes in the face of abrupt system crashes.

The WAL system's central concept is that changes to data files must be written to such data files only after those changes are guaranteed to have been logged to WAL, that is, when WAL records have been flushed to permanent storage. (This flushing is accomplished by the fsync system call or equivalent mechanism.) If we follow this procedure, we need not flush data pages to disk on every transaction commit, because we know that in the event of a crash we will be able to recover the database using WAL — any changes that Postgres determines not to have been written will be redone from the WAL records.

One of the major benefits of using WAL is a significantly reduced number of disk flushes, since only the log file needs to be flushed to disk at the time of transaction commit; in multiuser environments, commits of many transactions may be accomplished with a single flush of the transaction log file. Furthermore, the log file is written sequentially, and so the cost of flushing it is much less than the cost of flushing the data pages, especially in instances where you have a lot of small transactions working on different parts of the database.

Checkpoints

Put simply, a checkpoint is a point in the transaction sequence at which all data files (where tables and indexes reside) are guaranteed to be written and flushed to disk to reflect the data modified in memory by regular operation. All those changes have previously been written and flushed to WAL.

There are two configuration parameters that determine the interval between checkpoints. The checkpoint_segments parameter determines the number of transaction log segments (16 MB in size apiece) that will be allowed to be filled before a checkpoint is forced. Similarly, at most checkpoint_timeout seconds will be allowed to elapse between one checkpoint and the next one, regardless of the number of segments in use.

The downside of having too much time/data between checkpoints is that in case of a crash, the time it takes to restore the data from the transaction logs is proportional to the number of segments to restore. Also, the space used in the pg_xlog/ subdirectory grows.

Spread checkpoints

Starting in Postgres 8.3, checkpoints are executed slowly, that is, instead of trying to write and flush pages as quickly as possible, an attempt is made to spread the writes so that the checkpoint itself does not consume all the available I/O bandwidth. Regular operation can, of course, continue while the checkpoint is in progress. In fact, users are happy with this delay behavior, because they observe that query time does not suffer from a latency spike during checkpoints, thanks to checkpointing leaving more I/O bandwidth available for them to use. Such latency spikes were quite noticeable in previous releases.

How much to spread checkpoints is configurable with the checkpoint_completion_target parameter; this value is a fraction of the checkpoint interval (in either number of segments or the timeout setting, whichever is sooner). For example, the default value of 0.5 means to take half the time of a checkpoint interval to do the writes.

Full page writes

Another consideration is that of “full page writes.” For the first change of a data file page (each 8kb unit) after a checkpoint, a full page image must be written to WAL. So the first change on a page after a checkpoint should be considered to be very expensive in terms of WAL space used, but subsequent ones before the next checkpoint occupy a comparatively small space in the log file.

This has one very important effect which is important to consider when configuring checkpoints: too small an interval between two checkpoints might mean that full page writes use almost all the available space in WAL before the next checkpoint, causing a new checkpoint to happen and thus a large number of full page images again. Increasing the checkpoint interval might mean a significantly reduced WAL traffic.

Full page writes can be turned off in postgresql.conf, but this is strongly discouraged; the circumstances in which it is safe to do so are extremely limited, and not completely understood.

References

All of the information found on this article can also be found in the official Postgres documentation: http://www.postgresql.org/docs/current/static/wal-configuration.html and links therein.