A pg_basebackup Wish List

pg_basebackup was introduced in Postgres 9.1 as a simple way to copy the data directory of a running database cluster. This was a great addition to a small group of PostgreSQL Client Applications.

The pg_basebackup approach differs from the standard pg_start_backup(), rsync (or other file system copy), pg_stop_backup() approach in that it uses the replication protocol over a standard Postgres connection to make the base backup. A few highlights:

A completely standalone backup can be created easily by using the --xlog-method argument (the stream option here is particularly nice so that you don’t have to muck with the wal_keep_segments Postgres setting).
Thanks to the cascading replication machinery in 9.2 pg_basebackup can take a backup from an active hot standby server (where the basic pg_start_backup() approach fails).
As of 9.3 pg_basebackup will even create a recovery.conf file for you (with the --write-recovery-conf switch), so it’s basically a one-liner to get a standby with replication set up!

There’s a lot to like, but I’m greedy....

Wish #1: Parallelization

If “idle hands are the devil’s playthings” I can only imagine what the devil would do with idle CPUs and network bandwidth. It would be great if pg_basebackup were able to parallelize creation of the base backup and make use of more available resources. In the past I’ve written a script to parallelize rsync by first processing its --itemize-changes output and then spawning multiple rsync processes in parallel to perform the copies; this parallelization can really cut down on the time to get a base backup copied. There are times when getting something done as fast as possible trumps other concerns.

Wish #2: Compression Options

Sometimes network bandwidth is in short supply. pg_basebackup has a --gzip switch to enable compression, but this compression is actually done on the client machine rather than the server. Certainly there is the option to run pg_basebackup from the originating server and use netcat (or scp, etc.) to get the data copied to the destination if we are using pg_basebackup’s --format=tar (-Ft) option. For example, on the originating server:

nierman@source:~$ pg_basebackup -D - -Ft -x | pbzip2 -p8 | nc -l 1201

(in an actual use case I'd also insert pv in the pipeline to get nice status/progress updates at various stages), and on the destination server:

nierman@destination:~$ nc source 1201 > newbasedir/base.tar.bz2

This process is a bit more complicated than a single command from the client/destination server, but more importantly we lose out on things like streaming WAL files when using the tar format with pg_basebackup. Note that I’ve also used pbzip2 above (a parallelized version of bzip2). It would be great to have the option to use external programs like pbzip2, pigz, etc. with pg_basebackup compression. Let’s call that Wish #2b.

Wish #3: Don’t Die on Me Now!

Here’s some output from a recent pg_basebackup session:

pg_basebackup: starting background WAL receiver
2052690927/2052718105 kB (100%), 1/1 tablespace
pg_basebackup: could not get transaction log end position from server:
FATAL:  could not open file "./#postgresql.conf#": Permission denied

pb_basebackup chunked through almost 2TB of data and then it died just before finishing. Ack. Why? At some point someone had edited the postgresql.conf file as root and their editor saved the backup copy seen above (in the postgres data directory of course!); the Postgres backend couldn’t read that file (owned by root) when creating the base backup. I would have loved to see a simple error message and have pg_basebackup continue on its merry way. Certainly an initial ls -l in the data directory is a good quick check that can be done manually, although that wouldn’t save me from people that hide their cat photos deep inside the Postgres base directory.

Wish #1: Parallelization

Wish #2: Compression Options

Wish #3: Don’t Die on Me Now!

Recent blogs

Is Trust Gone? Supply Chain Attacks

What If Your Team’s Biggest Burnout Driver Lives in Your Database?

Why You Should Review Your Authentication Strategy

PgManage 1.4 – SQL Server Support, Faster Interface Navigation, Spreadsheet‑like Data Grids & More!

PostgreSQL Version 13 Reaches End of Life: Migration and Extended Support Strategies

Cost Optimization for PostgreSQL: Practical Tips for Technical Teams

PostgreSQL 18: Revolutionary Performance Boost Now Available

Why Growing Teams Are Moving from Aurora to RDS or EC2: Cost and Control Considerations on AWS

A pg_basebackup Wish List

Wish #1: Parallelization

Wish #2: Compression Options

Wish #3: Don’t Die on Me Now!

You may also like

Why Postgres? (How did I get here?)

Meet Ivan Lezhnjov: Command Prompt's Open Source Detective

Meet Eugene Dubinin: Command Prompt's Postgres Sailor

Meet Andrew Smith: Command Prompt's Australian Altruist

Meet Debra Cerda: Command Prompt's Ranged Attack Expert

A Tutorial on Logical Replication

Recent blogs