In a database production environment, backups play an essential role. A database server can fail for a number of reasons: hardware failure, software malfunction, or simply user error. Whatever the reason, when a live database goes down, a backup is essential for recovering it.
In a database system, the data is stored in binary files. Every database provider offers some kind of tooling with which these files can be backed up, and the PostgreSQL database server is no exception: it provides a comprehensive set of backup tools.
Logical and Physical Backups
In general, backups fall into two categories: logical and physical. Logical backups are human-readable; a user can see and understand the content of the backup files, e.g. SQL statements.

Physical backups, on the other hand, consist of binary data; they are generally file-system-level copies of the database files. Both options have their pros and cons. In this blog, we'll look at how to use each of them and weigh their trade-offs.
Logical Backups – SQL Dumps
SQL dumps are logical backups of your database. This type of backup creates a text file containing SQL statements which, when fed back to the database system, recreate the database in the same state it was in when the backup was taken.
In PostgreSQL, we have pg_dump and pg_dumpall tools available to take this type of backup, while we have pg_restore to recreate the database from the backup.
pg_dump and pg_dumpall
A PostgreSQL database system can contain multiple databases; pg_dump takes a backup of a single database at a time, while pg_dumpall backs up all databases at once.
The caveat with pg_dumpall is that it will only take the backup in plain format; you cannot use the compression or custom formats that are available with pg_dump. For example, if you wish to keep the backup as a tarball, you can do that with pg_dump but not with pg_dumpall.
Additionally, in PostgreSQL some objects, such as roles and tablespaces, are global. pg_dump does not back these up, while pg_dumpall does. So when taking a backup with pg_dump, the first step should be to take a backup of these globals.
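Putting the two steps together, a sketch of this workflow might look like the following (user, database, and file names are placeholders; pg_dumpall's --globals-only switch dumps only roles and tablespaces):

```shell
# Step 1: dump only the global objects (roles, tablespaces) once per cluster.
pg_dumpall -U postgres --globals-only -f globals.sql

# Step 2: dump each database individually; the custom format (-Fc) allows
# compression and selective restore with pg_restore.
pg_dump -U postgres -Fc -f mydb.dump mydb
```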
pg_dump offers many more options than pg_dumpall. Some of the important ones are:
- Multiple jobs, for dumping in parallel
- Multiple output formats: plain, custom, directory, and tar
- Compression
# let's take a backup in plain format.
pg_dump -U $USER_NAME -f $BACKUP_FILE.sql $DATABASE_NAME

# or simply redirect the output to a file.
pg_dump -U $USER_NAME $DATABASE_NAME > $BACKUP_FILE.sql

# backup with multiple workers (note that this option is only
# available for the 'directory' format).
pg_dump -U $USER_NAME --jobs $NO_JOBS -Fd -f $BACKUP_DIR $DATABASE_NAME

# backup with compression enabled.
pg_dump -U $USER_NAME --jobs $NO_JOBS -Fd --compress $COMP_LVL -f $BACKUP_DIR $DATABASE_NAME
The restoration of the backup consists of the following steps.
# if the backup is in plain format.
psql -f $BACKUP_FILE.sql -d $DATABASE_NAME

# if the backup is in custom, tar or directory format.
pg_restore -U $USER_NAME -d $DATABASE_NAME -Fd $BACKUP_DIR
File System Level Backups
In this category, we copy all the files that the PostgreSQL server uses to store data in the database. However, simply copying the files may not be enough: while we are copying them, the database server may be writing to them at the same time, so the copy we end up with may be inconsistent and unusable.
PostgreSQL provides us with two options to handle this situation and to ensure that the backup we take is consistent.
Manual Base Backups
PostgreSQL provides two low-level APIs, pg_start_backup and pg_stop_backup, for taking a consistent backup. We can take a consistent file-system-level backup by following these steps:
1. Enable the WAL archiving
To enable the WAL archiving, set the following parameters in the postgresql.conf file.
wal_level = replica
archive_mode = on
archive_command = 'cp %p $PATH_TO_ARCH_DIR/%f'
The %p is replaced with the pathname of the file to be archived and %f is replaced with the filename. You can put any of the shell commands that can copy the file to the desired destination, including ssh/scp or other variants.
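A common refinement is an archive_command that refuses to overwrite an existing file, so a misconfiguration cannot silently clobber archived segments. The sketch below shows the guard itself, exercised on ordinary files (all paths here are illustrative, not a real archive):

```shell
# In postgresql.conf this guard would look like:
#   archive_command = 'test ! -f $PATH_TO_ARCH_DIR/%f && cp %p $PATH_TO_ARCH_DIR/%f'
# Demonstrated on ordinary files:
ARCH=$(mktemp -d)
echo "segment-contents" > wal_segment

# First attempt: the destination does not exist, so the copy happens.
test ! -f "$ARCH/wal_segment" && cp wal_segment "$ARCH/wal_segment"

# Second attempt: the guard fails, cp never runs, and the archived
# copy is never overwritten.
test ! -f "$ARCH/wal_segment" && cp wal_segment "$ARCH/wal_segment" || echo "already archived"
```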
2. Start the backup

Connect to the database as a privileged user and execute pg_start_backup:
SELECT pg_start_backup('label', false, false);
This will put the system into backup mode. The connection used for executing this function must be maintained for the duration of the backup; if it is aborted, the backup becomes unusable.
3. Backup database files
Once the above function has been executed, you can start copying the database server's files. Any method can be used to copy the files, such as rsync or tar.
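As an illustration of this copy step, the sketch below mirrors a data directory with tar, excluding postmaster.pid and the live WAL segments (the required WAL files will come from the archive in a later step). The directory layout here is a mock-up; in practice PGDATA would point at your real data directory:

```shell
# Mock data directory standing in for a real $PGDATA (illustrative only).
PGDATA=$(mktemp -d)
BACKUP_DIR=$(mktemp -d)
mkdir -p "$PGDATA/base" "$PGDATA/pg_wal"
echo "pid"  > "$PGDATA/postmaster.pid"
echo "data" > "$PGDATA/base/12345"

# Copy everything except postmaster.pid and the contents of pg_wal;
# the needed WAL segments are added from the archive afterwards.
tar -C "$PGDATA" --exclude='postmaster.pid' --exclude='pg_wal/*' -cf - . \
  | tar -C "$BACKUP_DIR" -xf -
```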
4. Stop the backup

Once the copying of files is complete, execute pg_stop_backup:
SELECT * FROM pg_stop_backup(false, true);
NOTICE:  all required WAL segments have been archived
    lsn     |                           labelfile                            | spcmapfile
------------+----------------------------------------------------------------+------------
 0/1E000100 | START WAL LOCATION: 0/1B000028 (file 00000001000000000000001B)+|
            | CHECKPOINT LOCATION: 0/1B000060                               +|
            | BACKUP METHOD: streamed                                       +|
            | BACKUP FROM: primary                                          +|
            | START TIME: 2020-08-31 19:57:28 PKT                           +|
            | LABEL: label                                                  +|
            | START TIMELINE: 1                                             +|
            |                                                                |
(1 row)
This terminates the backup and returns a single row with three columns. Using these values, we need to create a self-contained, complete backup. This requires the following steps:
- Copy the second column data to a file named ‘backup_label’ in the backup directory.
- If the value for the third column is not empty, copy the value to a file named ‘tablespace_map’ in the backup directory.
- Include the required WAL files from the archive directory in the backup: we need to copy all the WAL files between the starting and ending WAL locations into the pg_wal directory of the backup.
- Starting WAL location
The first WAL file is identified by the second column in the output above; 00000001000000000000001B in this case.
- Ending WAL location
Using the lsn value in the first column, we retrieve the complete filename by issuing the following query:
postgres=# select pg_walfile_name('0/1E000100');
     pg_walfile_name
--------------------------
 00000001000000000000001E
(1 row)
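Putting these steps together, a small script can assemble the self-contained backup. Everything below is a sketch built around the segment names from the example output above; the archive directory is mocked up with empty files, and in practice you would substitute the values returned by your own pg_stop_backup call:

```shell
# Mock archive directory with a few WAL segments, standing in for the real
# archive_command destination (illustrative only).
ARCH_DIR=$(mktemp -d)
BACKUP_DIR=$(mktemp -d)
mkdir -p "$BACKUP_DIR/pg_wal"
for n in 1A 1B 1C 1D 1E 1F; do
  touch "$ARCH_DIR/0000000100000000000000$n"
done

# Boundaries from the example: starting segment from the labelfile,
# ending segment from pg_walfile_name() on the stop lsn.
START=00000001000000000000001B
STOP=00000001000000000000001E

# Save the labelfile column as backup_label in the backup directory.
printf 'START WAL LOCATION: 0/1B000028 (file %s)\n' "$START" \
  > "$BACKUP_DIR/backup_label"

# Copy every archived segment between START and STOP (inclusive) into the
# backup's pg_wal directory; WAL file names sort lexicographically.
for f in "$ARCH_DIR"/*; do
  name=$(basename "$f")
  if [ ! "$name" \< "$START" ] && [ ! "$name" \> "$STOP" ]; then
    cp "$f" "$BACKUP_DIR/pg_wal/"
  fi
done
```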
The manual method above provides a lot of flexibility, but handling all the steps by hand is complicated and error-prone. PostgreSQL therefore also provides a tool named pg_basebackup, which creates a consistent file-system-level backup using the same method internally.
However, this tool uses the PostgreSQL replication protocol to connect and copy the files from the database system, so the following configuration is needed for it to work.
1. Enable replication connection
In the PostgreSQL pg_hba.conf file, add the following lines:
# TYPE  DATABASE        USER            ADDRESS                 METHOD

# Allow replication connections from localhost, by a user with the
# replication privilege.
local   replication     all                                     trust
host    replication     all             127.0.0.1/32            trust
- You might want to replace 'all' with a specific user created for backups with the REPLICATION attribute, or with any superuser.
- Similarly, configure the IP addresses from which connections can be made, and configure appropriate authentication methods.
(For changes in pg_hba.conf to take effect, reload the server configuration, e.g. with pg_ctl reload; a full restart is not required.)
Once a replication connection is set up, we can take a backup with the pg_basebackup utility, which supports two output formats for the backup, along with compression and some other options.
# take a backup in tar format.
pg_basebackup -h $DB_SRV_IP -p $PORT -U $USER_NAME -Ft -D $BACKUP_DIR

# or take a backup in plain format.
pg_basebackup -h $DB_SRV_IP -p $PORT -U $USER_NAME -Fp -D $BACKUP_DIR
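pg_basebackup can also stream the WAL generated during the backup over a second replication connection, and compress its tar output. A sketch, using the same placeholder names as above (flags as found in recent PostgreSQL versions):

```shell
# -X stream streams the WAL produced while the backup runs, making the
# result self-contained; -z gzip-compresses the tar archives (tar format only).
pg_basebackup -h $DB_SRV_IP -p $PORT -U $USER_NAME -Ft -z -X stream -D $BACKUP_DIR
```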
This command will take a complete standalone backup and will store the backup in $BACKUP_DIR directory.
The backup created with pg_basebackup is a self-contained backup and can be used as is. If it's a tar backup:
- extract the base.tar
- extract the pg_wal.tar and copy the WAL files to the pg_wal directory.
- configure the database server to use this directory as its data directory, e.g.:
pg_ctl -D $BACKUP_DIR start
When to choose one over the other?
There is no clear-cut way to select one tool over the other; both have their advantages and disadvantages. For example:
- SQL dumps are great if you want to upgrade to a new database version.
- They provide more control over the backup; for example, you can select individual objects to back up instead of backing up the whole database.
- Backups are smaller, since index contents are not backed up; only the index definitions are, and the indexes are rebuilt during restore.
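As an illustration of this object-level selection, pg_dump's --table and --schema switches can narrow the dump (database, schema, and table names below are hypothetical):

```shell
# Dump only one table.
pg_dump -U postgres --table=public.orders -f orders.sql mydb

# Dump everything in a single schema.
pg_dump -U postgres --schema=reporting -f reporting.sql mydb
```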
However, they also have downsides:
- They increase restoration time, since the whole database is rebuilt from the ground up.
- Although the backup can be taken without any downtime, it may hinder performance, since it can introduce lock contention during the backup.
Similarly, physical backups have their own advantages:
- Faster restore time
- No lock contentions
- The backup is binary, so there is no possibility of the data in the backup being altered.
- The complete database system is backed up. So it’s like the exact copy of the database.
As well as their own disadvantages:

- Physical backups are version-specific; a backup taken on one major version cannot be restored on a different one.
- Backups can be large, since the entire cluster, including index data, is copied.
- Not much flexibility is available: the entire system is backed up and restored, with no way to select individual objects.
So choosing one way of backing up your PostgreSQL data over the other is a decision that must be based on the specific use case. Both options have their pros and cons; pick whichever fits your use case better.
Asif Rehman is a Senior Software Engineer at HighGo Software. He joined EnterpriseDB, an enterprise PostgreSQL company, in 2005 and started his career in open source development, particularly in PostgreSQL. Asif's contributions range from developing in-house features relating to Oracle compatibility to developing tools around PostgreSQL. He joined HighGo Software in September 2018.