This blog was written to help beginners understand and set up server replication in PostgreSQL using failover and failback. Much of the information found online about this topic, while detailed, is out of date. Many changes have been made to how failover and failback are configured in recent versions of PostgreSQL. In this blog, I want to not only go over the steps on how to set up failover and failback but also the common pitfalls that may arise. Let’s start first with a brief definition of these recovery methods we will be using.
A failover operation is when either some unplanned error or planned maintenance occurs, we switch our main server over to the replica. This is typically done with a tool like
pg_ctl promote. The main server is taken down and an up-to-date replica takes over as the new master. This allows for more availability in the case of failure of maintenance of the server. Once done the old main severe can be promoted back.
A failback operation is when we copy required data from a source to a target database to resume mirroring. This is to ensure both databases are synchronized so that mirroring can continue. This is typically issued after a Failover operation to restart mirroring in either direction, usually using a tool like
For this tutorial, we will assume that you have PostgreSQL installed and know how to set up physical replication with streaming. All of these steps were performed using PostgreSQL 16 (development version) on Ubuntu 22.10. We use
$HOME/pg/ as our working folder with
$HOME/pg/archive as our primary data, replica data and archive directory respectively.
Throughout this blog we will make reference to different stages of the primary and replica server. Due to the dynamic nature of the topic we are discussing this can get quite confusing as primary servers change to replicas and then back again. To remedy this I will briefly summarize the different names of the server states I use.
- Primary/Main Server: This is the server that we begin with, the replica is made of this server. Later on, it is referred to as the “old main/primary” or “new replica/secondary/standby” server but it is still the same server.
- Replica/Secondary/Standby Server: This server is the second one made, used to replicate the primary server. Later on, it is referred to as the “new main/primary” or “old replica/secondary/standby” server.
Using Failover and Failback to Restore a Database
Start off by creating a primary server. Before starting your primary server, add the following to your primary server’s
archive_mode = on wal_level = hot_standby max_wal_senders = 10 hot_standby = on wal_log_hints = on archive_command = 'cp %p $HOME/pg/archive/%f' restore_command = 'cp $HOME/pg/archive/%f %p'
Once this is done continue with setting up the replication server like usual.
If you followed the replica server setup correctly then it should be in sync with the primary. Once this is confirmed, promote this standby server to become the main server.
$ pg_ctl promote -D $HOME/pg/rep waiting for server to promote.... done server promoted
Failover In Action
This is the Failover part of the process as the replica is now acting as the primary server. If our main database is corrupted or crashed, we can use our replica to ensure as little downtime as possible. But just because we set this replica as the main doesn’t mean our old main server isn’t still receiving requests. This is a real issue that can cause problems later on, so let’s add some data to simulate this happening so we can see how to fix it.
postgres=# insert into t1 values (7,8); INSERT 0 1 postgres=# select * from t1; a | b ---+--- 1 | 2 3 | 4 5 | 6 7 | 8 (4 rows)
Checking the new main server hasn’t been altered:
postgres=# select * from t1; a | b ---+--- 1 | 2 3 | 4 5 | 6 (3 rows)
So we can see the servers are now out of sync. PostgreSQL often says that their “timelines” have “diverged”. Let’s shutdown the old main server:
$ pg_ctl -D $HOME/pg/data stop waiting for server to shut down.... done server stopped
Then we need to update the old main server’s
recovery_target_timeline = 'latest' primary_conninfo = 'user=postgres host=localhost port=5433' primary_slot_name = 'main' port = 5432
The slot “main” does not exist yet, so we will create it in the new main server:
$ psql -d postgres -U postgres -c "select * from pg_create_physical_replication_slot('main');" -p 5433 slot_name | lsn -----------+----- main | (1 row) $ psql -d postgres -U postgres -p 5433 -c "select * from pg_replication_slots;" -x -[ RECORD 1 ]-------+--------- slot_name | main plugin | slot_type | physical datoid | database | temporary | f active | f active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase | f
Also, don’t forget that our old main is the new standby so we have to create a dummy
$ touch $HOME/pg/data/standby.signal
Now let’s try and start up the old main server:
$ pg_ctl -D $HOME/pg/data start waiting for server to start....2023-04-03 11:43:00.115 PDT  LOG: starting PostgreSQL 16devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 12.2.0-3ubuntu1) 12.2.0, 64-bit 2023-04-03 11:43:00.115 PDT  LOG: listening on IPv4 address "127.0.0.1", port 5432 2023-04-03 11:43:00.116 PDT  LOG: listening on Unix socket "/tmp/.s.PGSQL.5432" 2023-04-03 11:43:00.118 PDT  LOG: database system was shut down in recovery at 2023-04-03 11:42:38 PDT cp: cannot stat '/home/tristen/pg/archive/00000002.history': No such file or directory cp: cannot stat '/home/tristen/pg/archive/00000003.history': No such file or directory 2023-04-03 11:43:00.120 PDT  LOG: entering standby mode cp: cannot stat '/home/tristen/pg/archive/00000002.history': No such file or directory cp: cannot stat '/home/tristen/pg/archive/000000020000000000000003': No such file or directory cp: cannot stat '/home/tristen/pg/archive/000000010000000000000003': No such file or directory 2023-04-03 11:43:00.122 PDT  FATAL: requested timeline 2 is not a child of this server's history 2023-04-03 11:43:00.122 PDT  DETAIL: Latest checkpoint is at 0/3000508 on timeline 1, but in the history of the requested timeline, the server forked off from that timeline at 0/30002E0. 2023-04-03 11:43:00.123 PDT  LOG: startup process (PID 69833) exited with exit code 1 2023-04-03 11:43:00.123 PDT  LOG: aborting startup due to startup process failure 2023-04-03 11:43:00.123 PDT  LOG: database system is shut down stopped waiting pg_ctl: could not start server Examine the log output.
Uh-oh, looks like our server can’t start! Pay close attention to this line in particular:
2023-04-03 11:43:00.122 PDT  FATAL: requested timeline 2 is not a child of this server's history
Failback In Action
This indicates that our old main server cannot start as a replica of the new main because they are out of sync. How do we get them back into sync? This is where the Failback portion comes in! First, we have to ensure that our archive is restored. We can either rely on our
restore_command or do it manually like so:
cp -n $HOME/pg/archive/* $HOME/pg/data/pg_wal/
pg_rewind will not work unless we have the correct archives restored. Once that is done, we can rewind, restoring our old main to its original condition. Be careful, as this will delete any data on the old main after it diverged and also copy over the
postgres.conf file of the new main. So if you have specific ports or configs you want to use remember to change them after issuing
$ pg_rewind --target-pgdata=data --source-server="port=5433 user=postgres dbname=postgres" pg_rewind: servers diverged at WAL location 0/3000000 on timeline 1 pg_rewind: rewinding from last common checkpoint at 0/2000060 on timeline 1 pg_rewind: Done!
We’ll also add the following to our
postgresql.conf because it was overwritten by the new main:
primary_slot_name = 'main' recovery_target_timeline = 'latest' primary_conninfo = 'user=postgres host=localhost port=5433 sslmode=prefer sslcompression=0 gssencmode=disable target_session_attrs=any' port = 5432
primary_conninfo = 'user=postgres host=localhost port=5433 sslmode=prefer sslcompression=0 gssencmode=disable target_session_attrs=any' port = 5432
Now we should be able to start the old main server:
$ pg_ctl -D $HOME/pg/data -l data.log start waiting for server to start.... done server started
Once started, let’s check to make sure its contents are actually rolled back:
$ psql -d postgres -c "select * from t1;" a | b ---+--- 1 | 2 3 | 4 5 | 6 (3 rows)
As we can see the tuple
(7,8) that was inserted after the replica was promoted is no longer in the table. Therefore, our database was successfully rolled back.
Now our old main server is in sync with the new main server and can be used as a replica. If your old main server is not syncing with the new main server this could be due to
postgresql.auto.conf in our old main server’s data directory. This file overwrites what we placed into
postgresql.conf with some defaults, so either delete this file or append the correct
port number to it.
Restoring the Old Main Server
Now let’s say we want to restore our old main server (the current replica) to be the main server of the cluster (as it was originally). If we don’t write any data to our current main server after we promote the replica, then this is simple as we won’t need
So first, stop the new main server:
$ pg_ctl -D $HOME/pg/rep stop
Then promote the new replica:
$ pg_ctl -D $HOME/pg/data promote
Don’t forget to add the
standby.signal dummy file to the original replica:
$ touch $HOME/pg/rep/standby.signal
Now, let’s start up the replica:
$ pg_ctl -D $HOME/pg/rep -l rep.log start
We should see the following line somewhere in our logfile to let us know that this replica server has started in standby mode:
2023-04-04 14:30:04.532 PDT  LOG: entering standby mode
But before we can use the replica server we have to update our replica slots. Specifically, we need to create a replica slot on our old main serve and drop the existing “main” replica slot on the old replica server.
$ psql -U postgres -c "select * from pg_create_physical_replication_slot('replica');" -p 5432 $ psql -U postgres -c "select * from pg_drop_replication_slot('main');" -p 5433
Now we should see any values inserted into our original primary database be replicated in the original replica server, just like they were before!
$ psql -d postgres -c "insert into t1 values (0,0);" -p 5432 INSERT 0 1 $ psql -d postgres -c "select * from t1;" -p 5433 a | b ---+---- 1 | 2 3 | 4 5 | 6 9 | 10 0 | 0 (5 rows)
And that’s it! We’ve successfully recovered our original database by first falling back to a replica, restoring the primary, turning the primary into a replica for the server that replaced it, then restoring the primary to its original state.
Broccolo, G. (2016, September 12). [web log]. Retrieved April 4, 2023, from https://www.2ndquadrant.com/en/blog/introduction-to-pgrewind/.
Huang, C. (2019, November 7). [web log]. Retrieved April 4, 2023, from https://www.highgo.ca/2019/11/07/replication-failover-with-pg_rewind-in-pg12/.
Jamison, C. (2018, April 10). [web log]. Retrieved April 4, 2023, from https://www.scalingpostgres.com/tutorials/postgresql-replication-failback-pg-rewind/.
Tristen received his Bachelor of Applied Science in Computer Engineering from the University of British Columbia in May 2023. He joined HighGo Software Inc. as a Software Engineer fresh out of university and is very excited for his engineering journey to begin. His main interests include Machine Learning, Embedded Systems, and Database Management Systems. With experience in C/C++ and advanced relational databases, Tristen hopes to contribute significantly to the PostgreSQL community as he continues to learn and grow his expertise.