An Overview of Replication in PostgreSQL Context

Enterprise PostgreSQL Solutions

Comments are off

An Overview of Replication in PostgreSQL Context

Replication is a critical part of any database system that aims to provide high availability (HA) and effective disaster recovery (DR) strategy. This blog is aimed at establishing the role of replication in a database system. The blog will give a general overview of replication and its types as well as an introduction to replication options in PostgreSQL.

The term Replication is used to describe the process of sharing information between one or more software or hardware systems; to ensure reliability, availability, and fault-tolerance. These systems can be located in the same vicinity, could be on a single machine or perhaps connected over a wide network. The replication can be divided broadly into Hardware and Software categories. We’ll explore these categories briefly, however, the main focus of this blog is database replication. So, let’s first understand what constitutes a database replication system.

In a nutshell, the database replication describes the process of coping data from the database instance to one or more database instances. Again, these instances can be in the same location or perhaps connected over a wide network.

Hardware-Based Replication

Let’s start with the hardware replication. Hardware-based replication keeps the multiple connected systems in sync. This syncing of data is done at the storage level as soon as an I/O is performed on the system, it’s propagated to the configured storage modules/devices/systems. This type of replication can be done for the entire storage or the selected partitions. The biggest advantage of this type of solution is that (generally) it’s easier to set up and is independent of software. This makes it perform much better, however, it reduces the flexibility and control over the replication. Here are the few pros of this type of replication.

Real-time – all the changes are applied to subsequent systems immediately.
Easier to Setup – no scripting or software configurations are required.
Independent of Application – replication happens at the storage layer and is independent of OS/software application.
Data Integrity and Consistency – As the mirroring happens at the storage layer, which effectively makes an exact copy of the storage disk, So the data integrity and consistency are automatically ensured.

Although hardware replication has some very appealing advantages yet it comes with its own limitations. It generally relies on the vendor locking i.e. the same type of hardware has to be used and often it’s not very cost-effective.

Software-Based Replication

This replication can range from generic solutions to product-specific solutions. The generic solutions tend to emulate the hardware replication by copying the data between different system at the software level. The software that is responsible for performing the replication, copies each bit written to the source storage and propagates it to the destination system(s). Whereas the product-specific solutions, are more considerate towards the product requirements and are generally meant for a specific product. The software-based replication has its pros and cons. On one hand, It provides flexibility and control over how the replication is done and is usually very cost-effective while provides a much better set of features. But on the other hand, it requires a lot of configurations and requires continuous monitoring and maintenance.

Database Replication

Having discussed replication and its different types; lets now turn our focus toward the database replication topic.

Before going into more details lets start by discussing the different terminologies used to describe the different components of a replication system in the database world. Primary-Standby, Master-Slave, Publisher-Subscriber, Master-Master/Multimaster are the most often used terms to describe the database servers participating in the replication setup.

The term Primary, Master and Publisher are used to describing the active node that strives to propagate changes received by it to the other nodes. Whereas Standby, Slave and Subscriber terms are used to describe the passive nodes that will be receiving the propagated changes from the active nodes. In this blog, we will use Primary to describe the active node and Standby for the passive node.

The database replication can be configured in Primary-Standby and Multi-master configurations.

In a Primary-Standby configuration, only one instance receives the data changes and then it distributes them among the standbys.

Primary-Standby Configuration

Whereas in the Multi-master system, each instance of a database can receive the data change and then propagates that change to other instances.

Multi-Master Configuration

Synchronous/Asynchronous Replication

At the core of the replication process is the ability of the primary node to transmit data to standbys. Much like any other data transfer strategy, this can happen in a synchronous way where primary waits for all standbys to confirm that they have received and written the data on disk. The client is given a confirmation when these acknowledgments are received. Alternatively, the primary node can commit the data locally and transmit the transaction data to standbys whenever possible with the expectation that standbys will receive and write the data on disk. In this case, standbys do not send confirmation to the primary. Whereas the former strategy is called synchronous replication, this is referred to as asynchronous replication. Both hardware and software-based solutions support synchronous and asynchronous replication.

Following are diagrams that show both types of replication strategies in a graphical way.

Since the main difference between the two strategies is the acknowledgment of data written on the standby systems; there are advantages and disadvantages in using both techniques. Synchronous replication may be configured in a way such that all systems are up-to-date and that the replication is done in real-time. But at the same time, it adversely impacts the performance of the system since the primary node waits for standbys’ confirmations.

The asynchronous replication tends to give better performance as there is no wait time attributed to confirmation messages from standbys.

Cascading Replication

So far we have seen that only the primary node transmits the transaction data to standbys. This is a load that can be distributed among multiple systems; i.e. a standby that has received the data and written on disk can transmit it onwards to standbys that are configured to receive data from it. This is called cascading replication where replication is configured between primary and standbys, and standby(s) and standbys. Following is a visual representation of cascading replication.

Cascading Replication

Standby Modes

Warm standby is a term used to describe a standby server that allows standby to receive the changes from primary but does not allow any of the client connection to it directly.

Hot standby is a term used to describe a standby server which also allows accepting the client connections.

PostgreSQL Replication

In the end, I would like to share the replication options available for PostgreSQL.

PostgreSQL offers built-in log streaming and logical replication options. The built-in replication is only available in the primary-standby configuration.

Streaming Replication (SR)

Also known as physical and binary replication in Postgres. Binary data is streamed to the standbys. In order to stream such data, streaming replication uses the WAL (write-ahead log) records. Any change made on the primary server is first written to the WAL files before it is written on disk and database server has the capability to stream these records to any number of standbys. Since its a binary copy of data, there are certain limitations to it as well. This type of replication can only be used when all of the changes have to be replicated on the standbys. One cannot use this if only a subset of changes is required. The standby servers either do not accept the connections or if they do, they can only serve the read-only queries. The other limitation is that it can not be used with a different version of the database server. The whole setup, consisting of multiple database servers has to be of the same version.

Streaming replication is asynchronous by default, however, it not much difficult to turn on the synchronous replication. This kind of setup is perfect for achieving high availability (HA). If the main server fails, one of the standbys can take its place since they are almost the exact copy of it. Here is the basic configuration:

# First create a user with replication role in the primary database, this user will be used by standbys to establish a connection with the primary:

# Add it to the pg_hba.conf to allow authentication for this user:
# TYPE  DATABASE        USER       ADDRESS                 METHOD
host    replication     foouser    <ip address range>      trust

# Take a base backup on the stanby:
pg_basebackup -h <primary-ip> -U foouser -D ~/standby

# In postgresql.conf following entries are needed
wal_level=replica # should be set to replica for streaming replication
max_wal_senders = 8 # number of standbys to allow connection at a time

# On the stanby, create a recovery.conf file in its data direcrtory with following contents:
primary_conninfo='user=foouser host=<primary-ip> port=<primary-port>'

Now start the servers and a basic streaming replication should work.

Logical Replication

Logical Replication is considerably new in PostgreSQL as compared to the streaming replication. The streaming replication is a byte-by-byte copy of the entire data. It, however, does not allow replication of a single table or a subset of the data from a primary server to standbys. Logical replication enables replicating particular objects in the database to the standbys, instead of replicating the whole database. It also allows replicating data between different versions of the database servers and not only that it also allows the standby to accept both read and write queries. However, be watchful of the fact that there is no conflict resolution system implemented. So if both primary and standbys are allowed to write on the same table, more likely than not, there will be data conflicts that will stop the replication process. In this case, the user must manually resolve these conflicts. However, both standbys and primary may allow writing on complete disjointed sets of data which will avoid any conflict resolution needs. Here is the basic configuration:

# First create a user with replication role in the primary database, this user will be used by standbys to establish a connection with the primary:

# Add it to the pg_hba.conf to allow authentication for this user:
# TYPE  DATABASE        USER       ADDRESS                 METHOD
host    replication     foouser    <ip address range>      trust

# For logical replication, base backup is not required, any of the two database instances can be used to create this setup.

wal_level=logical # for logical replication, wal_level needs to be set to logical in postgresql.conf file.

# on primary server, create a some tables to replicate and a publication that will list down these tables.
CREATE TABLE t1 (col1 int, col2 varchar);
CREATE TABLE t2 (col1 int, col2 varchar);
INSERT INTO t1 ....;
INSERT INTO t2 ....;

# on the standby, the structure of above mentioned tables needs to be created as DDL is not replicated.
CREATE TABLE t1 (col1 int, col2 varchar);
CREATE TABLE t2 (col1 int, col2 varchar);

# on standby, create a subscription for the above publication.
CREATE SUBSCRIPTION foosub CONNECTION 'host=<primary-ip> port=<primary-port> user=foouser' PUBLICATION foopub;

With this a basic logical replication is achieved.

I hope this blog helped you understand replication in general and from PostgreSQL perspective streaming and logical replications concepts. Furthermore, I hope this helps you in better designing a replication environment for your needs.