This blog is aimed at readers learning the basics of PostgreSQL and HAProxy who already have some experience under their belt. For this tutorial, we will assume you have PostgreSQL correctly installed on Ubuntu. All of these steps were done using PostgreSQL 16 (development version) and HAProxy 2.6.9 on Ubuntu 23.04. We’ll go over an example of setting up multiple PostgreSQL servers, one primary and a few replicas, and how they can be balanced using HAProxy.
For our setup, we want something like the image below.
A user will use psql to interact with the Postgres database. We’ll have one primary database to handle both read and write requests, and two replicas that will handle read requests as well as any failovers that may occur. HAProxy sits between the user and the databases, acting as a load balancer. So when the user connects to the database, they do not know which specific database they have reached, but it shouldn’t matter as long as all the data is replicated correctly.
We can then set this up with a few easy commands. First, we need to create the primary:
$ mkdir data1/
$ initdb data1/
Don’t forget to edit postgresql.conf and pg_hba.conf with the correct connection information. For this example, we are using port 10001 for the primary and 10002, 10003, 10004 for the replicas. Look at our previous blog here if you want to learn more about setting up replicas locally. Start the primary server once this configuration is done.
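The exact edits depend on your environment, but for the primary they might look something like the following (the rep_user role and local-only trust access are assumptions for this local example; use stricter authentication in production):

```
# postgresql.conf (primary)
port = 10001
wal_level = replica
max_wal_senders = 10

# pg_hba.conf (primary) - allow the replication user to connect locally
host  replication  rep_user  127.0.0.1/32  trust
```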
Now for our two replicas:
$ mkdir data2/ data3/
$ pg_basebackup -h localhost -p 10001 -U rep_user -X stream -C -S replica_1 -v -R -W -D $HOME/Documents/work/pg_test/data2/
$ pg_basebackup -h localhost -p 10001 -U rep_user -X stream -C -S replica_2 -v -R -W -D $HOME/Documents/work/pg_test/data3/
Don’t forget to also edit the configuration files of these replicas so they have the correct port numbers. Once you have done this, feel free to start the standby servers and check that they are replicating the primary’s data.
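One quick way to check is to ask the primary which standbys are streaming from it. With the ports assumed above, something like:

```
$ psql -p 10001 -c "SELECT pid, state, sent_lsn FROM pg_stat_replication;"
```

You should see one row per connected standby, with state showing streaming.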
All we need now is HAProxy. Here is an example of our HAProxy configuration file for what we are trying to implement:
global
    stats socket ipv4@:9999 level admin

defaults
    mode tcp
    timeout connect 20s
    timeout client 10m
    timeout server 10m

frontend front
    bind localhost:9998
    use_backend back

backend back
    balance roundrobin
    server-template pg 1-10 localhost:10001
Let’s look at this config file line by line:
- Lines 1-2: This is the global section that applies to everything in our config. Here we have defined a statistics socket on port 9999 which will be used for HAProxy API commands in order to dynamically assign new servers.
- Lines 4-8: This is the defaults section that sets what values should be used when no specifications are made. Here we define our default mode to be TCP (instead of HTTP), which means all of our connections will be TCP since that is what psql uses. We also define some timeouts, mainly so we don’t get warning messages when starting HAProxy.
- Lines 10-12: This is our frontend, bound to port 9998. Here HAProxy will listen for TCP connections on port 9998 and forward them to the backend named “back”.
- Lines 14-16: This is our backend, which has 10 servers defined and balanced by a round-robin algorithm. The 10 servers are defined using server-template, which means that we have 10 servers available to be dynamically allocated during HAProxy’s runtime. This number is important, as it should cover all the servers we expect to need for the lifetime of our application. Changing it would involve editing the configuration and restarting HAProxy.
To check that our configuration is valid we can run the following:
$ haproxy -f /etc/haproxy/haproxy.cfg -c
Configuration file is valid
How you start HAProxy depends on your system. Since I am running Ubuntu 23.04, I have two choices.
$ systemctl start haproxy.service
$ service haproxy start
Both will work on Ubuntu, but make sure to select the correct method for your operating system.
Once HAProxy is running, we need to activate our backend servers by assigning them to the correct ports that have our databases. We can do this by sending an HAProxy runtime API command over TCP to the port we defined as the global stats port in our configuration file. Here we will be using socat to deliver the command.
$ echo "set server back/pg1 addr 127.0.0.1 port 10001" | socat stdio tcp4-connect:127.0.0.1:9999
$ echo "set server back/pg2 addr 127.0.0.1 port 10002" | socat stdio tcp4-connect:127.0.0.1:9999
$ echo "set server back/pg3 addr 127.0.0.1 port 10003" | socat stdio tcp4-connect:127.0.0.1:9999
This will assign 3 of our 10 servers to ports 10001, 10002, and 10003, which are exactly the ports we said our databases would be running on. Now whenever HAProxy receives a TCP request, it will forward it to the backend, which is now a pool of 3 servers.
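Since these commands differ only in the server number and port, a small shell loop can generate them. The sketch below just prints the commands (pipe its output to socat to actually apply them), and assumes the same backend name and port numbering as above:

```shell
# Print one "set server" runtime API command per replica slot.
# To apply: pipe the output to `socat stdio tcp4-connect:127.0.0.1:9999`
for i in 1 2 3; do
  printf 'set server back/pg%d addr 127.0.0.1 port %d\n' "$i" $((10000 + i))
done
```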
We can get a full list of commands that the HAProxy runtime API supports by running:
$ echo "help" | socat stdio tcp4-connect:127.0.0.1:9999
Dynamically add a new server
Now that we have our initial setup running, let’s say our select traffic shoots up and we need another server to handle it. Create another Postgres instance that will act as our third replica.
$ mkdir data4/
$ pg_basebackup -h localhost -p 10001 -U rep_user -X stream -C -S replica_3 -v -R -W -D $HOME/Documents/work/pg_test/data4/
Like before, we can send a message to a running HAProxy instance to add new servers without restarting it:
$ echo "set server back/pg4 addr 127.0.0.1 port 10004" | socat stdio tcp4-connect:127.0.0.1:9999
This will assign one of our 10 server templates to the given address and port, giving us a total of 4 servers in our pool.
We can then run show servers state to see our server states:
$ echo "show servers state" | socat stdio tcp4-connect:127.0.0.1:9999
1
# be_id be_name srv_id srv_name srv_addr srv_op_state srv_admin_state srv_uweight srv_iweight srv_time_since_last_change srv_check_status srv_check_result srv_check_health srv_check_state srv_agent_state bk_f_forced_id srv_f_forced_id srv_fqdn srv_port srvrecord srv_use_ssl srv_check_port srv_check_addr srv_agent_addr srv_agent_port
3 back 1 pg1 127.0.0.1 0 5 1 1 371 1 0 0 14 0 0 0 localhost 10001 - 0 0 - - 0
3 back 2 pg2 127.0.0.1 0 5 1 1 371 1 0 0 14 0 0 0 localhost 10002 - 0 0 - - 0
3 back 3 pg3 127.0.0.1 0 5 1 1 371 1 0 0 14 0 0 0 localhost 10003 - 0 0 - - 0
3 back 4 pg4 127.0.0.1 0 5 1 1 371 1 0 0 14 0 0 0 localhost 10004 - 0 0 - - 0
3 back 5 pg5 127.0.0.1 0 5 1 1 371 1 0 0 14 0 0 0 localhost 10001 - 0 0 - - 0
3 back 6 pg6 127.0.0.1 0 5 1 1 371 1 0 0 14 0 0 0 localhost 10001 - 0 0 - - 0
3 back 7 pg7 127.0.0.1 0 5 1 1 371 1 0 0 14 0 0 0 localhost 10001 - 0 0 - - 0
3 back 8 pg8 127.0.0.1 0 5 1 1 371 1 0 0 14 0 0 0 localhost 10001 - 0 0 - - 0
3 back 9 pg9 127.0.0.1 0 5 1 1 371 1 0 0 14 0 0 0 localhost 10001 - 0 0 - - 0
3 back 10 pg10 127.0.0.1 0 5 1 1 371 1 0 0 14 0 0 0 localhost 10001 - 0 0 - - 0
As we can see, all 4 of our servers have been assigned their ports.
Now we can use psql to manage our database and have our select queries distributed among our 4 servers.
Note that this setup only works if we want to distribute read requests through HAProxy, like SELECT statements. This is because HAProxy does not care about the contents of the TCP connection; it simply sees a connection and forwards it based on round-robin. If we issue a write query that modifies data, we run into a problem. Replica servers cannot serve these write requests and will return an error. Since 3 of our 4 servers are replicas, it becomes obvious that this will not work unless we plan on never editing our data.
Let’s look at some small improvements we can make to this example.
The first improvement I would want to make would be separating the primary and standby databases into different frontends so we get a port for each. We can then modify data by connecting to the primary frontend while serving all of our read requests through the replica frontend.
This HAProxy configuration might look like the following:
global
    stats socket ipv4@:9999 level admin

defaults
    mode tcp
    timeout connect 20s
    timeout client 10m
    timeout server 10m

frontend primary
    bind localhost:9997
    use_backend primary_backend

backend primary_backend
    server pg-primary localhost:10001

frontend replica
    bind localhost:9998
    use_backend replica_pool

backend replica_pool
    balance roundrobin
    server-template pg-replica 1-10 localhost:10002
With two frontends we no longer need to activate our primary like before, as it is already a fully defined server. Our replica pool can still be dynamically allocated, allowing us to add or remove servers based on demand.
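With this split, a client session might look something like the following (the database name is a placeholder for whatever you created):

```
$ psql -h localhost -p 9997 -d mydb    # writes (and reads) routed to the primary
$ psql -h localhost -p 9998 -d mydb    # read-only queries round-robined across replicas
```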
The second improvement I would suggest would be automating the allocation process by developing our own program which monitors the database load. This program would be its own server that periodically checks the average load of all the replicas and decides based on this number whether or not to create a new Postgres database and add it to the replica pool. HAProxy’s runtime API makes this very easy as all we need is a TCP library which most languages have. We also need a way to create a new Postgres database which could be done using system commands if no PostgreSQL library exists for this language.
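As a sketch of that idea, the part that talks to HAProxy really is small. The snippet below (the function names, backend names, and the commented-out scaling step are all assumptions for illustration, not a complete autoscaler) builds a runtime API command and shows how it would be sent over TCP:

```python
import socket

def set_server_cmd(backend: str, name: str, addr: str, port: int) -> str:
    """Build an HAProxy runtime API command that points a
    server-template slot at a real address and port."""
    return f"set server {backend}/{name} addr {addr} port {port}"

def send_runtime_cmd(cmd: str, stats_host: str = "127.0.0.1",
                     stats_port: int = 9999) -> str:
    """Send one command to HAProxy's stats socket and return its reply."""
    with socket.create_connection((stats_host, stats_port)) as s:
        s.sendall((cmd + "\n").encode())
        s.shutdown(socket.SHUT_WR)
        return s.recv(4096).decode()

# A real monitor would loop: measure replica load, and when it crosses
# a threshold, create a new replica (e.g. via pg_basebackup) and then
# register it with something like:
#   send_runtime_cmd(set_server_cmd("replica_pool", "pg-replica5",
#                                   "127.0.0.1", 10006))
```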
In this blog, we went over an example of dynamically adding new replica servers to our primary PostgreSQL database using HAProxy. Having the ability to replicate data across servers serves as not only a defense against data loss from crashes but also as a way to improve data availability. Being able to add replicas dynamically then becomes critical for any application that relies on said data being consistently accessible. In conclusion, if you’re designing some new software architecture, don’t forget to think about how replication can play into the robustness of your design.
Tristen received his Bachelor of Applied Science in Computer Engineering from the University of British Columbia in May 2023. He joined HighGo Software Inc. as a Software Engineer fresh out of university and is very excited for his engineering journey to begin. His main interests include Machine Learning, Embedded Systems, and Database Management Systems. With experience in C/C++ and advanced relational databases, Tristen hopes to contribute significantly to the PostgreSQL community as he continues to learn and grow his expertise.