Experimental load balance setup based on shared-storage using pgpool and neon serverless

1. Overview

This blog describes a very experimental load balancing setup using pgpool and Neon compute nodes, which are based on Postgres. There are many load balancing discussions around different Postgres distributions, but they are mostly based on shared-nothing architectures. The setup discussed here is based on shared storage, using the Neon serverless solution.

2. Setup storage and compute nodes

To better illustrate the setup process, this blog demonstrates it using binaries built on Ubuntu 22.04; we can cover the Docker-based setup in a separate blog if more people are interested in this topic.

2.1. Build binaries

First, refer to Building on Linux to install all the build dependencies. Before building the Neon binaries, you need to build a recent version of protobuf so that the optional parameter is supported. Here are the steps I used to work around the build issues:

$ git clone https://github.com/protocolbuffers/protobuf.git
$ cd protobuf
$ git submodule update --init --recursive
$ cmake -Dprotobuf_BUILD_SHARED_LIBS=ON .
$ cmake --build . --parallel 4
$ sudo make install
$ protoc --version
libprotoc 22.3

After installing protobuf, follow Building on Linux to recursively clone the repo, and then check out a specific commit/release. You may not be able to set up the hot standby replica if you use a different commit.

$ git clone --recursive https://github.com/neondatabase/neon.git
$ cd neon
$ git checkout remotes/upstream/releases/2023-05-02 -b hot-standby-replica
$ make -j`nproc` -s

2.2. Start storage and Primary compute nodes

The storage and compute node build will take a while. After all binaries have been built successfully, run the commands below:

$ ./target/debug/neon_local init
$ ./target/debug/neon_local start
$ ./target/debug/neon_local tenant create --set-default
$ ./target/debug/neon_local endpoint start main
$ ./target/debug/neon_local endpoint list

After the above commands, you will have a main branch started as the Primary Postgres compute node.
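
As a quick sanity check (using the default port 55432 and the cloud_admin user, both of which are also used later in this blog), you can connect to the main endpoint and confirm that pg_is_in_recovery() returns f, i.e. it is running as a Primary:

$ psql -p55432 -h 127.0.0.1 -U cloud_admin postgres -c 'select pg_is_in_recovery();'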

2.3. Prepare settings for hot standby

Since the Neon shared-storage serverless solution currently relies on a specification file to manage the compute nodes, we must duplicate the original spec.json and make slight modifications so that a hot-standby node can be started.

$ cp -pr ./docker-compose/compute_wrapper/var/db/postgres/specs/spec.json spec-hotsb.json

$ curl -s --request GET \
    --url http://127.0.0.1:9898/v1/tenant \
    --header 'accept: application/json'  | jq
[
  {
    "id": "15f0247fde7e6d8e6125c44f7bd17ba1",
    ... ...

$ tenant_id="15f0247fde7e6d8e6125c44f7bd17ba1"

$ curl -s --request GET \
    --url http://127.0.0.1:9898/v1/tenant/$tenant_id/timeline \
    --header 'accept: application/json'  | jq
[
  {
    "tenant_id": "15f0247fde7e6d8e6125c44f7bd17ba1",
    "timeline_id": "a40a56c38bb581456089dd33b5379cf8",
    ... ...

We now have access to both the tenant_id and timeline_id, which can be used to make modifications to the spec-hotsb.json file. The following shows the changes that need to be made:

$ diff ./docker-compose/compute_wrapper/var/db/postgres/specs/spec.json spec-hotsb.json
43c43
<                 "value": "55433",
---
>                 "value": "55434",
103c103
<                 "value": "safekeeper1:5454,safekeeper2:5454,safekeeper3:5454",
---
>                 "value": "127.0.0.1:5454",
108c108
<                 "value": "TIMELINE_ID",
---
>                 "value": "a40a56c38bb581456089dd33b5379cf8",
113c113
<                 "value": "TENANT_ID",
---
>                 "value": "15f0247fde7e6d8e6125c44f7bd17ba1",
118c118
<                 "value": "host=pageserver port=6400",
---
>                 "value": "host=127.0.0.1 port=64000",
120a121,130
>             {
>                 "name": "hot_standby",
>                 "value": "on",
>                 "vartype": "bool"
>             },            
>             {
>                 "name": "primary_conninfo",
>                 "value": "dbname=postgres user=cloud_admin host=127.0.0.1 port=55432 sslmode=disable gssencmode=disable target_session_attrs=any",
>                 "vartype": "string"
>             },      

The modifications to spec-hotsb.json mainly include starting the hot standby on a different port, configuring the IP and port for the safekeeper and pageserver, specifying the tenant_id and timeline_id used by the primary node, enabling hot_standby mode, and adding primary_conninfo.
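
If you prefer not to copy the IDs by hand, the same two values can also be captured into shell variables with jq; this is just a convenience sketch against the same pageserver HTTP endpoints queried above:

$ tenant_id=$(curl -s http://127.0.0.1:9898/v1/tenant | jq -r '.[0].id')
$ timeline_id=$(curl -s "http://127.0.0.1:9898/v1/tenant/$tenant_id/timeline" | jq -r '.[0].timeline_id')
$ echo "$tenant_id $timeline_id"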

2.4. Start hot standby compute node

In the Neon shared-storage serverless solution, the compute node is stateless, so if you run the hot standby on a different machine you may need to allow replication connections in pg_hba.conf on the Primary via the psql console. The following commands can be used for reference in such a case; they are not necessary if you run everything on the same machine.

$ echo -e "\nhost\treplication\tall\t\t0.0.0.0/0\t\ttrust" >> .neon/endpoints/main/pgdata/pg_hba.conf
$ psql -h localhost -U cloud_admin -p 55432 -c "select pg_reload_conf();" postgres
$ psql -h localhost -U cloud_admin -p 55432 -c "select pg_read_file('pg_hba.conf');" postgres

Create the pgdata structure exactly as shown below, and then use compute_ctl to start the hot-standby.

$ mkdir -p .neon/endpoints/hotsb/pgdata
$ ./target/debug/compute_ctl \
   --pgdata .neon/endpoints/hotsb/pgdata \
   -C "postgresql://cloud_admin@localhost:55434/postgres" \
   -b ./pg_install/v14/bin/postgres \
   -S spec-hotsb.json

You may see an error like the one below:

ERROR could not start the compute node: db error: ERROR: cannot execute ALTER ROLE in a read-only transaction
... ...
INFO shutting down

Don’t worry; run ps -ef | grep postgres to check whether the hot standby is running, or use the endpoint list command below.

$ ./target/debug/neon_local endpoint list
 ENDPOINT  ADDRESS          TIMELINE                          BRANCH NAME  LSN        STATUS  
 hotsb     127.0.0.1:55434  a40a56c38bb581456089dd33b5379cf8  main         0/169AFA0  running 
 main      127.0.0.1:55432  a40a56c38bb581456089dd33b5379cf8  main         0/169AFA0  running

2.5. Verify Primary and hot standby setup

Let’s run a simple query to verify that the hot standby is working before we continue with the pgpool setup. First, run the following commands to insert some records:

$ psql -p55432 -h 127.0.0.1 -U cloud_admin postgres -c 'create table t(a int, b int);'
$ psql -p55432 -h 127.0.0.1 -U cloud_admin postgres -c 'insert into t values(generate_series(1, 10), generate_series(1, 10));'

Next, run the following two queries against the Primary and the hot standby; both should return the same result.

$ psql -p55432 -h 127.0.0.1 -U cloud_admin postgres -c 'select count(*) from t;'
$ psql -p55434 -h 127.0.0.1 -U cloud_admin postgres -c 'select count(*) from t;'
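
Optionally, you can also verify the replication link itself with the standard Postgres views: pg_is_in_recovery() should return t on the hot standby, and the Primary should report one streaming standby in pg_stat_replication.

$ psql -p55434 -h 127.0.0.1 -U cloud_admin postgres -c 'select pg_is_in_recovery();'
$ psql -p55432 -h 127.0.0.1 -U cloud_admin postgres -c 'select application_name, state, sync_state from pg_stat_replication;'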

3. Setup pgpool2

If all the above setups have been completed successfully, you can proceed with setting up pgpool2 from the source code.

3.1. Build pgpool2

Building a stable release pgpool from the source code is a simple process. You can follow these steps:

$ git clone https://github.com/pgpool/pgpool2.git
$ cd pgpool2
$ git checkout V4_4_STABLE
$ ./configure
$ make

3.2. Start pgpool2

Copy pgpool.conf.sample as pgpool.conf and add the following configuration to the end of pgpool.conf:

port = 9999 
pcp_port = 9998
statement_level_load_balance = on
sr_check_user = 'cloud_admin'
health_check_user = 'cloud_admin'
recovery_user = 'cloud_admin'
wd_lifecheck_user = 'cloud_admin'

Then, add the following parameters to pgpool.conf for the load balance test. Note that pgpool normalizes the backend weights, so the configured 0.1 and 0.3 will show up as 0.25 and 0.75 in show pool_nodes once both backends are attached:

backend_hostname0 = '127.0.0.1'
backend_port0 = 55432
backend_weight0 = 0.1
backend_data_directory0 = '/home/david/sandbox/neon/.neon/endpoints/main/pgdata'
backend_application_name0 = 'main'

# backend_hostname1 = '127.0.0.1'
# backend_port1 = 55434
# backend_weight1 = 0.3
# backend_data_directory1 = '/home/david/sandbox/neon/.neon/endpoints/hotsb/pgdata'
# backend_application_name1 = 'hotsb'

Since we want to manually scale the compute nodes up and down through the pcp interface, we need to create a pcp.conf file with the following two lines, which define a pcp user postgres whose password is postgres (stored as an MD5 hash):

$ cat pcp.conf
postgres:e8a48653851e28c69d0506508fb27fc5
# USERID:MD5PASSWD
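
If you do not want to compute the MD5 hash by hand, the pg_md5 utility that ships with pgpool (built under src/tools/pg_md5 in this source tree) prints the same value for a given password:

$ ./pgpool2/src/tools/pg_md5/pg_md5 postgres
e8a48653851e28c69d0506508fb27fc5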

Finally, you can start pgpool with a command like the following to observe the log:

$ ./pgpool2/src/pgpool -f pgpool.conf -F pcp.conf -n -D -C

3.3. Verify pgpool2

To verify that pgpool is running properly, run the following command:

$ psql -p9999 -h 127.0.0.1 -U cloud_admin postgres -c 'show pool_nodes;'
 node_id | hostname  | port  | status | pg_status | lb_weight |  role   | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change  
---------+-----------+-------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 127.0.0.1 | 55432 | up     | unknown   | 1.000000  | primary | unknown | 0          | true              | 0                 |                   |                        | 2023-05-12 14:31:48
(1 row)

4. Load balancing

To test the load balancing configuration, we will use pgbench to generate some load on the system. First, initialize the database with pgbench, then run 10 queries using a for loop:

$ pgbench -i -s 2 -p 9999 -U cloud_admin -h 127.0.0.1 postgres
$ port=9999
$ for i in {1..10}
  do
    echo "Query $i"
    psql -U cloud_admin -h 127.0.0.1 -c "SELECT min(aid), max(aid) FROM pgbench_accounts" -p $port postgres
  done

After running the queries, check the pgpool status again:

$ psql -p9999 -h 127.0.0.1 -U cloud_admin postgres -c 'show pool_nodes;'
 node_id | hostname  | port  | status | pg_status | lb_weight |  role   | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change  
---------+-----------+-------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 127.0.0.1 | 55432 | up     | unknown   | 1.000000  | primary | unknown | 10         | true              | 0                 |                   |                        | 2023-05-12 14:31:48
(1 row)

The select_cnt column confirms that all 10 queries were routed to the primary node.

4.1 Scale Up

Assuming there’s a lot of load at the moment and you want to shift some read-only load to the hot standby, follow these steps:

  1. Uncomment the backend_xxxx1 related settings in pgpool.conf.
  2. Reload the pgpool.conf file with the following command:
$ ./pgpool2/src/pgpool -f pgpool.conf reload
  3. Attach backend1 to pgpool with the password postgres using the following command (a quick check of the attach is shown right after this list):
$ ./pgpool2/src/tools/pcp/pcp_attach_node -h 127.0.0.1 -U postgres -p9998 -n 1
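
Before sending traffic, you can optionally confirm through the same pcp interface that backend1 has been attached; pcp_node_info lives next to pcp_attach_node in the source tree and takes the same connection options:

$ ./pgpool2/src/tools/pcp/pcp_node_info -h 127.0.0.1 -U postgres -p9998 -n 1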

Run the following commands to repeat the 10 queries:

$ port=9999
$ for i in {1..10}
  do
     echo "Welcome $i times"
     psql -U cloud_admin -h 127.0.0.1 -c "SELECT min(aid), max(aid) FROM pgbench_accounts" -p $port postgres
  done

After the queries finish, check the pgpool status again. You should see that the hot-standby helped handle some read-only queries, as shown below:

$ psql -p9999 -h 127.0.0.1 -U cloud_admin postgres -c 'show pool_nodes;'
 node_id | hostname  | port  | status | pg_status | lb_weight |  role   | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change  
---------+-----------+-------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 127.0.0.1 | 55432 | up     | unknown   | 0.250000  | primary | unknown | 12         | false             | 0                 |                   |                        | 2023-05-12 14:31:48
 1       | 127.0.0.1 | 55434 | up     | unknown   | 0.750000  | standby | unknown |  8         | true              | 0                 |                   |                        | 2023-05-12 15:26:24
(2 rows)

4.2. Scale down

To scale down, assuming there is not much load on the system, we can detach backend1 from pgpool, comment the backend_xxxx1 settings out again in pgpool.conf, and reload the configuration. The following commands can be used in sequence for this purpose:

$ ./pgpool2/src/tools/pcp/pcp_detach_node -h 127.0.0.1 -U postgres -p9998 -n 1
$ vim pgpool.conf
$ ./pgpool2/src/pgpool -f pgpool.conf reload

After this, run the 10 queries again and check the pgpool status. The result should show that backend1 is down and its select_cnt remains at 8, since all new read-only queries now go to the primary. The following command can be used to check the pgpool status:

$ psql -p9999 -h 127.0.0.1 -U cloud_admin postgres -c 'show pool_nodes;'
 node_id | hostname  | port  | status | pg_status | lb_weight |  role   | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change  
---------+-----------+-------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 127.0.0.1 | 55432 | up     | unknown   | 0.250000  | primary | unknown | 22         | true              | 0                 |                   |                        | 2023-05-12 14:31:48
 1       | 127.0.0.1 | 55434 | down   | unknown   | 0.750000  | standby | unknown |  8         | false             | 0                 |                   |                        | 2023-05-12 15:30:55
(2 rows)

5. Summary

In this blog post, we demonstrated a simple load balancing setup using pgpool and Neon compute nodes. We manually scaled up and down to shift read-only load to a hot standby and to remove it when the load decreased. Based on these test results, you can try to automate the process. Overall, this kind of load balancing setup can help improve database performance and increase availability.