My previous blog on “PostgreSQL High Availability: Considerations and Candidates” mostly talked about defining an HA considerations for PostgreSQL, RPO and RTO and briefly touched on some of the open source solutions available. Granted that I didn’t list them all and people shared some additional ones that I may end up reviewing as well. In case you missed my previous blog, here is the link: https://www.highgo.ca/2020/08/10/postgresql-high-availability-the-considerations-and-candidates/
I think it’ll be unfair to assess the HA solutions available before laying out all core requirements and considerations. So in this blog, without discussing any particular solution, I am going to mainly discuss setup, indirection and strategies for minimizing downtime.
Quorum and Avoiding Split Brain
An HA system is designed to not have any single points of failure. However, in a single primary PostgreSQL cluster, the primary node becomes just that. So what happens when it fails. Another node previously acting as a standby (replica) must be promoted to the primary role. However, it’s not that straight forward.
Quorum helps answer the following questions and enables us to strategies an acceptable solution. Quorum is essentially a voting system where the majority vote wins. This comes into play when there is a failure related to the primary node. We must know:
- if the primary node really failed?
- which standby to promote?
- what to do if the old primary node has rejoined the cluster?
An even number of nodes in a cluster could potentially end up with two equal sized cluster partitions. To understand this, let’s consider a two node cluster and let’s assume that the replication between the primary and secondary has failed. This begs the question which node has really failed? Is the primary or the standby?
Therefore, it is a good practice to have an odd number of nodes in a cluster and no less than three. Whether the third node is part of the PostgreSQL cluster or simply a witness node that runs certain software to assess the health of the other two nodes is up to you.
Promoting a Standby
Once the cluster has voted in favor of a standby, it must now be promoted to the primary status. There are different strategies on how this can be achieved. You may use a proprietary software like EnterpriseDB Failover Manager (EFM) from EnterpriseDB, or go with a distributed configuration system (DCS) like etcd, ZooKeeper, Consul etc. Such systems ensure that only one standby is promoted through managing locks.
For example, only one system can acquire the leader lock. After successfully acquiring it, it will attempt to promote itself to primary. If it succeeds, it must renew the lease of this lock every TTL. Failing to do so will lead to demotion or fencing for the system, and another node will attempt to acquire the leader.
STONITH, SMITH and Fencing
A temporary network outage could cause a network partition where the primary node isolated leading to standby’s promotion to primary. When the old primary rejoins the network and attempts to rejoin the cluster we may end up with two primary nodes in our cluster. Having more than one primary node in a single primary cluster will lead to a split brain scenario.
This can be avoided by:
- STONITH: Shoot The Other Node In The Head
- Where another node shuts down the old primary node by either terminating the PostgreSQL (or relevant) service, or through signalling a hardware switch to turn off the server.
- SMITH: Shoot Myself In The Head
- The old primary node sees that the cluster has moved on beyond it by electing another primary node, so to avoid any conflicts, it shuts itself down.
- Essentially is a way to keep the old primary node out of the cluster and prevent it from rejoining.
Post a primary node failure and a successful promotion of a standby node, clients and other systems in the cluster need to connect to the new primary.
This can be done in a number of ways including:
- Domain name reassignment to the new IP
- Load balancers like PgPool-II, Pgbouncer or HAProxy
- Assigning old primary node’s virtual IP to the new primary
Some middleware may internally manage the connections without exposing any changes to the client applications. However, the challenge is not entirely eliminated, it simply gets shifted from the database layer to the middleware failover and managing its indirections.
Configuring PostgreSQL for High Availability
You want to avoid unnecessary service restarts. So it is imperative that the PostgreSQL service is configured optimally for the foreseeable future taking in account any future needs that do not impact service responsiveness for the client applications.
But how do you identify which parameters in the “postgresql.conf” file to configure before starting the service. Fortunately, there is a view that can help.
SELECT name , CASE context WHEN 'postmaster' THEN '* RESTART *' WHEN 'sighup' THEN 'Reload' WHEN 'backend' THEN 'Reload' WHEN 'superuser-backend' THEN 'Reload - Superuser Session' WHEN 'superuser' THEN 'Reload - Superuser SET' WHEN 'user' THEN 'Reload - User SET' END AS Service_Context FROM pg_settings WHERE context != 'internal' ORDER BY Service_Context;
Note all the attribute names that appear have “* RESTART *” value for “service_context” column and ensure that these are set accurately to avoid any future restarts. You may additionally add “short_desc” and “extra_desc” columns to the above select statement for more details about these attributes. For more information about this view, check the PostgreSQL official documentation.
Monitoring plays a key role in achieving high availability. It can lead to detection of issues before they actually happen; say by identifying excessive CPU or RAM usage. While the health of the entire system must be monitored, health checking must include certain PostgreSQL elements.
Bad connections will invariably lead to trouble (pun intended). In the context of a highly available system that’s designed to run a long time, imagine a connection (or many) that may never drop and the query runs forever. It is therefore really critical that we keep a close eye on long running transactions/connections as they’ll be holding back key database activities like vacuum, or index creation.
To start with, idle_in_transaction_session_timeout GUC should be set appropriately such that normal running clients may never exceed this timeout. For more information about connections, explore the pg_stat_activity view that provides a lot of details which may help in identification of long running connections, applications and queries. See the column list for the view and note the column “state” which provides useful information about a connection.
postgres=# select datid, datname, pid, state, usename, application_name, backend_start, backend_type, query::varchar(10) from pg_stat_activity; datid | datname | pid | state | usename | application_name | backend_start | backend_type | query -------+----------+-------+--------+---------+------------------+-------------------------------+------------------------------+------------ | | 39559 | | | | 2020-11-18 13:05:08.22346+05 | autovacuum launcher | | | 39561 | | virus | | 2020-11-18 13:05:08.224053+05 | logical replication launcher | 12926 | postgres | 40264 | active | virus | psql | 2020-11-18 15:25:04.768399+05 | client backend | select dat 12926 | postgres | 69567 | idle | virus | psql | 2020-11-25 02:58:29.596571+05 | client backend | | | 39557 | | | | 2020-11-18 13:05:08.222998+05 | background writer | | | 39556 | | | | 2020-11-18 13:05:08.222704+05 | checkpointer | | | 39558 | | | | 2020-11-18 13:05:08.223214+05 | walwriter | (7 rows)
From within PostgreSQL, you can terminate a client by issuing either of the following commands:
-- Stop the currently executing query identified by the pid of the backend SELECT pg_cancel_backend(pid); -- OR SELECT pg_terminate_backend(pid);
If the application is still running, you may need to stop the client application or terminate it’s TCP connection on the PostgreSQL server.
Index Creation; Concurrently
When an index is created, the underlying table is locked in shared exclusive mode until the index has been created. This presents any insert, update or delete on the table.
One way of avoiding this issue is using “CREATE INDEX CONCURRENTLY” syntax while creating a new index. Although you could potentially run into any long running queries that predate this index creation command, it gives another option worth considering when creating an index.
Finally done with most of the software related considerations for a high availability setup for PostgreSQL. It now gears us up to review the solutions available. Hold tight, as I learn a few more tricks and work on my next blogs.
PS: I deliberately didn’t go too deeply into indirections, HAProxy and the likes of Pgbouncer as that might be a complete blog in itself.
Hamid has more than 19 years of professional software development experience and has been involved with PostgreSQL for more than 8 years now. He was part of EnterpriseDB and managed PostgreSQL official installers for Windows, Linux and MacOS. Hamid has recently joined HighGo Software Inc. as Senior DevOps and Database Architect.