Horizontal Scalability / Sharding in PostgreSQL Core – Missing pieces of the puzzle

Enterprise PostgreSQL Solutions

4 comments

Horizontal Scalability / Sharding in PostgreSQL Core – Missing pieces of the puzzle

Back in August 2019, I wrote multiple blogs with the title of “Horizontal scalability with Sharding in PostgreSQL – Where it is going Part 1 .. 3”. Little has happened since then, the purpose of this blog is discuss the important missing pieces of the puzzle, what are the minimum set of features needed to get to MVP (minimum viable product) and most importantly which efforts are currently going on to get to the MVP of Sharding in PostgreSQL core. The blog also mentions some features that are far fetched and potentially not needed for MVP however they are really important for enterprise level deployment of a distributed cluster solution. Last but not the least the blog will continue to emphasise the importance of this feature in the core of PostgreSQL.

Recap on FDW based Sharding

Just to recap, sharding in database is the ability to horizontally partition the data across one more database shards. It is the mechanism to partition a table across one or more foreign servers. While declarative partitioning feature allows the user to partition the table into multiple partitioned tables living on the same database server. Sharding allows the table to be partitioned in a way that the partitions live on external foreign servers and the parent table lives on the primary node where the user is creating the distributed table. 

The build-in sharding feature in PostgreSQL is using the FDW based approach, the FDW’s are based on sql/med specification that defines how an external data source can be accessed from the PostgreSQL server. PostgreSQL provides number of foreign data wrapper (FDW’s) that are used for accessing external data sources, the postgres_fdw is used for accessing Postgres database running on external server i.e. the shards. Using the FDW based sharding, the data is partitioned to the shards, in order to optimise the query for the sharded table, various parts of the query i..e aggregates, join etc are pushed down to the shards. This enables the heavy query processing to be done on the shards and only results of the query are sent back to the primary node. 

Please note that FDW based sharding the approach that PostgreSQL community is following in order to implement this feature. Using the FDW architecture surely adds some overhead which can be avoided by other more sophisticated cross node communication techniques. The trade-off with FDW sharding is that it is using a trusted architecture and it is more simpler and relatively less time consuming to implements as compared to other methods…

My series of blogs mentioned in the introduction discusses the FDW sharding feature and its architecture in detail, only mentioning the summary here for the context.

The short example describe how a sharded table can be created today using the postgres_fdw.

Following commands are executed on the primary node, the foreign server is creating pointing to the database shard, the user mapping for the shard is created accordingly.

The parent table is created on the primary node :

The partition is created on the primary node :


The following partition is created on the shard :

Missing pieces of the puzzle

This section talks about the missing features that are required to achieve the MVP for sharding in PostgreSQL. Again I talked about these features in my earlier blogs however the balls has moved forward slightly on these since my blogs from August 2019. I am providing the latest updates on these features and some rough guesses on when we can see these in PG however it is never easy to predict when a feature will get committed to PostgreSQL.

1- Global transaction manager (Two Phase commit for FDW transactions)

Currently FDW transactions don’t support two phase commit, this means that if you are using multiple foreign servers in a transaction and if one part of transaction fails in one foreign server then the entire transaction on all foreign serves are suppose to fail. This feature is required in order to guarantee data consistency across the database cluster. This feature is required in order to support OLTP workload hence it is very important for sharding feature.

The patch for supporting two phase commit for FDW transactions was submitted to the community few years back. There has been number of back and forth on this feature and so far the design is not blessed by the senior members of the community.

Masahiko Swada recently added the patch to the commit fest, the commit fest entry is given below…

https://commitfest.postgresql.org/26/1574/

This is the rebased and updated patch, it was reviewed by several community members and updated patches were submitted to hackers.

It is fair to say that this patch requires more work and more testing for failure and other scenarios before it will get a serious look from the senior community members. Muhammad Usama from HighGo Software is currently working on the updating the patch to improve the patch quality, perform comprehensive testing and fixing any bugs he finds while doing so. He will be submitting the latest patch to community soon.

It is very late in the game for PG-13 and but it looks promising for PG-14.

2- Global snapshot for cluster-wide read consistency

The purpose of global transaction/snapshot manager is suppose to provide global transactional consistency.  This is needed in-case of cross node transaction in order to achieve proper snapshot isolation. Transactions that span across multiple nodes need to have a consistent view of the cluster, the global snapshot feature guarantees that distributed (i.e. cross node transactions) transactions get a consistent snapshot of the cluster. 

Lets suppose you have two concurrent clients that are using a sharded table, client #1 is trying to access a partition that is on server 1 and client #2 is also trying to access the partition that is also on server 1. Client 2 should get a consistent view of the partition i.e. any changes i.e. updates etc made to the partition during client 1 transaction shouldn’t be visible to client 2. Once client 1 transaction gets committed, the charges will be visible to all new transactions. The Global snapshot manager is suppose to ensure that all cross-node transaction gets a consistent view of the database cluster. All the concurrent clients using the database cluster (with tables sharded across multiple foreign servers) should see consistent view of the database cluster.

Postgres Professionals have done allot of work in this area sometime back but the patches submitted by them to support global snapshot have stalled in the community. Here is link to the community thread for implementing global snapshot manager : 

https://www.postgresql.org/message-id/21BC916B-80A1-43BF-8650-3363CCDAE09C%40postgrespro.ru

There are following the approach of CSN (commit sequence number) based snapshot for providing atomic visibility. The CSN based snapshot will get integrated with Clock-SI to provide global snapshot feature. Please read the above thread to understand the functionality and architecture of the patch.

I believe the patches submitted to Postgres Pro are pretty good and in the right direction for achieving global snapshot isolation for cross-node transactions. However the sequence of how these patches were submitted to the community is one reason of why they didn’t get the due attention.

HighGo software is working on this feature basically leveraging the work done by Postgres Pro. We are starting with CSN based snapshot since that is basic underlying block. We are working on submitting a rebased, enhanced and well tested version of this patch, we are also trying to convey why the CSN based snapshot is valuable on its own without the global snapshot manager.

After completing the above, the plan is to work on the integration of CSN snapshot and clock SI algorithm for achieving global snapshot feature.  

3- Parallel FDW scan

The executor re-write planned for a few years now in order to provide asynchronous query execution is still on hold with no visible signs of making any progress. Apparently it requires a senior committer who has a in-depth understanding of the executor to perform the executor refactoring. The following patch is added to the commit fest, it provides asynchronous append of Postgres-FDW nodes, the idea is that append on postgres_fdw can run simultaneously hence reducing the overall query time significantly. For example aggregating that can be pushed down to the remote servers can be executed in parallel on the remote servers.

https://commitfest.postgresql.org/27/2491/

Horiguchi has provided the benchmarking results that shows that the patch doesn’t cause any performance degradation for non-async append queries for non Postgres_fdw scans. 

Movead from HighGo Software performed benchmarking of this patch on AWS in a distributed environment. Please look for community thread with subject “Append with naive multiplexing of FDWs”, it provides the benchmarking details and shows significant performance gains with multiple shards.

While this patch is destined for PG-14, Thomas Munro has provided a simple and interim solution that provides Append with multiplexing of FDWs. This a trade-off until we have a better solution, it allows multiplexing of FDW scan under the Append node. This patch is targeted for PG-13, it will be win if we can prove that change provides the performance benefits and doesn’t cause any degradation for Append for non-FDW scans.

3- Shard management

The partitions on foreign servers are currently not getting created automatically, as described in section above, the partitions needs to be created manually on foreign servers. This can be very tedious task if you are creating a partition table with large number of partitions and sub-partitions. The shard management feature is suppose to provide the ability to auto-create the partitions and sub-partitions on the foreign servers. This will make the creation of sharded tables very easy.

This feature was added to highgo postgresql (HG-PSQL) 1.0 version, the latest version of highgo postgres is available for download here : https://www.highgo.ca/products/

This feature is not part of community PostgreSQL yet.

Summary

The above pieces are the minimum set of features required for MVP of Sharding and in order to consider it for a real world workload. There are some arguments that with parallel FDW scan, sharding can used for basic OLAP use-case however without 1 and 2 listed above it cannot be considered for a OLTP or a mixed workload.

As i mentioned earlier in the blog that there are features that aren’t needed for achieving MVP of sharding however they will be required enterprise level deployment of sharding. Section below briefly list down these features.

Shard Cluster Management and Monitoring

The shaded cluster management and monitoring is a very important aspect of this feature,  while this is a enterprise level feature, it is very critical for any distributed cluster solution. The cluster management will provide features like the ability to add or remove shard from the cluster. When new shards are added to the cluster or existing shards are removed from the cluster, the data needs to be partitioned accordingly which should be handled by the cluster management tool.

The monitoring part of the tool will provide the ability to monitor the cluster node and provide health check statics of the nodes. 

Shard Cluster backup and recovery

Performing backup of the whole cluster and doing recovery in-case of a failure or crash is the most important to any enterprise level user. The cluster backup and recovery tools needs to provide the ability to perform backup of the entire cluster and perform recovery in-case of crash or failover. I am not sure whether this mean backup or restore of each shard individually or a feature that provides this functionality seamless for the cluster. More thinking is required for this… 

Shard Cluster High Availability

High Availability is very crucial for any enterprise application, the importance of this is increasing very rapidly. As more and more solutions for HA for PostgreSQL emerge, there is no solution that provides a cluster level high availability solution. The need for a cluster wide HA will become paramount once the sharding feature is ready for production. The ability to provide automatic failover for cluster nodes including the primary node and the shards.

Conclusion

The PostgreSQL community has been working on horizontal scalability / sharding in the core for a few years now. I have been part of this journey from the start, the FDW based sharding met a lot of resentment in the beginning. However the community seems more acceptable to this approach of sharding for the reasons given in the introductory section. The other solutions for horizontal scalability outside of core i.e. postgres xc/xl, pg_shards etc got some success but overall it is evident that they aren’t able to fulfil the market requirements in this area. 

Number of companies like EDB, NTT and more recently HighGo Software Inc has shown a great deal of commitment for getting this feature in the Core. They have committed full-time resources for working on this feature in the community, lots of valuable features like the FDW pushdown’s are a result of effort, these features are very valuable on there own.

However with PG 13 and partly with PG-12, it seems that the momentum for this features has somewhat slowed down. The features that I listed in the “Missing pieces of the puzzle” section are mandatory for any read/write read world workload before we can consider adopting this solution. I believe the community needs to pay more attention to the efforts that are going on to implement these features in the Core.

4 Responses

  1. Ahsan Hadi Jobin Augustine says:

    This is an awesome write up about the necessary building blocks for the sharding infrastructure.
    Enjoyed reading this. Great job Highgo team.

  2. You mentioned pg_shard as another example of doing horizontal partitioning within PG. pg_shard is deprecated and was integrated into the latest Citus extension for PostgreSQL. I wonder how Citus handles Global Transaction Manager/2PC and Global Snapshots (isolation level).

    • Ahsan Hadi Ahsan Hadi says:

      Hi Micheal – I posed this question the Citus team and this is what i got back..

      At a high level, Citus provides the following guarantees: ACID semantics for queries that are scoped to a single machine, and ACD semantics for queries that span across machines.

      Queries that are scoped to a single machine is a pretty common in multi-tenant databases. In that case, Citus relies on Postgres to handle the ACID transactions on the single machine. Within a transaction block on the coordinator, send all the commands to the worker over the same connection and have the idential transaction block on the shards. That provides the ACID semantics for transactions that involve only a single machine.

      In multi-machine transactions, there can be different scenarios. First, queries that “modify” the same rows, such as concurrent bank balance transfers between two bank accounts, where the underlying table is distributed by account id. In that case, Postgres serializes access to the same row in each shard, and one of the transactions will block. That provides “I”solation gurantee even for multi-machine transactions. In the second case, assume one of the two transactions is a modifying the database and the other is an analytical query touching multiple machines (or vice versa). In this case, the analytical query may partially see the affect of modifications (e.g., some shards may return commited result after modification, some may not.). Note that the window for such cases are very minimal in practice as Citus does mofications involving multiple nodes with 2PC, but of course it could happen. Technically, we sacrifice “I”solation for performance in this scenario. Implementing isolation for these transactions requires some sort of communication between the transactions, which means additional latencies. And, performance is key for many of our customers — as they prefer to scale out their workloads mostly for better performance.

      As far as we can see, there is one more important angle to this trade-off. When we talk to customers/users, we see that ACID particularly important for SaaS apps. In SaaS / multi-tenant apps, customer usually shards by tenant_id. So all queries for a business are scoped to a single node and get important ACID benefits. In most other cases, as the data sizes are typically quite large, users don’t specifically look for transaction isolations across distributed transactions.
      Finally, as a semi-related concept, allowing concurrent distributed transactions means that there could be deadlocks that involves multiple nodes. Thus, Citus has a distributed deadlock detection, which behaves very similar to Postgres’ deadlock detection. The main difference is that the input for the deadlock detection is gathered from all nodes in the cluster, not a single node as Postgres does.

Leave a Reply

Your email address will not be published. Required fields are marked *