I recently had the opportunity of taking part in couple of interesting talks on the future of sharding in PostgreSQL. The first talk was delivered by Bruce Momjain in PostgreSQL conf Ottawa (May 2019) in which he presented the future of Sharding in PostgreSQL and talked about current state and future of built-in sharding in PostgreSQL. The second talk was presented by myself in PostgreSQL conference in Beijing (China – July 2019) in which i also talked about the current state and future of built-in sharding in PG. I essentially used the slides from Bruce’s presentation (with his permission of-course) and added few more slides discussing the requirements that warrant scalability and also talked about other alternatives for global transaction manager.
During the PostgreSQL conference in Ottawa, I continued with the tradition of arranging a breakfast meeting during the conference week to discuss the sharding topic. We have been doing this every year for last several years when i was working with EnterpriseDB and continued the meeting now that i am working with HighGo Software Inc. The breakfast meeting this year was attended by folks from EDB, NTT, HighGO and Pivotal, the number of participants this year were more then what we have been getting for last years so we had to opt for a bigger breakfast table. 🙂 The purpose of this meeting to get folks from companies that are working on this feature in the community and this time also folks from companies that intend to contribute to this feature in the community. The agenda is to discuss the current state of built-in sharding, ongoing work, what’s coming down the pipe. The other topics discussed in this meeting is to discuss the overall goal for the sharding feature, what use-cases we expect to benefit from this features, what are the rough time-scales for getting something in PG that we can called sharding.
About this Blog
I was initially hoping to cover this topic in one blog but since this feature is so huge the blog also turned out to be much longer then what I was initially expecting to write. I have covered this into 3 blogs so people don’t get tired reading it and loose there interest. The conclusion is given at the end of 3rd blog of this series.
The first blog of the series will discuss scalability, what are different types of scalability and what are currently available options for scalability in PostgreSQL. The second part of the blog focuses on declaratively partitioning added in PG 10 and how sharding in PostgreSQL works today. The third part of this blog focuses on Built-in sharding architecture, pushdown capabilities already in PG and lastly the remaining features and there associated challenges.
One thing that I didn’t talk about in this blog is other solutions that have tried to provide sharding, there are postgres-xc,xl and pg_shards of this world. I do believe some of these attempts have been impressive but all have there limitations which prevented them for reaching prime time with PG community.
What is Scalability and why we need it
Before we get into sharding, it is important to understand “what is scalability” and the need for scalability and what are different ways to make your database scalable. Database scaling is basically the ability to increase the database throughput/performance by increasing resources such I/O, memory, CPU or adding additional computers to the cluster. The addition of more computes in-turn also means increasing the database performance by using resources from multiple systems to do the job. Your database only needs scalability if you are running your PostgreSQL database on a single standard server and it is not able to meet the need of your workload. The symptoms that points to reason of needing scalability are the database not performing well, the data size is growing very large and it can’t fit in the server’s storage, the database is getting too many concurrent clients that it is not able to handle, there are complex analytical queries that are taking too long to run or not returning any data at all in some cases and grinding the system to a halt.
When the user starts to see some of the problems mentioned in the previous paragraph, they need to think about various options to address this issue before they ultimately start thinking about scalability. The various options that user should consider before thinking about scalability are the following :
- Schema optimisation – Need to make sure that the schema is fully optimised to handle the workload and the type of queries that are sent to the database server. Important aspect is to check whether the user tables are big enough to be partitioned so the queries can routed to smaller size tables. Also need to think whether the tables are properly normalised.
- Application logic tweaking – Another important area that we need to think about when running into issue is application logic tweaking. The user can think about whether the application can be tweaked to get the best performance from the database. This can involve rewriting some of the database queries or in some cases doing some refactoring at the application level to speed-up performance.
- Database configuration parameter tuning – This is probably most important suggestion from the options, the database configuration parameters needs to be turned according to the workload and type of traffic generated for the database. The values for important parameters like shared memory, work mem, checkpoint etc need to configure properly and according to the worklaod. This is a good source that discusses some of the important configuration parameter for PG database tuning.
Having tried all of the above options or more and the problems still persist then it is best to start thinking about scalability options. The database can be scaled in two ways :
The two options of scalability are fairly simple to understand, vertical scalability is about scaling upwards and horizontal scalability is about scaling side ways. Vertical scalability is achieved by enhancing the single server, adding more resource to the server to achieve scalability. Horizontal scalability is about throwing more computers at the problem, adding more nodes to the cluster in order to use resources from multiple machines to gain database scalability.
Vertical scalability is the way to go if it meets the user requirements and it is also cost effective for the user. The reason vertical is the preferred option is due to easy of configuration and minimum maintenance and manager overhead. You can simply beef up the server with more faster storage, more memory and add more and faster CPU to increase the performance of your database. There are mainly two reasons where vertical scalability is no longer the option for your workload. First and foremost reason is that enhancing the server can be very costly and at some point it may become cost in-effective for the user. The second reason is that the workload requirements i.e. storage, number of concurrent clients and incoming traffic becomes so much that the single server couldn’t scale to cater for the workload. There are discussion’s that PostgreSQL can only scale unto 64 cores and after that the line becomes linear. However they are published benchmark results that show PostgreSQL scaling on more then 64 core, I believe the core based scaling is relative to number of concurrent clients and database traffic.
Horizontal scalability is about adding more then one computer to the database cluster in order to use the resources of additional computers to gain horizontal scalability. Horizontal scalability is not the first choice because it involves the overhead of configuration and maintenance of cluster of machines so the user need to think about downtime, failover, backup etc of all the machines in the cluster. Horizontal scalability becomes the obvious choice if the workload requirements can’t be satisfied with a single server for the reasons given in the previous paragraph. There are ways to get horizontal scalability even without sharding, the most popular solution of non-sharding horizontal scalability is read scalability with Pgpool II. Pgpool II is middleware product that sit in between the client and PostgreSQL cluster and provides functionality like connection pooling, load balancing, failover etc. Pgpool II provides horizontal scalability by load balancing, sending the writes to primary node and load balancing the read statements across the stand-by nodes. Pgpool II provides connection level and statement level load balancing, with connection level load balancing Pgpool II process makes a connection with a stand-by node, all the reads for the connection is send to the stand-by node and writes are sent to the primary node. With statement level connection (coming in Pgpool II 4.1), every statement is load balanced across the stand-by nodes. There is a good article on this topic https://highgo.ca/2019/07/19/pgpool-ii-4-1-taking-the-bull-by-its-horn/