Any performance feature in PostgreSQL that mentions the “parallel” keyword excites me. So it was natural that parallel vacuum took my complete attention when it was committed in January 2020.
For the sake of simplicity, we are not going to distinguish between index cleanup or index vacuum in this blog and simply use the term vacuum to refer to both.
The patch is authored by Masahiko Sawada and Amit Kapila. The discussion on parallelising the vacuum process has been going on in the community for a number of years, with a lot of going back and forth and tons of emails.
My expectation of parallel vacuum was that if we have 3 workers that can perform vacuum in parallel on a table with two indexes, then two workers will work on one index and the remaining worker will work on the other remaining index. Digging further into the functionality committed, sadly, this wasn’t the case. It is clear that parallelism in vacuuming of index is NOT derived from multiple workers working on the same index. So how does it work?
How Parallel Vacuum Works
There are some preconditions for parallel vacuum to happen. Vacuum on indexes runs in parallel only if:
- Parallel vacuum is enabled
- Parallel vacuum is requested
- If the table has more than one index
- The index size is greater than “min_parallel_index_scan_size”
- If and only if an index can be scanned in parallel
When parallel vacuum is enabled, a dynamic shared memory (DSM) context is prepared. The leader process collects dead tuple TIDs and store those in the DSM segment. After this, a function to vacuum all indexes is called.
Once parallel vacuuming of indexes is complete, index statistics are updated by the leader process since writes in catalog tables are not allowed during parallel mode.
Enabling Parallel Vacuum
The number of workers available to parallel vacuum depends on
“max_parallel_maintenance_workers” GUC configuration. It can be set in
“postgresql.conf” file or within the current session by using the set command. The current value can be seen by issuing the show command.
“min_parallel_index_scan_size” indicates the minimum size of index which can be considered for parallel vacuuming.
Requesting Parallel Vacuum
The syntax of the vacuum command has been modified to include an additional “PARALLEL” keyword with a mandatory integer value specifying the number of parallel workers requested. There is no guarantee that the requested number of workers will be allotted to the parallel vacuum process. You can get more detail about this from PostgreSQL documentation.
vacuumdb command has been modified to accept a
“-P parallel_degree” or
“--parallel=parallel_degree” command line argument.
If you don’t provide the
"PARALLEL" option, the number of workers is decided based on
max_parallel_maintenance_workers and on number of indexes that support parallel vacuum.
Running Parallel Vacuum
Time to have a look at an example which is a modified version of parallel vacuum test case introduced in PostgreSQL 13 along with the functionality.
Let’s connect to a running PostgreSQL 13 dev version via
psql and first check the settings of relevant GUCs for our current session.
postgres=# SHOW MAX_PARALLEL_MAINTENANCE_WORKERS; max_parallel_maintenance_workers ---------------------------------- 2 (1 row) postgres=# SHOW MIN_PARALLEL_INDEX_SCAN_SIZE; min_parallel_index_scan_size ------------------------------ 512kB (1 row)
So we’ll at most use two parallel workers for indexes with sizes exceeding 512kB. Proceeding on with creating a table, populating it and creating indexes.
postgres=# CREATE TABLE parallel_vac (i INT, a INT) WITH (autovacuum_enabled = off); CREATE TABLE postgres=# INSERT INTO parallel_vac SELECT i, array[2,7,9] FROM generate_series(1,100000) i; INSERT 0 100000 postgres=# CREATE INDEX btree_i_parallel_vac ON parallel_vac USING btree (i); CREATE INDEX postgres=# CREATE INDEX btree_a_parallel_vac ON parallel_vac USING btree (a); CREATE INDEX postgres=# CREATE INDEX btree_ia_parallel_vac ON parallel_vac USING btree (i, a); CREATE INDEX
Before we run parallel vacuum on these indexes, let’s check the index sizes.
postgres=# SELECT c.relname postgres-# , c.relkind postgres-# , pg_size_pretty( pg_relation_size(c.oid) ) postgres-# FROM pg_class c postgres-# WHERE c.relname = 'parallel_vac' postgres-# OR c.oid IN postgres-# ( postgres(# SELECT i.indexrelid FROM pg_index i WHERE i.indrelid = 'parallel_vac'::regclass postgres(# ); relname | relkind | pg_size_pretty -----------------------+---------+---------------- btree_i_parallel_vac | i | 2208 kB parallel_vac | r | 6672 kB btree_a_parallel_vac | i | 5800 kB btree_ia_parallel_vac | i | 5760 kB (4 rows)
We can see the index sizes are above the set threshold. Now run parallel vacuum with verbose and 2 parallel workers.
postgres=# VACUUM (PARALLEL 2, VERBOSE) parallel_vac; INFO: vacuuming "public.parallel_vac" INFO: launched 2 parallel vacuum workers for index cleanup (planned: 2) INFO: index "btree_i_parallel_vac" now contains 100000 row versions in 276 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU: user: 0.00 s, system: 0.05 s, elapsed: 0.04 s. INFO: index "btree_a_parallel_vac" now contains 100000 row versions in 725 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU: user: 0.00 s, system: 0.07 s, elapsed: 0.07 s. INFO: index "btree_ia_parallel_vac" now contains 100000 row versions in 720 pages as reported by parallel vacuum worker ... ... ... VACUUM
So we had three indexes and requested two workers; i.e. one worker and the leader process. Before we run vacuum again, we need to update rows in the table, set the number of workers to 4, and re-run vacuum with 4 parallel workers.
postgres=# UPDATE parallel_vac SET i = i WHERE i < 100000; UPDATE 99999 postgres=# SET MAX_PARALLEL_MAINTENANCE_WORKERS=4; SET postgres=# SHOW MAX_PARALLEL_MAINTENANCE_WORKERS; max_parallel_maintenance_workers ---------------------------------- 4 (1 row) postgres=# VACUUM (PARALLEL 4, VERBOSE) parallel_vac; INFO: vacuuming "public.parallel_vac" INFO: launched 2 parallel vacuum workers for index vacuuming (planned: 2) INFO: scanned index "btree_i_parallel_vac" to remove 99919 row versions DETAIL: CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.09 s INFO: scanned index "btree_a_parallel_vac" to remove 99919 row versions by parallel vacuum worker DETAIL: CPU: user: 0.02 s, system: 0.03 s, elapsed: 0.10 s INFO: scanned index "btree_ia_parallel_vac" to remove 99919 row versions by parallel vacuum worker ... ... ... VACUUM
So clearly, an additional worker was involved here as compared to the previous example; i.e. two vacuum workers and the leader process. This is the maximum number of vacuum workers that can participate in parallel vacuuming of this relation as it has only three indexes. I’m not entirely sure if this is a bug in the vacuum output as for both examples, it mentions “launched 2 parallel vacuum workers for index cleanup (planned: 2)”. In the first case, only one index is vacuumed by a parallel vacuum worker whereas two were vacuumed in the second case. I would’ve expected only two parallel worker in the first case.
In parallel vacuuming of indexes, the leader process is responsible for preparing the environment for workers. It also vacuums an index and calculates index statics once the parallel phase has completed.
The Performance Gain
This blog does not include benchmarking of parallel vacuum, here are some basic numbers to show the significant performance improvement.
postgres=# \timing Timing is on. postgres=# UPDATE parallel_vac SET i = i WHERE i < 100000; UPDATE 99999 Time: 2132.375 ms (00:02.132) postgres=# vacuum parallel_vac; VACUUM Time: 367.787 ms postgres=# UPDATE parallel_vac SET i = i WHERE i < 100000; UPDATE 99999 Time: 1376.394 ms (00:01.376) postgres=# VACUUM (PARALLEL 4) parallel_vac; VACUUM Time: 233.643 ms
In very basic and crude terms, parallel vacuum reduces the time by more than 30%. Neither the table nor the index or the database is configured to benchmark parallel vacuum properly. That said, these numbers clearly give a picture of what kind of performance gain to expect from this feature.
The Benefit of Parallel Vacuum
Parallel vacuuming of indexes is the first step towards a more parallel vacuum process where it can leverage the benefits of having multiple CPUs more effectively. Although, in this blog, we did not benchmark performance of parallel index vacuum versus serial vacuum, there are obvious performance benefits that we can see with parallel vacuum.
I believe over the course of next few months, the functionality will become more stable, it is definitely something to look forward to in PostgreSQL 13 release.
Hamid has more than 19 years of professional software development experience and has been involved with PostgreSQL for more than 8 years now. He was part of EnterpriseDB and managed PostgreSQL official installers for Windows, Linux and MacOS. Hamid has recently joined HighGo Software Inc. as Senior DevOps and Database Architect.