Bloating in database is created when tables or indexes are updated, an update is essentially a delete and insert operation. The diskspace used by the delete is available for reuse but it is not reclaimed hence creating the bloat. Same is the case with PostgreSQL database, frequent UPDATE and DELETE operations can leave a lot of unused space in table or index relation files on disk. Over the time this space can build up and cause the performance degradation for both tables and indexes. This buildup is referred to as bloated tables or indexes.
This blog will look into how bloat is created in the database and how the bloat problem can be resolved.
How does it happen:
The PostgreSQL system implements the MVCC (Multiversion Concurrency Control) and the way MVCC is implemented, it can introduce bloat’s in your system.
When an UPDATE or DELETE statement is used in PostgreSQL, it does not physically remove that row from the disk. In an UPDATE case, it marks the effected rows as invisible and INSERTs the new version of those rows. While DELETE is little simple the effected rows are just marked as invisibles. These invisibles rows are also called dead rows or dead tuples.
What this means? and why is it of any significance? Over the time these dead tuples can accumulate to a huge number and in some worst case scenarios, it possible that this accumulation is even greater that the actual rows in the table becomes unusable.
You see, these rows were marked invisible but they are still part of the table and are consuming the disk space… Assuming there are a million rows in a table, where each row takes 100bytes of disk space. This table is assumed to be consuming around 100MB of disk space. Now let’s assume there are 30% invisible rows present in the table, that would mean that 130MB’s of disk space is being utilised by the dead tuples. This looks insignificant amount but consider the real world scenario where tables use GB/TB’s of data and it becomes a serious problem.
So how to detect bloats?
There are couple of ways to detect the bloat and identify the problematic areas in your database. We will be looking at one particular method that I believe is the more accurate one.
pgstattuple
This is an extension module that can be utilized to get a clear picture of your tables. There are couple of statistic information available in postgresql that can be used for figuring out the bloat in tables and indexes. However estimates based on these can be little off. But with pgstattuple, we can get the accurate picture of actual bloat in indexes and tables. let’s consider the following example:
postgres=# CREATE TABLE test as SELECT x, md5(random()::text) as y FROM generate_Series(1, 1000000) x; SELECT 1000000 postgres=# CREATE INDEX ON test (x); CREATE INDEX postgres=# postgres=# SELECT pg_size_pretty(pg_relation_size('test')) as table_size, postgres-# pg_size_pretty(pg_relation_size('test_x_idx')) as index_size, postgres-# (pgstattuple('test')).dead_tuple_percent; table_size | index_size | dead_tuple_percent ------------+------------+-------------------- 65 MB | 21 MB | 0 (1 row) postgres=# postgres=# DELETE FROM test WHERE x % 3 = 0; DELETE 333333 postgres=# ANALYZE test; ANALYZE postgres=# SELECT pg_size_pretty(pg_relation_size('test')) as table_size, postgres-# pg_size_pretty(pg_relation_size('test_x_idx')) as index_size, postgres-# (pgstattuple('test')).dead_tuple_percent; table_size | index_size | dead_tuple_percent ------------+------------+-------------------- 65 MB | 21 MB | 29.78 (1 row)
See the table size remains the same, however the output of pgstattuple shows that 29.78% of disk space is wasted. It’s taking the space in table but not useable anymore.
Now let’s take a look at the index.
postgres=# SELECT pg_relation_size('test') as table_size, postgres-# pg_relation_size('test_x_idx') as index_size, postgres-# 100-(pgstatindex('test_x_idx')).avg_leaf_density as bloat_ratio; table_size | index_size | bloat_ratio ------------+------------+------------------- 68272128 | 22487040 | 9.939999999999998 (1 row) postgres=# UPDATE test SET x = x + 2 WHERE x % 2 = 0; UPDATE 333334 postgres=# SELECT pg_relation_size('test') as table_size, postgres-# pg_relation_size('test_x_idx') as index_size, postgres-# 100-(pgstatindex('test_x_idx')).avg_leaf_density as bloat_ratio; table_size | index_size | bloat_ratio ------------+------------+------------- 69976064 | 44941312 | 41.08 (1 row)
After the above operations, index has become 41.08% bloated. That means that the performance of this index will degrade because that much entries are either empty or pointing to dead tuples.
How to remove bloat?
So how do we remove the bloat from our database? There are couple of ways which can be utilised to avoid the bloat in tables and indexes.
VACUUM
The first thing should be the regular use of VACUUM command. This will remove the dead tuples by adding that space to free space map so that it can be reutilised by the database system. It will also try to remove index entries that point to these dead tuples. However this space will not be returned to the OS and will not shrink the table size.
Normally autovacuum background process can take care of that. The default setting are set to vacuum tables whenever their is 20% change in the table. So if you are seeing bloat’s frequently, make sure that autovavuum is enabled and also adjust the settings according to data size.
Physical Reordering
If the vacuum could not keep up and help avoiding the bloat’s then you will have to do physical reordering of the table. The physical reordering involves rewriting the whole table. There are couple of ways in which this can be achieved.
- VACUUM FULL
VACUUM FULL will remove all bloat’s in a table and its associated indexes completely and will reclaim the disk space to OS. This will reduce the on disk table size. It does all that with rewriting the whole table, however this is an expansive operation and it will lock the table or index for the duration of this operation, this is not desirable in most situations. - CLUSTER
The other option is to use the CLUSTER command. This will also rewrite the table but it does that according to the specified index. Other then that it also requires to lock the table and prevents read and write operation on the table until CLUSTER operation is completed. - pg_repack
This is an extension that is helpful in situations where VACUUM FULL or CLUSTER might not work. This extension restructures the table by creating a completely new table based on the data of the bloated table. While tracking the changes being made to the original table and will swap the new table with the original in the end.
This method does not lock the table for any read or write operations and considerably faster than VACUUM FULL or CLUSTER commands. - REINDEX
This option can be used to remove the bloat from indexes. This command rebuilds the index specified or all indexes on the table. This option does not blocks the reads but will block the writes. However one can use the CONCURRENTLY option to avoid that but it may longer to complete than standard index creation.
Conclusion
In this blog we have go over the meaning of database bloat, how it is created and what are the different ways to monitor bloat in PostgreSQL and how to get it resolved. The database bloat is a serious database issue, it can cause serious affect on the continuity of business as it can prevent a database from functioning or seriously cause performance degradation. I believe there have been attempts to address the database bloat issue in PG, one of the main purpose of the zheap project is to handle database bloat.
Asif Rehman is a Senior Software Engineer at HighGo Software. He Joined EnterpriseDB, an Enterprise PostgreSQL’s company in 2005 and started his career in open source development particularly in PostgreSQL. Asif’s contributions range from developing in-house features relating to oracle compatibility, to developing tools around PostgreSQL. He Joined HighGo Software in the month of Sep 2018.
Recent Comments