Global Unique Index ATTACH Support and Its Potential Deficiency

Enterprise PostgreSQL Solutions

Comments are off

Global Unique Index ATTACH Support and Its Potential Deficiency

1.0 Introduction

In my previous post here, I introduced the global unique index feature that my colleague, David, and I work together and explained how global unique index guarantees cross-partition uniqueness during CREATE. In this blog, I will explain how we implement cross-partition uniqueness with ATTACH and a potential deficiency in this approach.

2.0 Global Uniqueness with ATTACH

When a regular table is attached to a partitioned table containing a global unique index, there are potentially 2 cases:

  1. the regular table has no matching unique index as the global unique index on the partitioned table
  2. the regular table has a matching unique index as the global unique index on the partitioned table

In the case of (1), the system will attempt to create a new global unique index on the partition-to-be that has the same properties. If you recall from the previous blog, the global uniqueness for CREATE works by doing heap scan on every partition, put the results in a global spool structure, and do a final sorting on the tuples inside global spool. This is the case with CREATE because by default it will recursively create a global unique index on all partitions. However, for insert, there is only one global unique index to be created on the partition-to-be, as the rest of the partitions already have global unique index. This means that the global spool structure would only contain the tuples of the partition-to-be, so it cannot be sorted to ensure global unique.

To remedy this issue, we added new logics to derive the OID value of each partition from the partitioned table and iteratively call the index create routine on each OID, with a special marking to indicate that this call comes from ATTACH. This special marking is very essential because it tells the index create routine to skip the actual index creation part. This is important becasue we do not really wish to create a new index for the existing partitions because they already have one. The special marking tells the index create routine to perform heap scans on the partition, put the tuples in the global spool, and finally skip the creation part.

we do this for each existing partition on the partioned table, so the global spool will eventually contain tuples from all existing partitions. Finally we will call the index create routine on the actual partition-to-be without the special marking. This will cause the routine to do a heap scan on the partition to be and save the tuples in the global spool that already contains tuples from other partitions. Then it will do a final sorting on the global spool to ensure global uniqueness.

In the case of (2), there is no reason for the system to create a new unique index on the partition-to-be, so we simply perform a relkind change, from RELKIND_INDEX to RELKIND_GLOBAL_INDEX. This is needed because this relkind determines if global uniqueness shall be enforced or not. Since global unique index only applies to partitioned table, when a partition is detached, we will change the relkind from RELKIND_GLOBAL_INDEX back to RELKIND_INDEX.

After the relkind change, we still have to ensure that the attached table does not have a value conflict with any existing partitions. To do this, we still have to derive the OID value of each partition from the partitioned table and iteratively index create routine on each OID with a special marking to skip the actual index creation. So we basically are re-using the exiting index creation routines to populate the global spool and perform a final sort on the last partition to ensure global uniqueness

The image below illustrates this process:

3.0 Potential Deficiency with this ATTACH approach

The implementation described above will ensure the global uniqueness, but its performance could be very slow under 2 cases:

  1. when there is a lot of concurrent INSERTs happening while we are attaching
  2. when the partitioned table has a lot of partitions

In the case of (1), when PostgreSQL is undergoing massive INSERT requests, it will try to acquire a exclusive lock on the relation (or in this case, an individual partition with global unique index). Other backends may do the same and acquire a exclusive lock on other partitions with global unique index. When we attach a regular table to a partitioned table with global unique index, the system will need to derive the OID of each individual partition and at the same time acquire a shared lock on the partition for reading. This will have to wait if there is any concurrent INSERT operation holding a exclusive lock on one or more of the partitions. In other words, attach needs to acquire a shared lock on all partitions before it can perform global uniqueness check, which means it has to wait for all concurrent INSERTs to commit before it can start doing its things. This could cause massive waiting on locks if database is busy with DML operations.

If we were to make attach to acquire shared access lock instead of shared (similar to CREATE INDEX CONCURRENTLY implementation), it will not be blocked by concurrent INSERTs, but the global uniqueness cannot be always guaranteed because there is always a chance that attach finishes checking a partition and the concurrent INSERT adds a new duplicate to the partition that we have just checked. My ideal outcome would be to have attach to proceed concurrently with concurrent INSERTs and whoever finishes first, let’s say attach, the concurrent INSERTs would fail at some point as it detects a duplicate as a result of the successful attach; if the INSERTs finish first, then the attach will fail as it finds a duplicate as a result of concurrent INSERTs. This is my ideal outcomes, but it requires some more experimentation to make it right.

In the case of (2), the higher the partition counts, the more iterations there are to check for uniqueness, which means lowered performance. Current PG does not check all partition for uniqueness violation because it does not support it, so its attach is very fast. But we have to check all to ensure uniqueness globally. This is yet to be enhanced somehow. If you have some good ideas to make it faster, feel free to reach out to us.