Recently I have been involved in creating a solution that guarantees cross-partition uniqueness within a partitioned table consisting multiple child tables. Some people refer to this feature as a global index and there has been some discussion about it in the email thread here. Though the idea is good, the approach sparks a lot of controversies as the approach changes how partitioned indexes are stored. It basically store all partitioned index together as one and uses
TableOid as a key to reference internally. My team and I are experimenting an alternative approach that guarantees cross-partition uniqueness without changing the fundamentals, but before we can do that, we have to understand how uniqueness works in partitioned tables in PG.
2.0 CREATE INDEX Uniqueness Guarantee
PG follows this basic procedure to check uniqueness violations
- perform a heap scan on the target child partition table
- store the visible tuples into a
BTSpoolstructure (spool1) and dead tuples into another
BTSpoolstructure (spool2), so there are 2 BTSpool structures used and spool2 can be NULL if a table has no dead tuples or uniqueness is not required. A spool structure can be understood as a collection of index tuples.
- perform sort on spool1 and spool2 if available
- the sorting algorithm is equipped with duplicate detection, if two of the same tuples are subsequently sorted together, a duplicate is found and will raise Error here if uniqueness is required.
- if no duplicate is found from sorting, PG will build the index tree based on spool1 and spool2
- Destroy all the spools when index creation is done
- This logic is located in btbuild() function within
src/backend/access/nbtree/nbtsort.cand is invoked from DefineIndex function in indexcmds.c. btbuild will be called multiple times from DefineIndex depending on the number of active child partition tables.
3.0 INSERT And UPDATE Uniqueness Guarantee
At planner and optimizer stage, PG already know which child partition table the new data should be inserted or updated to
- PG will first inserts the heap tuple in target heap relation.
- then it will attempt to insert new index tuples associated with the heap tuple by calling
- when uniqueness check is required by the index, PG will construct a scan key from the new heap tuple and try to fetch an existing matching tuple from the heap partition table by calling
- if no matching heap tuple is found from the current child partition, then there is no conflict
- if a matching heap tuple is found from the current child partition, additional checks will be done in below:
- if the found tuple is not yet committed, (for example, another backend is still working on this tuple and has not yet committed), the process will wait here until the other process commits or aborts.
- when the other process commits or aborts, the process will re-fetch the same tuple again.
- if the other backend aborts, the duplicate tuple should fail to be fetched, and therefore no conflicts
- if the other backend commits, the duplicate tuple can still be fetched, and therefore a potential conflict.
- before raising error, PG will do one more checking, which is to fetch the visibility status of the heap tuple that PG is currently inserting. This is to cover a case where another backend is currently doing CREATE UNIQUE INDEX CONCURRENTLY while the current backend is trying to insert or update the data.
- check if the current tuple to be inserted can be fetched from the heap relation
- if yes, then the current tuple is still visible and there is definitely a conflict
- if no, then the current tuple has become invisible and is not considered a conflict
- continue with index tree creation if no conflict is found
This logic above is mostly located in
_bt_check_unique() functions within
4.0 ATTACH Uniqueness Guarantee
The table to be attached can either have unique (or not unique) index already defined or have no index at all. There are 2 potential cases during ATTACH:
- Attaching a table with no index to a partition table, PG will automatically create a new index for the attached table following similar index parameters as the original partition table. But currently. This index creation follows the same procedures defined in
- Attaching a table with a unique index defined using the same index key as partition table’s global unique index, PG will not create a new index for the attached table, and will simply attach the table.
Cary is a Senior Software Developer in HighGo Software Canada with 8 years of industrial experience developing innovative software solutions in C/C++ in the field of smart grid & metering prior to joining HighGo. He holds a bachelor degree in Electrical Engineering from University of British Columnbia (UBC) in Vancouver in 2012 and has extensive hands-on experience in technologies such as: Advanced Networking, Network & Data security, Smart Metering Innovations, deployment management with Docker, Software Engineering Lifecycle, scalability, authentication, cryptography, PostgreSQL & non-relational database, web services, firewalls, embedded systems, RTOS, ARM, PKI, Cisco equipment, functional and Architecture Design.