Hash Index Internals

Indexes are key database server features that allow data to be retrieved faster than a complete table scan would (unless the index is vastly bloated). They generally work by maintaining a smaller set of data in a more “searchable” structure than the complete table data.

This data organization comes at the cost of maintaining the indexes, which is a space and processing overhead for the database server. The benefits, however, significantly outweigh these costs.

The most common and the default type of index in PostgreSQL is “btree”. It gets a lot of attention and updates, while some of the other index types, including hash indexes, may feel somewhat neglected. Hash indexes have, however, done a lot of catching up since PostgreSQL version 10.


From the outside, one might expect hash indexes to:

  • Be very fast,
  • Allow unique constraints,
  • Support index-only scans,
  • Allow covering indexes.


Unfortunately, while their performance compared with “btree” indexes is debatable, hash indexes support neither uniqueness constraints, nor covering indexes, nor index-only scans.

To understand why such limitations exist, we need to dig a little deeper into the implementation of the hash indexes. And that is precisely what the next few sections will do.

The Storage

A hash index has at least two buckets. A tuple is placed in a bucket whenever its hash key maps to that bucket’s number. The storage structure therefore combines PostgreSQL’s standard index page layout with the segregation of tuples into buckets.
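As a rough illustration, the hash-to-bucket mapping can be sketched in Python. This is modelled loosely on the backend’s hashkey-to-bucket logic; the function and parameter names here are illustrative, not PostgreSQL symbols. The mask-based fallback shows how a bucket is still picked correctly when the current bucket count is not a power of two.

```python
# Minimal sketch (not PostgreSQL's C code): mapping a 32-bit hash value
# to a bucket number using low-order bits of the hash.

def hashkey_to_bucket(hashval: int, maxbucket: int,
                      highmask: int, lowmask: int) -> int:
    """Pick a bucket from the low-order bits of the hash value.

    If masking with the wider mask lands past the highest bucket that
    currently exists (because the latest doubling is still incomplete),
    fall back to the narrower mask.
    """
    bucket = hashval & highmask
    if bucket > maxbucket:
        bucket = bucket & lowmask
    return bucket

# With 6 buckets (numbers 0..5): highmask = 7, lowmask = 3.
print(hashkey_to_bucket(0b101, maxbucket=5, highmask=7, lowmask=3))  # 5
print(hashkey_to_bucket(0b110, maxbucket=5, highmask=7, lowmask=3))  # 6 > 5, so 6 & 3 = 2
```

Note that a tuple’s bucket can be recomputed cheaply from the hash value alone, which is what makes the incremental bucket-splitting scheme described later workable.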

Types of Pages

Hash indexes have four different types of pages:

  • Meta:
    Like all indexes in PG, hash indexes start with a meta page.
  • Overflow:
    When a bucket receives more tuples than its primary page can hold, overflow pages are added for the additional entries. Overflow pages are never removed; they are either attached to a bucket or free to be assigned to one.
  • Bucket:
    The first page of a bucket, permanently assigned to it. It contains pairs of hash values and TIDs (heap tuple identifiers).

  • Bitmap:
    Bitmap pages maintain the map of free and in-use overflow pages, with 0 indicating that the corresponding overflow page is free and 1 that it is in use.

Allocation of Bucket Pages

The total number of bucket pages is limited to 2^32. These pages are not preallocated, but rather added incrementally in power-of-2 groups, also known as “split-points”. Thus, with every allocation, the number of buckets doubles: starting with 2 buckets, the next allocation adds 2 more, then 4, and so on and so forth.

This leads to exponential growth in the number of allocated buckets. To curtail the memory-allocation impact of the larger groups, allocation is done in 4 phases beyond a preconfigured threshold of split-point group 10. At that point the index already has 2^9 buckets in total, so doubling it requires 2^9 new buckets, which are allocated in 4 separate phases of 2^7 buckets each.
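The allocation schedule above can be sketched in Python. `SINGLE_PHASE_GROUPS` and `buckets_added` are illustrative names, not PostgreSQL symbols, and the initial two buckets (groups 0 and 1) are treated as a given.

```python
# Sketch of the split-point allocation schedule: each group doubles the
# bucket count, and from group 10 onward the doubling happens in 4 phases.

SINGLE_PHASE_GROUPS = 10   # groups below this threshold allocate in one shot

def buckets_added(group: int) -> list[int]:
    """Bucket pages allocated when split-point group `group` (>= 2) is
    reached; the index starts out with 2 buckets (groups 0 and 1)."""
    new = 2 ** (group - 1)          # doubles the 2**(group-1) existing buckets
    if group < SINGLE_PHASE_GROUPS:
        return [new]                # one allocation for small groups
    return [new // 4] * 4           # four equal phases for large groups

print(buckets_added(2))    # [2]
print(buckets_added(3))    # [4]
print(buckets_added(10))   # [128, 128, 128, 128], i.e. 2**9 in 4 phases
```

Phasing the large allocations keeps any single insert from paying for hundreds of new pages at once, while preserving the invariant that a completed group always doubles the bucket count.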

Splitting of Buckets

Since buckets are not preallocated and their number grows as more and more tuples get pushed into the index, the splitting algorithm kicks in whenever an inserter observes that the index has a higher-than-wanted ratio of tuples to buckets. It then attempts to split a bucket into two.

With a split in progress, accurate concurrency checks must be maintained. Each backend keeps a copy of the index meta page in its relcache entry, which avoids unnecessary locking since splitting is a rare event. The number of buckets in the backend’s local copy is checked against the current state: if they are equal, we are good to proceed; if not, the backend updates its local copy before proceeding with the required operation.
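That staleness check can be sketched as follows. `HashMeta` and `get_meta` are illustrative names; a real backend would lock and reread the shared meta page buffer rather than copy a Python object.

```python
# Illustrative sketch of the cached-meta-page check: a backend trusts its
# relcache copy until the bucket count no longer matches the shared state.

class HashMeta:
    def __init__(self, maxbucket: int, highmask: int, lowmask: int):
        self.maxbucket = maxbucket   # highest bucket number currently in use
        self.highmask = highmask
        self.lowmask = lowmask

# Stand-in for the authoritative meta page in shared buffers.
shared_meta = HashMeta(maxbucket=5, highmask=7, lowmask=3)

def get_meta(local_cache: dict) -> HashMeta:
    cached = local_cache.get("meta")
    # A mismatch in bucket count means a split ran since we cached.
    if cached is None or cached.maxbucket != shared_meta.maxbucket:
        local_cache["meta"] = HashMeta(shared_meta.maxbucket,
                                       shared_meta.highmask,
                                       shared_meta.lowmask)
    return local_cache["meta"]

cache = {}
assert get_meta(cache).maxbucket == 5
shared_meta = HashMeta(maxbucket=11, highmask=15, lowmask=7)  # a split ran
assert get_meta(cache).maxbucket == 11                        # cache refreshed
```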

Splitting a bucket and copying a subset of its entries to a new bucket makes things a little complex. To manage this, various flags are used, including “bucket-being-split”, “bucket-being-populated”, “moved-by-split” and “split-cleanup”. The first two deal with the splitting operation itself, the third is a per-tuple flag indicating that the tuple was moved from another bucket, and finally, the cleanup flag indicates that the split is not complete until a cleanup operation runs on the primary bucket.

Scanning a Hash Index

Skipping the complications of locks and pins: in simple terms, a scan must check whether the desired bucket is being split. If so, it takes the buffer content lock on the old bucket in shared mode and also retakes the buffer content lock on the new bucket. It scans both buckets, skipping the “moved-by-split” tuples in the new bucket.

The entire hash page is searched for all matching tuples, and their heap tuple IDs are copied into backend-local storage.
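Putting the last two sections together, the scan path during a split can be sketched as a toy function, ignoring locks and pins entirely. The flag value and tuple layout are illustrative; the key behavior is skipping “moved-by-split” tuples in the new bucket, since the old bucket still holds them until cleanup finishes.

```python
# Toy sketch of scanning a bucket that is mid-split: read the old bucket,
# then the new one, skipping tuples flagged as moved-by-split in the new
# bucket so they are not counted twice.

MOVED_BY_SPLIT = 1 << 0   # illustrative per-tuple flag bit

def scan_bucket_pair(old_bucket, new_bucket, target_hash):
    """Each bucket is a list of (hash, tid, flags) entries; return the
    heap tuple IDs of all entries matching target_hash."""
    tids = []
    for hashval, tid, flags in old_bucket:
        if hashval == target_hash:
            tids.append(tid)
    for hashval, tid, flags in new_bucket:
        if flags & MOVED_BY_SPLIT:
            continue                  # still visible via the old bucket
        if hashval == target_hash:
            tids.append(tid)
    return tids

old = [(7, "t1", 0), (7, "t2", 0)]
new = [(7, "t1", MOVED_BY_SPLIT), (7, "t3", 0)]   # t1 already copied over
print(scan_bucket_pair(old, new, 7))   # ['t1', 't2', 't3']
```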

Still Room for Improvement

The limitations I discussed earlier in this blog owe their presence to a design decision taken more than a decade ago, in the PostgreSQL 8.4 release, when it was decided that the actual column value should be removed from the index in favor of space optimization. Furthermore, the storage and scan behaviors are also tuned to the smaller storage footprint of the hash index. Therefore, adding column values would have ripple effects throughout the storage strategy for hash indexes.

Without the actual column value available alongside the index data, neither covering indexes nor uniqueness checks can work efficiently. As for index-only scans, they are a nonstarter: such a scan precisely aims to avoid heap access, which is impossible without the actual data in the index.

Perhaps reversing that value-removal design decision could change the fortunes of hash indexes and eliminate these limitations.