Highlights from PostgreSQL Conference 2024 in Seattle

Enterprise PostgreSQL Solutions

Comments are off

Highlights from PostgreSQL Conference 2024 in Seattle

Introduction

PostgreSQL Conference was held on November 6th and 7th, 2024 as part of PASS Data Community Summit in Seattle. Bringing together database enthusiasts, developers, and industry professionals from around the world, the event offered an exceptional platform to delve into all things database.

Although nearly a month has passed since I had the opportunity to attend, the information shared during those two days have left a lasting impression. This article serves as a summary of the most talks I participated in. Overall it’s been great!

Understanding Database Statistics – Anita Singh, Raj Jayakrishnan

  • There is an entry in pg_stats per user table created
  • statistics are used for planner to estimate and optimize queries
  • metrics like fraction of nulls, average size, number of distinct values, most common values per column, most common update frequencies per column, correlations…etc are recorded
  • we can use ANALYZE to see these metrics being used in query planning
  • We can alter a particular column’s statistic values using ALTER TABLE SET STATISTICS
  • Extended statistics can be used to correct and optimize planner’s query decisions
  • examples are provided by the speakers to analyze common postgresql cases:

PostgreSQL for Oracle DBA – Thuymy Tran

  • introduced major differences between oracle and postgresql
  • PostgreSQL uses MVCC, has data bloat issue if no vacuum in time, depends on transaction ID to determine visibility, has transaction wraparound issues,
  • Oracle has separate storage for UNDO, physically determine data visibility rather than logical, no special case to clean up table or index bloat.
  • concept of tablespace is different
    • oracle is a logical meaning to categorize tables
    • postgresql means another physical directory for storage
  • postgresql has more index types to select from
  • postgresql has no support for SYNONYM but oracle does
  • postgresql support extensions, but oracle does not
  • talked about index compatibility table
  • number and numeric has different decimal scopes
  • empty string and null are equal in oracle but not in postgresql, in postgreqsl null does not equal null but in oracle they equal
  • postgres uses coalesce() to handle null
  • talked about AWS’s migration solution and workflow and best practices

Collation Surprises: Did Postgres Lost My Data? – Jeremy Shneider

  • data collation can change significantly when running postgresql in different glibc versions (or different operating systems). Centos7 → 9 , ubuntu 18 → 24 for example
  • they change string characters that are used as primary key or created as index in which they are sorted in order
  • this sort order can change when glibs collaction versions change
  • for example, doing an upgrade to another platform or doing replication on standby running different operating systems
  • when sort order change due to collation, the query can return incorrect results because index lookup no longer returns sorted results
  • inserting new data to this system would corrupt the index tree
  • unique constraint may fail to ensure uniqueness
  • random crashes + incorrect results.
  • always run REINDEX when migrate to another platform
  • beware of your glibc upgrade
  • use old glibc if unsure, or use ICU’s

Overcoming Migration Challenges from Oracle to PostgreSQL – Baji Shaik, Sameer Malik

  • lots of reason to migrate to postgresql: powerful vector support via pg_vector, powerful geographic support via postgis, citus for distributed database support + the flexibility provided by extension framework
  • PostgreSQL’s numeric and oracle’s numbers are different. Have different decimal point scopes and the number of digit limits differ
  • introduced best practices to map different numbers based on workflow
  • null and empty string are the same in oracle but not in postgresql
  • null = null in oracle but not in postgrsql
  • introduced how to handle null values under each business cases
  • date and timestamp representations are different in postgresql and oracle, need to handle the different in during migration

Time Travel Queries with Postgres – Harry Pierson

  • long time ago, postgresql supported time travel query, which allows user to query a column data as it appeared some time ago (may be seconds ago, minutes ago or years ago).
  • This feature has been removed long time ago due to significant performance impact to the system. It makes each update write twice as much data and data bloat is even worse
  • DBOS team shared their implementation that remedies the performance and data bloat drawbacks
  • Used logical replication to replicate not only the application data, but also historical data to a standby node. Special code is written such that each data will be appended to the history table on the standby
  • a time travel proxy is implemented to sit in the front of the postgresql database that user connects to: if user wants current data, the proxy forwards it to primary, if the user wants historical data, the proxy forwards it to the standby using special query syntax to select how far in the past we want to query.
  • this is basically like being able to do a query with selective snapshots and no vacuum, which allows user to see historical and removed data if he can freely set current snapshot to the past.

Collation Challenge – Joe Conway

  • introduce classic problems with collation version changes, which cause sort order to change
  • unique constraint may no longer work when sort order changed upon operating system upgrade, has to identified and removed duplicate manually and do a REINDEX when operating system upgraded.
  • introduced a torture test suite, that can be used if the current collation updates causes your sort order to change
  • sort order change also caused FDW foreign queries to fail if foreign and local servers have different collation versions
  • partitions may also have issues and cause a row to be categorized under the wrong partition
  • any data insertion would immediately cause index to go corrupt if its sort order has changed
  • introduced libcompatcollaction library tool that will extract the important collation changes from current machine’s libc library, and create another shared library. With this new shared library you can pre-load this before building postgresql, so you can ensure that the postgresql does not depend on the collation provided by your glibc, but rather controlled by postgresql itself. This would eliminate all the collation problems.

Technical Overview to Unlock Heterogeneous Databases (MySQL, SQLServer…etc) Replication to PostgreSQL – Cary Huang

Last but not least, this is my talk about our approach to achieve a seamless heterogeneous databases replication to PostgreSQL. The open-source project name is SynchDB, which is a PostgreSQL extension built on Java-based Debezium Embedded Engine to achieve logical replication from other database types such as MySQL, SQLServer…etc to PostgreSQL.

With the help of JAVA JNI libraries, we can easily spin up a JVM on a PostgreSQL backend worker. This JVM runs embedded Debezium engine which already has support for logical replication from a list of different database types. It produces a generalized JSON structure that represents change data from different database types. With JNI, the PostgreSQL backend worker can obtain these JSON structure from the JVM, process and transform them to a data structure that PostgreSQL understands and finally apply them.

Traditionally, to achieve heterogeneous logical replication, it requires a lot of setups to make it work, Debezium itself and Oracle GoldenGate are some examples that act as middleware to facilitate heterogeneous replication. But you know, they are complex to set up, expensive to use and try to support multiple source databases and multiple destination databases. This may not sound like a bad thing, but is the main source of complexity… they all try to support everything. This is why we would like a tool / solution that is more tailored to PostgreSQL as a destination database.

There are several challenges with SynchDB discussed in the talk, the largest one being the Java based Debezium Embedded Engine. Working with 2 very different memory management systems itself is a challenge. I also shared SynchDB’s architectural decisions, major functionalities and error handling mechanism. If you are interested in learning more you can visit our github repository here. It is open-source so feel free to use it and contribute if you can.

Summary

The Seattle PostgreSQL Conference was great. There were a lot of talks about data migration and compatibilities problems between Oracle and PostgreSQL which were valuable for both PostgreSQL and Oracle DBAs. The upcoming PostgreSQL conference for 2025 will be in Orlando, US. Check it out here.