Distributed Database With PostgreSQL – Atomic Commit Problems

Enterprise PostgreSQL Solutions

Comments are off

Distributed Database With PostgreSQL – Atomic Commit Problems

1. Introduction

If you are into distributed database research, especially one that is setup using Foreign Data Wrapper (FDW) + partitioned foreign tables, you probably have heard that there are many potential issues associated with this setup. Atomic commit, atomic visibility and global deadlock detection are one of the most popular issues that one can encounter in a distributed database environment.

There has been a lot of discussion on the PostgreSQL community on atomic commit and potential ways to solve this problem, and also discuss how this issue introduces an atomic visibility issue.

I recently had a chance to scan through several of the community discussion on atomic commit and would like to share what I have learned in this blog.

2. The problem

In a distributed environment, if a query involves modification to multiple foreign tables (or shards), current PostgreSQL cannot guarantee either all of them commit or all of them rollback. This means that, if one of them fails to commit for some reason but the rest have succeeded, we will see inconsistent results. The desired behavior would be that if one of the foreign tables fails to commit, PostgreSQL should be able to roll back the ones that have been committed successfully.

However, with the current postgres_fdw and the transaction manager, it is currently not possible to “uncommit” a foreign transaction that has been committed. This is why most of the discussions generally regard PG’s “2 Phase Commit” as the basic building block to achieve atomic commit.

2. 2 Phase Commit (2PC)

2PC basically splits a regular commit action into 2 stages, a PREPARE TRANSACTION and a COMMIT PREPARED (or ROLLBACK PREPARED) stage. Doing this allows PG to uncommit an in-progress commit in case it finds a condition that all of the foreign transactions need to be aborted.

For example, 1 Coordinator Node (CN) sends a UPDATE query to 3 Data Nodes (DNs) and would like to commit

  • CN sends a PREPARE TRANSACTION to each DN1 and waits for response
  • CN sends a PREPARE TRANSACTION to each DN2 and waits for response
  • CN sends a PREPARE TRANSACTION to each DN3 and waits for response

If any of these PREPARE TRANSACTION (say DN3) fails to deliver or its response contains error, CN still has a chance to ROLLBACK transactions on DN1 and DN2 and DEALLOCATE the existing prepared transaction. After that, the CN will abort the local transaction and finish the transaction with error and nothing gets changed here.

Only if all 3 DNs have confirmed the PREPARE TRANSACTION success, CN can then proceed to commit the local transaction, if local succeeds, continue to send COMMIT PREPARED to each DN, or send ROLLBACK PREPARED if local fails

  • CN sends a COMMIT/ROLLBACK PREPARED to each DN1 and waits for response
  • CN sends a COMMIT/ROLLBACK PREPARED to each DN2 and waits for response
  • CN sends a COMMIT/ROLLBACK PREPARED to each DN3 and waits for response

If any of these fails to COMMIT PREPARED at this point, say (DN3), CN is not able to uncommit DN1 and DN2 that have been 2-phase committed, so at this point, CN can only retry indefinitely to deliver COMMIT PREPARED to DN3 to ensure atomicity. If DN3 is unreachable, CN will keep on trying forever without closing the transaction and leaving the client waiting.

If client choose to cancel the transaction by hitting ctrl-c, then the transaction will close with a warning message that DN3 has not confirmed a COMMIT PREPARED message and data may be inconsistent. DN3 at this point may have an orphaned prepared transaction that needs to be handled.

3. Orphaned Prepared Transaction

In the previous example, an orphaned prepared transaction can happen if CN prepares a foreign transaction but never has a chance to commit it (possibly due to network). In this case, the DN node can very easily detect an orphaned prepared transaction based on the time the prepared transaction has been created. Unfortunately, without a centralized Global Transaction Manager (GTM), DN node is not able to decide if this orphaned prepared transaction should be committed or discarded. Therefore, most of the solutions on the community only support the detection of orphaned prepared transaction without any handling.

4. How Atomic Visibility Appeared?

Using 2PC allows CN node to rollback a commit during the PREPARE stage but cannot allow rollback during the commit prepared stage, so it can only retry forever or exit with a warning. The visibility issue can happen in a case where DN1 and DN2 have already committed the prepared transactions while DN3 has not yet committed. At this point, if a client does a SELECT statement across DN1, DN2 and DN3, it will see that DN3 still does not have the right data but DN1 and DN2 have. This is not atomic at all.

To solve this, several people proposed a snapshot that can be shared globally across all the DN and CN nodes. The standard PG’s snapshot is purely based on transaction ID to determine if a tuple is visible or not, but with multiple different PG servers, each of them have their own transaction ID range and could all be very different, so sharing a snapshot based on transaction ID is impossible.

This is why a Commit Sequence Number (CSN) snapshot is being proposed in the discussion, which is based on time, and could be made the same in all of the CN and DN nodes and act as a reference point to determine if a tuple belongs to the past or future. The problem is then shifted to how to ensure all of the CN and DN nodes have consistently the same time? Even the best clock chipsets are subject to time skewing and will slowly go out of sync. The concept of Clock-SI is then added to the discussion, which is designed to handle such a clock skew. This makes the discussion somewhat more complicated and out of scope.

But for me, this may not be worth the effort to completely solve the atomic visibility issue with all these complicated stuff.

5. reference to some of the discussions

Transactions involving multiple servers
Propsal for CSN based snapshot
Global snapshot