Overview of PostgreSQL Foreign Data Wrapper (FDW)

Enterprise PostgreSQL Solutions

Comments are off

Overview of PostgreSQL Foreign Data Wrapper (FDW)

Introduction

A Foreign Data Wrapper (FDW) in PostgreSQL is an extension that allows you to access and manipulate data stored in external data sources as if they were tables within your PostgreSQL database. FDWs enable PostgreSQL to integrate with various data storage systems, both relational and non-relational, and present the data in a unified manner within the PostgreSQL environment. The act of splitting a table into multiple smaller tables residing remotely is also called sharding. The foreign database node is sometimes referred as foreign shards or simply shards. In this blog, we will explore what is happening inside postgres_fdw when processing a foreign select query.

A little bit history here:

  • In 2003, a new specification called SQL/MED (SQL Management of External Data) was added to the SQL standard.
  • Developed the SQL standard for remote access to databases.
    In 2011, PostgreSQL version 9.1 was released and began to support this standard read-only operation.
  • In 2013, write support was also added.
  • There are currently many different FDWs available that allow PostgreSQL to connect to a variety of remote data stores (including other RDBMS to flat files, etc.).
  • Most of these FDWs do not have official support from the PostgreSQL Global Development Group (PGDG), and some of these projects are still in beta. Please use with caution!

Some documentation related to FDW:

Set up a Simple FDW Environment

Ensure that postgresql_fdw extension is compiled and installed to your PostgreSQL binary installation path prior to the setup. We will set up a sharding envionment consisting of 1 coordinator node and 4 foreign data nodes.

Initialize all database instances

$ initdb -D c1
$ initdb -D d1
$ initdb -D d2
$ initdb -D d3
$ initdb -D d4

Edit the configuration files of d1 to d4 nodes and change their port numbers to 5433, 5434, 5435, 5436 while keeping c1 node running at default port 5432. The reason for this is that we will run all of them on the same machine for demonstration purposes.

On C1 node:

postgres=# CREATE EXTENSION postgres_fdw;

postgres=# CREATE SERVER s1 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (dbname  	'postgres', host '127.0.0.1', port '5433’);
postgres=# CREATE SERVER s2 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (dbname  	'postgres', host '127.0.0.1', port '5434');
postgres=# CREATE SERVER s3 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (dbname  	'postgres', host '127.0.0.1', port '5435');
postgres=# CREATE SERVER s4 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (dbname  	'postgres', host '127.0.0.1', port '5436');

postgres=# CREATE USER MAPPING for postgres SERVER s1 OPTIONS(user 'postgres’);
postgres=# CREATE USER MAPPING for postgres SERVER s2 OPTIONS(user 'postgres’);
postgres=# CREATE USER MAPPING for postgres SERVER s3 OPTIONS(user 'postgres’);
postgres=# CREATE USER MAPPING for postgres SERVER s4 OPTIONS(user 'postgres’);

postgres=# CREATE TABLE t (a INT, b TEXT) PARTITION BY RANGE;
postgres=# CREATE TABLE t_local PARTITION OF t FOR VALUES FROM (1) TO (1000);

postgres=# CREATE FOREIGN TABLE t_s1 PARTITION OF t FOR VALUES FROM (1001) TO (2000) 	SERVER s1 OPTIONS(schema_name 'public', table_name 't’);
postgres=# CREATE FOREIGN TABLE t_s2 PARTITION OF t FOR VALUES FROM (2001) TO (3000) 	SERVER s1 OPTIONS(schema_name 'public', table_name 't');
postgres=# CREATE FOREIGN TABLE t_s3 PARTITION OF t FOR VALUES FROM (3001) TO (4000) 	SERVER s1 OPTIONS(schema_name 'public', table_name 't');
postgres=# CREATE FOREIGN TABLE t_s4 PARTITION OF t FOR VALUES FROM (4001) TO (5000) 	SERVER s1 OPTIONS(schema_name 'public', table_name 't’);

On all data nodes:

postgres=# CREATE TABLE t (a INT, b TEXT);

There you have it. A simple setup using postgres_fdw. When data is inserted to table t on c1 node, the data could either be stored locally at c1 (for range 1 to 1000) or be distributed to foreign nodes (d1 ~ d4) based on the range values.

Query Data From Foreign Nodes

Postgres_fdw does not simply forward your SELECT queries and send them to foreign nodes to process and wait for responses. The performance of this is extremely low. Just like a regular SELECT query, foreign query also needs careful analysis and planning to ensure performance. A remote query normally would involve the following:

  • Analyzer – Create query tree
    • Create a query tree based on foriegn table definitions in pg_catalog.pg_class and pg_catalog.pg_foreign_table.
  • Connect to foreign server
    • Planner obtains network and user mapping information from pg_catalog.pg_foreign_server and pg_catalog.pg_user_mapping.
    • Connect to the remote server through FDW.
    • Postgres_fdw uses the libpq to connect to a remote PostgreSQL database.
  • Planning
    • Planner uses the EXPLAIN command to create a plan tree.
    • postgres_fdw supports using EXPLAIN to obtain statistics on remote tables, but you can also use local default values.
    • EXPLAIN returns the startup and total cost of the remote server, allowing Planner to accurately evaluate the statement.
    • The use_remote_estimate parameter can be modified through ALTER SERVER (default is off).
  • Deparse
    • The plan tree generated by Planner has obtained the scan path from the remote server through FDW.
    • Postgres_fdw is responsible for parsing the Plan Tree and re-shaping a set of SQL statements suitable for the remote server.
  • Extended Query Protocol + Cursor
    • postgres_fdw uses extended query protocol with the help of cursor to retrieve or fetch from remote nodes.
    • It uses isolation level repeatable read to ensure that the data is consistent during the transaction and will not be altered by another concurrent transaction.

Involved FDW API

Before accessing the remote server to execute the query, Planner calls IsForeignScanParallelSafe(), GetForeignPlan() and GetForeignPaths() to understand the scale of the remote data, and determine whether to make WHERE conditions to be evaulated locally or externally, etc.
When ready, use BeginForeignScan and iterateForeignScan to get one or more rows of data.
Finally, resources are released through EndForeignScan, including allocated memory, open files, connections to external data sources, etc.

  • IsForeignScanParallelSafe()
    • Whether to support Parallel Scan. If supported, multiple worker processes can be started to execute in parallel when performing a SELECT on a large amount of data.
  • GetForeignRelSize()
    • Gets an estimate of the size for a remote table. Planner calls it when before a scan starts.
  • GetForeignPaths()
    • Obtain all possible execution paths for remote table scanning and add each scan path to the scan path list.
  • GetForeignPlan()
    • Creates a ForeignScan plan node from the selected access path at the final stage of query planning.
    • This function is very important and contains the list of targets to be emitted, the WHERE statement to be executed, and who will execute it.
  • BeginForeignScan()
    • Prepare to perform a remote scan. Responsible for the necessary initialization work before the scan starts.
  • IterateForeignScan()
    • Gets a row of data from the remote node and returns it as a tuple slot, returning NULL at the end.
  • EndForeignScan()
    • End the scan and release resources. Release memory allocated by palloc, clean up open files and connections to remote servers.

Summary

This is a quick overview of the inner workings of postgres_fdw covering the very basics of select query. We will continue to discuss other features of FDW in future blog posts.