Bulk loading into PostgreSQL: Options and comparison

Enterprise PostgreSQL Solutions



You have a file, possibly a huge CSV, and you want to import its content into your database. There are lots of options for doing this, but how do you decide which one to use? More often than not, the question is how much time the bulk load will take. I found myself asking the same question a few days back when I wanted to design a data ingestion process for PostgreSQL in which we needed to bulk load around 250GB of data from CSV files every 24 hours.

The go-to solution for bulk loading into PostgreSQL is the native COPY command. But one limitation of the COPY command is that it requires the CSV file to be placed on the server. So I decided to do a simple comparison of bulk loading options and techniques.

In short, I wanted to see the performance difference of loading the data into standard vs unlogged tables, and to compare the load time of loading into a table that has an index vs the drop-index->load->recreate-index approach.

Moreover, I wanted to see the performance difference between the COPY command, the client-side \copy command, loading through file_fdw, and pg_bulkload for each of the above options.

Database and system settings

Since the intention was to do a relative performance comparison among different data loading techniques and options, my personal MacBook Pro running macOS Catalina, with 16GB of RAM, a 2.7 GHz Quad-Core Intel Core i7 processor, and a 500GB SSD, was good enough to serve the purpose.

For the database, I compiled PostgreSQL v12 from source code with the default configure options. I left most of the configuration parameters at their default values and only changed the settings below.

shared_buffers = 2GB
work_mem = 400MB
maintenance_work_mem = 640MB

Sample data and schema

For the purpose of this exercise, I downloaded a sample CSV file from http://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/.

The sample CSV file contains 5 million rows and 14 columns, and is 624MB in size.

To keep things simple, I created a sales_record table in PostgreSQL with a one-to-one mapping to the CSV file columns:

CREATE TABLE sales_record (
	region VARCHAR,
	country VARCHAR,
	item_type VARCHAR,
	sales_channel VARCHAR,
	order_priority CHAR,
	order_date DATE,
	order_id INT,
	ship_date DATE,
	unit_sold INT,
	unit_price FLOAT,
	unit_cost FLOAT,
	total_revenue FLOAT,
	total_cost FLOAT,
	total_profit FLOAT
);

Along with that, I also wanted to see the impact of having an index on bulk load performance. So for the tests that require an index, I created a btree index on the country column.

CREATE INDEX country_idx ON sales_record USING btree (country);
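For the unlogged-table runs mentioned earlier, the only schema difference is the UNLOGGED keyword. A minimal sketch (the table name sales_record_unlogged is illustrative, not from the original test):

```sql
-- Unlogged tables skip WAL writes, which speeds up loading, but their
-- contents are truncated after a crash and are not replicated to standbys.
CREATE UNLOGGED TABLE sales_record_unlogged (LIKE sales_record INCLUDING ALL);

-- On PostgreSQL 9.5+ an existing table can also be switched in place:
ALTER TABLE sales_record SET UNLOGGED;  -- rewrites the table
ALTER TABLE sales_record SET LOGGED;    -- switch back; WAL-logs the whole table
```

Switching a table back to LOGGED after the load is a common pattern when the data must ultimately be crash-safe.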

Load using the COPY command

COPY moves data between PostgreSQL tables and standard file-system files. The command comes in two variants: COPY TO copies table contents to a file, while COPY FROM, which we will use here, loads data into the table from a file.

COPY sales_record FROM '/Users/muhammadusama/work/data/5m_Sales_Records.csv' CSV HEADER;

Load using psql ‘\copy’

‘\copy’ is a psql operation that runs an SQL COPY command, but instead of the server reading or writing the specified file, psql (the client) reads or writes the file and routes the data between the server and the local file system. This means that file accessibility and privileges are those of the local user, not the server, and no SQL superuser privileges are required.

\copy sales_record FROM '/Users/muhammadusama/work/data/5m_Sales_Records.csv' csv header;

Through file_fdw

The foreign-data wrapper file_fdw can be used to access data files in the server’s file system, or to execute programs on the server and read their output. We can also use file_fdw to load data from a CSV file into PostgreSQL tables.

-- Create file_fdw extension and foreign server
CREATE EXTENSION file_fdw;
CREATE SERVER file_fdw_server FOREIGN DATA WRAPPER file_fdw;

-- Define the foreign table that points to our CSV file
CREATE FOREIGN TABLE foreign_sales_record (
	region VARCHAR,
	country VARCHAR,
	item_type VARCHAR,
	sales_channel VARCHAR,
	order_priority CHAR,
	order_date DATE,
	order_id INT,
	ship_date DATE,
	unit_sold INT,
	unit_price FLOAT,
	unit_cost FLOAT,
	total_revenue FLOAT,
	total_cost FLOAT,
	total_profit FLOAT) SERVER file_fdw_server
		OPTIONS (
			format 'csv',
			header 'true',
			filename '/Users/muhammadusama/work/data/5m_Sales_Records.csv',
			delimiter ',',
			null '');

-- Copy the data from foreign table to local table
INSERT INTO sales_record SELECT * from foreign_sales_record;

Although file_fdw is not expected to be as fast as the COPY command when it comes to loading the data, it provides a lot of flexibility and options for pre-processing the data before loading.
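For example, since the foreign table can be queried like any other table, filtering and transformation can happen in the same statement that loads the data. A sketch (the WHERE clause and the rounding expression are purely illustrative):

```sql
-- Load only European sales, rounding the unit price during the load
INSERT INTO sales_record
SELECT region, country, item_type, sales_channel, order_priority,
       order_date, order_id, ship_date, unit_sold,
       round(unit_price::numeric, 2), unit_cost,
       total_revenue, total_cost, total_profit
FROM foreign_sales_record
WHERE region = 'Europe';
```

With plain COPY, the same result would require loading into a staging table first and transforming afterwards.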

pg_bulkload tool

pg_bulkload is also a very interesting option when it comes to high-speed data loading. It’s an open-source tool that achieves its performance by skipping the shared buffers and WAL logging.

-- CREATE pg_bulkload extension
$ bin/psql -c "CREATE EXTENSION pg_bulkload" postgres

-- Create control file with appropriate contents
$ more sample_csv.ctl 
OUTPUT = public.sales_record  # [<schema_name>.]table_name
INPUT = /Users/muhammadusama/work/data/5m_Sales_Records.csv  # Input data location (absolute path)

TYPE = CSV           # Input file type
SKIP = 1             # Number of input lines to skip (the CSV header)
QUOTE = "\""         # Quoting character
ESCAPE = \           # Escape character for Quoting
DELIMITER = ","      # Delimiter

-- Execute pg_bulkload utility
$ bin/pg_bulkload -d postgres -h localhost sample_csv.ctl


The chart below shows the time taken by each tool/command to load 5 million rows from the CSV file.


Each method for data loading has its own pros and cons, which may make one the preferred choice over the others for a particular use case. But when it comes to raw performance, pg_bulkload is the clear winner, with COPY and \copy lining up behind, while file_fdw comes in last.

No matter which data loading method we use, loading into an indexed table is always slower, so do consider the drop-index->load->recreate-index approach when you have a huge amount of data to load.
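A minimal sketch of that approach, using the index from these tests:

```sql
-- 1. Drop the index so the load doesn't maintain it row by row
DROP INDEX country_idx;

-- 2. Bulk load the data
COPY sales_record FROM '/Users/muhammadusama/work/data/5m_Sales_Records.csv' CSV HEADER;

-- 3. Rebuild the index once, in a single pass over the table;
--    raising maintenance_work_mem speeds up this step
CREATE INDEX country_idx ON sales_record USING btree (country);
```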

The comparison of all the tools was an apples-to-apples comparison, with both client and server running on the same machine, so \copy had no network overhead. When the PostgreSQL server and client are on different machines, the \copy command may not perform as well as the above results suggest.