Getting started with pg_bulkload

Enterprise PostgreSQL Solutions


pg_bulkload is a high-speed data loading tool for PostgreSQL, an alternative to the COPY command. Its biggest advantage is speed: in direct mode, pg_bulkload bypasses the shared buffers and WAL buffers and writes the data files directly. It also provides a recovery function to restore the database in case an import fails.
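A caveat follows directly from the WAL bypass: if the server crashes in the middle of a direct-mode load, ordinary crash recovery cannot replay it, so pg_bulkload ships its own recovery mode. A minimal sketch, assuming the server is stopped and $PGDATA points at the cluster's data directory:

$ pg_bulkload -r -D $PGDATA

Run this before restarting PostgreSQL after such a crash.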

GitHub: https://github.com/ossc-db/pg_bulkload

pg_bulkload mainly consists of two modules: a reader and a writer. The reader is responsible for reading the input file and parsing tuples, and the writer is responsible for writing the parsed tuples to the target table.

Environment

  • PostgreSQL version: 12.3 stable
  • pg_bulkload version: 3.1.17
  • dbname: bldemo, user: postgres

Installation

When compiling in the test environment, we found that the following packages need to be installed in advance:
1. gcc
2. openssl-devel
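On a RHEL-family system (an assumption; the sessions below run as root on a localhost host), they can be installed with yum:

# yum install -y gcc openssl-devel

The PostgreSQL server development environment (pg_config and the PGXS makefiles) must also be on the PATH, since the build below uses USE_PGXS=1.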
With those in place, pg_bulkload can be built and installed normally:

$ cd pg_bulkload
$ make USE_PGXS=1
$ su
# make USE_PGXS=1 install
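To confirm the installation, check that the command-line tool and the shared library are in place (the library location is looked up via pg_config here, since the actual path depends on your installation prefix):

$ which pg_bulkload
$ ls $(pg_config --pkglibdir)/pg_bulkload.so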

Then there are two ways to register the function in the database:

1. Execute the SQL file

[root@localhost pg_bulkload-3.1.17]# find / -name pg_bulkload.sql
find: /run/user/1000/gvfs: Permission denied
/root/pg_bulkload-3.1.17/lib/pg_bulkload.sql
/opt/HighGo4.5.6-see/share/postgresql/extension/pg_bulkload.sql
# Here the two pg_bulkload.sql files are identical

The contents of the pg_bulkload.sql are as follows:

/*
 * pg_bulkload.sql
 *
 *    Copyright (c) 2007-2021, NIPPON TELEGRAPH AND TELEPHONE CORPORATION
 */

-- Adjust this setting to control where the objects get created.
BEGIN;

CREATE SCHEMA pgbulkload;

CREATE FUNCTION pgbulkload.pg_bulkload(
        IN options text[],
        OUT skip bigint,
        OUT count bigint,
        OUT parse_errors bigint,
        OUT duplicate_new bigint,
        OUT duplicate_old bigint,
        OUT system_time float8,
        OUT user_time float8,
        OUT duration float8
)
AS '$libdir/pg_bulkload', 'pg_bulkload' LANGUAGE C VOLATILE STRICT;

COMMIT;

If an ordinary user executes this file directly, an error is reported:

[root@localhost pg_bulkload-3.1.17]# psql -U bulkload -d bldemo -f /opt/postgres/share/postgresql/extension/pg_bulkload.sql

BEGIN
CREATE SCHEMA
psql:/opt/postgres/share/postgresql/extension/pg_bulkload.sql:23: ERROR:  permission denied for language c
ROLLBACK

When the above problem occurs, you need to connect as an administrator and execute 'update pg_language set lanpltrusted = true where oid = 13;' in the database (13 is the OID of the built-in C language).
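Rather than hard-coding the OID, it is safer to look the language up by name; a sketch, run as a superuser:

-- confirm the OID and current trust setting of the C language
SELECT oid, lanname, lanpltrusted FROM pg_language WHERE lanname = 'c';
-- allow non-superusers to create functions in language c
UPDATE pg_language SET lanpltrusted = true WHERE lanname = 'c';

Bear in mind this relaxes a deliberate safety check; you may want to set lanpltrusted back to false once the function is created.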

Then you can create the function normally:

[root@localhost pg_bulkload-3.1.17]# psql -U postgres -d bldemo -f /opt/HighGo4.5.6-see/share/postgresql/extension/pg_bulkload.sql


BEGIN
CREATE SCHEMA
CREATE FUNCTION
COMMIT
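You can verify that the function was registered with psql's \df meta-command:

bldemo=# \df pgbulkload.pg_bulkload

It should list the function along with the OUT parameters declared in the SQL file above.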

2. Create the extension in the database

bldemo=# create extension pg_bulkload;
CREATE EXTENSION
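This route requires the files installed by make install to be present under the server's extension directory. You can confirm the result with \dx:

bldemo=# \dx pg_bulkload

which lists the installed extension together with its version and schema.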

Import data

In the reference document, the test data was generated with:

seq 100000 | awk '{print $0"|lottu"}' > tbl_lottu_output.txt

Here we generate it with 'dstat --output dstat.csv' instead.
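Note that dstat's CSV output begins with several header lines (tool version, host information, and the column names). The load below reports no parse errors, so those lines were evidently stripped beforehand; a sketch, assuming they sit at the top of the file (inspect first, since the exact count varies by dstat version):

$ head dstat.csv           # see how many header lines dstat wrote
$ sed -i '1,7d' dstat.csv  # drop them; adjust the range to your file

Alternatively, pg_bulkload's SKIP option can skip a fixed number of leading input rows.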

The SQL statement for creating the table (the columns match dstat's default output; in, out, and int collide with SQL keywords, so they must be double-quoted):

create table dstat(
    usr numeric, sys numeric, idl numeric, wai numeric,
    hiq numeric, siq numeric, read numeric, writ numeric,
    recv numeric, send numeric, "in" numeric, "out" numeric,
    "int" numeric, csw numeric
);

1. Import data directly with command-line options:

[root@localhost ~]# pg_bulkload -i dstat.csv -O dstat -l dstat.log -P dstat_bad.txt -o "TYPE=CSV" -o "DELIMITER=," -d bldemo -U postgres

NOTICE: BULK LOAD START
NOTICE: BULK LOAD END
        0 Rows skipped.
        1522 Rows successfully loaded.
        0 Rows not loaded due to parse errors.
        0 Rows not loaded due to duplicate errors.
        0 Rows replaced with new rows.

To clean up the table data before importing, add the TRUNCATE=YES option. In the session below we first append a second copy of the data with \copy (a server-side COPY fails for an unprivileged user, as shown), bringing the count to 3044; reloading with TRUNCATE=YES then empties the table first, leaving a single copy of 1522 rows:

bldemo=> select count(*) from dstat ;
 count
-------
  1522
(1 row)

bldemo=> copy dstat from '/root/dstat.csv' with csv header;
ERROR:  must be sysdba or a member of the pg_read_server_files role to COPY from a file
HINT:  Anyone can COPY to stdout or from stdin. psql's \copy command also works for anyone.
bldemo=> \copy dstat from '/root/dstat.csv' with csv
COPY 1522
bldemo=> select count(*) from dstat ;
 count
-------
  3044
(1 row)
[root@localhost ~]# pg_bulkload -i dstat.csv -O dstat -l dstat.log -P dstat_bad.txt -o "TYPE=CSV" -o "DELIMITER=," -o "TRUNCATE=YES" -d bldemo -U postgres

NOTICE: BULK LOAD START
NOTICE: BULK LOAD END
        0 Rows skipped.
        1522 Rows successfully loaded.
        0 Rows not loaded due to parse errors.
        0 Rows not loaded due to duplicate errors.
        0 Rows replaced with new rows.
bldemo=> select count(*) from dstat ;
 count
-------
  1522
(1 row)
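pg_bulkload loads in direct mode by default, with the WAL-bypass caveats described at the start. If you prefer the load to go through shared buffers and be WAL-logged (so that, for example, streaming replicas also receive the rows), the same load can be run with the documented WRITER=BUFFERED option; a sketch based on the command above:

[root@localhost ~]# pg_bulkload -i dstat.csv -O dstat -l dstat.log -P dstat_bad.txt -o "TYPE=CSV" -o "DELIMITER=," -o "WRITER=BUFFERED" -o "TRUNCATE=YES" -d bldemo -U postgres

Buffered mode is slower than direct mode, but trades that speed for normal crash safety and replication behavior.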

2. Import data using a control file

Create a new control file, /root/dstat.ctl:

INPUT = /root/dstat.csv                # input data file
PARSE_BADFILE = /root/dstat_bad.txt    # records that failed to parse go here
LOGFILE = /root/dstat.log              # load result log
LIMIT = INFINITE                       # load all rows
PARSE_ERRORS = 0                       # abort on the first parse error
CHECK_CONSTRAINTS = NO                 # do not check CHECK constraints while loading
TYPE = CSV                             # input file type
SKIP = 0                               # number of leading input rows to skip
DELIMITER = ,                          # field delimiter
QUOTE = "\""                           # quoting character
ESCAPE = "\""                          # escape character for quoting
OUTPUT = public.dstat                  # target table
MULTI_PROCESS = NO                     # do not read and write in parallel
VERBOSE = YES                          # log extra detail
WRITER = DIRECT                        # direct mode: bypass shared buffers and WAL
DUPLICATE_BADFILE = /opt/postgres/data/pg_bulkload/20210414110538_bldemo_public_dstat.dup.csv  # rows removed as duplicates
DUPLICATE_ERRORS = 0                   # abort on the first unique-constraint violation
ON_DUPLICATE_KEEP = NEW                # keep the new row when a duplicate is found
TRUNCATE = NO                          # do not truncate the target table first

Import data. Note that this control file sets TRUNCATE = NO, so the newly loaded rows are appended to the 1522 already in the table:

bldemo=> select count(*) from dstat ;
 count
-------
  1522
(1 row)
[root@localhost ~]# pg_bulkload /root/dstat.ctl -d bldemo -U postgres

NOTICE: BULK LOAD START
NOTICE: BULK LOAD END
        0 Rows skipped.
        1522 Rows successfully loaded.
        0 Rows not loaded due to parse errors.
        0 Rows not loaded due to duplicate errors.
        0 Rows replaced with new rows.
bldemo=> select count(*) from dstat ;
 count
-------
  3044
(1 row)

Epilogue

This walkthrough follows the documentation shipped with pg_bulkload. Now we've learned how to use pg_bulkload, both with command-line options and with a control file. Enjoy it!