Avoiding, Identifying, and dealing with PostgreSQL database corruption – Part 1

Of all the things that can happen to a database, corruption is not one that happens frequently, but when it does it can result in major downtime and data loss. It is probably one of the most terrifying things for a DBA to come across, and the best way to deal with it is to never have to deal with it in the first place.

So what is database corruption? 

Database corruption can be defined as a problem associated with the unintended modification of original data or the improper storage of actual data. It refers to errors in the database that occur while reading, writing, storing, or processing data, resulting in unintended changes to the original data.

The symptoms of a corrupted database can range from a WARNING message in a log file to the database server crashing and refusing to start up.

In this blog post, I will try to list some administrative measures and best practices that we can follow to prevent a PostgreSQL database from getting corrupted in the first place, and the subsequent post will talk about the options we have to recover from a corrupted database.

Causes of database corruption

PostgreSQL databases, or for that matter any databases, can get corrupted for a variety of reasons. Some of those are avoidable while others can only be mitigated. The causes of database corruption fall into three main categories: bad hardware/software, misconfigurations, and user errors.

Bad Hardware

Bad hardware is one of the top causes of database corruption. Most commonly the culprit is a bad disk; more rarely it turns out to be bad memory. With a problematic hard disk, the data that we previously wrote is no longer there or comes back different. More often than not this results in a database crash, but it can produce other symptoms as well.

Bad memory is very unusual but not impossible. When it happens, the data gets corrupted even before it is written to the disk, so the corrupted data ends up stored on disk, and it usually produces symptoms similar to those of a corrupted disk.

Bad Software

The second big cause of database corruption is bad software. All software has bugs, and a lot of software needs to work perfectly to produce and store the correct data. It is possible for data corruption to trace back to a software bug: it could be a bug in the PostgreSQL server itself, in the operating system, or in the filesystem. For example, zfs on Linux is known to have data corruption bugs.

Misconfigurations and User errors

Out of all the known causes of data corruption, the most common are user errors and misconfigurations. There are several things that we can do wrong which can result in database corruption, and this category of corruption can easily be avoided by following best practices.

Symptoms of a corrupted database

Corruption by its nature is sneaky: there is no specific error that PostgreSQL throws when a database becomes corrupted. So you need to carefully look out for the signs to detect it. Sometimes data corruption problems are easily identifiable, while other times they can go undetected until they drive us crazy.

Some of the problems that can hint at database corruption are:

Unexpected Query Results

  • A user inserts data into a table and a subsequent SELECT on the table returns an altogether different or invalid result.
  • A user who executes a particular query as part of their regular workflow suddenly notices an unexpected reduction in the number of rows.
  • A query that was working fine now goes into an infinite loop.

These kinds of problems hint at database corruption. Obviously there could be other reasons, but if the investigation of these unexpected results points to a problem with the contents of the actual data stored in the files, rather than to some bug or user action, only then is it database corruption.

Getting deleted or duplicate rows

If you notice that SQL queries have started returning deleted or duplicated rows, or rows from rolled-back transactions, this could be database corruption and needs immediate attention.
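
A quick first check you can run yourself is to look for duplicates in a column that a unique index should make impossible. This is a minimal sketch; mydb, the orders table, and its primary key id are hypothetical stand-ins for your own schema:

    $ psql -d mydb -c "SELECT id, count(*) FROM orders GROUP BY id HAVING count(*) > 1;"

If this returns rows for a column backed by a unique constraint, the index and the table no longer agree, which is a strong corruption signal.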

System crashes attempting to read data

A process crash while trying to read data can be caused by a variety of reasons, including a corrupted database.

Hints of corruption in log messages

Be vigilant about error messages that complain about internal things rather than user-visible things.
For example, below are some error messages that hint at database corruption:

ERROR:  could not read block 274179 in file "base/13642/24643.2": read only 0 of 8192 bytes

ERROR:  could not access status of transaction 3250922107
DETAIL:  Could not open file "pg_xact/0C1C": No such file or directory.

ERROR:  failed to re-find parent key in index "pg_attribute_relid_attnum_index" for split pages 6424/6425

PANIC:  could not locate a valid checkpoint record

ERROR:  found multixact 68834765 from before relminmxid 73262006

ERROR:  cache lookup failed for type 16292881

Best Practices to avoid data corruption

So let’s now discuss how we can avoid or at least quickly detect database corruption.

Backup and Restore

Take regular backups

The first and foremost practice you need to follow is to take regular backups. Taking a backup serves two purposes. First, it gives us a copy of the data that we can use for disaster recovery. Second, a successful backup means that all of your data is readable, which is the first checkpoint in verifying that the database is not corrupted.
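
As a minimal sketch of taking a physical backup with pg_basebackup (the host, user, and target directory below are placeholders, and replication access is assumed to be configured):

    $ pg_basebackup -h db.example.com -U replicator \
          -D /backups/base_$(date +%Y%m%d) -Ft -z -X stream -P

-X stream includes the WAL needed to make the copy consistent, and -P reports progress while the backup runs.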

Test by restoring the backup

A backup that is not restorable is not a backup. So it is important to test each backup by restoring it, starting the server on the restored backup, and verifying the contents. It may not be possible to verify every single byte of the restored database, but you can write or use a script that performs a sanity check of the restored database, as in the sketch below.
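
A minimal sketch of such a round trip using logical backups; mydb and the paths are placeholders:

    # dump, restore into a scratch database, then prove the copy is fully readable
    $ pg_dump -Fc mydb > /backups/mydb.dump
    $ createdb mydb_restore_test
    $ pg_restore -d mydb_restore_test /backups/mydb.dump
    $ pg_dump -Fc mydb_restore_test > /dev/null

The final pg_dump forces every row of the restored copy to be read back; your own sanity script can replace it with row counts or application-level checks.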

Configurations

There are a few database configuration settings that are important for preventing database corruption.

Use fsync=on

fsync ensures that all data is flushed to the disk so that all changed information can be retrieved even if the system crashes or is rebooted. Although you may feel a slight performance impact from keeping fsync=on, turning it off means it is highly likely that you will end up with a corrupted database after an operating system crash.

Use full_page_writes = on

In theory, if the operating system's filesystem cache uses an 8k or larger chunk size, then we don't require full_page_writes to be turned on. But in practice we can end up in situations where a partial page gets written to disk in the event of a system crash or power failure. So it is highly recommended to keep full_page_writes on to guard against data corruption.
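
A quick way to confirm both settings on a running server, as a sketch from the shell:

    $ psql -c "SHOW fsync;"
    $ psql -c "SHOW full_page_writes;"
    # if either had been turned off, switch it back on and reload
    $ psql -c "ALTER SYSTEM SET fsync = on;"
    $ psql -c "SELECT pg_reload_conf();"

Both parameters default to on; the safest configuration is simply to never turn them off.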

Figure out the right value for wal_sync_method

wal_sync_method has multiple options to choose from. The default is not necessarily ideal; it might be necessary to change this setting, or other aspects of your system configuration, in order to create a crash-safe configuration or achieve optimal performance. Refer to https://www.postgresql.org/docs/current/wal-reliability.html for selecting the correct value for your operating system, and also check the pg_test_fsync output to try out the configured value (hint: the correct method will not necessarily be the fastest one).
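
For example, a run against the filesystem that actually holds the WAL (the path below is a placeholder):

    # benchmark the available sync methods, 5 seconds per test
    $ pg_test_fsync -f /pgdata/pg_wal/pg_test_fsync.out -s 5

pg_test_fsync prints timings for open_datasync, fdatasync, fsync, and the other methods; remember that the fastest number is only trustworthy if that method really flushes through the drive's write cache.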

Enable data checksums

You want to detect corruption as early as possible to have the best chance of fixing it and stopping it from spreading. So it is highly recommended to use checksums: when data gets corrupted, the checksum verification fails and you get an error straight away.

To enable checksums you need to run initdb with the -k switch, and from PostgreSQL 12 onwards you can use pg_checksums -e to enable checksums on an existing data directory.
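
A sketch of both paths, with /pgdata as a placeholder data directory (pg_checksums requires the server to be shut down cleanly first):

    # new cluster: enable checksums at initdb time
    $ initdb -k -D /pgdata
    # existing cluster, PostgreSQL 12+: enable offline
    $ pg_ctl -D /pgdata stop
    $ pg_checksums --enable -D /pgdata
    # an offline verification pass can be run the same way
    $ pg_checksums --check -D /pgdata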

Storage

The next area of best practices is choosing a good storage system and filesystem. For example, relying on a single disk and hoping it will never fail is not a wise option, so using RAID is a good practice. Similarly, exercise caution when selecting the filesystem: there are filesystems with known data corruption problems, so avoid using those. Ext4 and xfs on Linux are known to be more reliable than other filesystems. So double-check all your storage and filesystem selections.

Administration

Finally, let's go through some of the DOs and DON'Ts for avoiding data corruption.

Plug-pull test your system

Before going live, plug-pull test your system. Put it under load with something like pgbench, then literally pull the power plug out, so your server loses power and shuts down uncleanly. If your database doesn't come back up fine, you have hardware, OS, or configuration problems. Do this repeatedly while under load, until you're confident it'll come up reliably.
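
A sketch of the load-generation side, with mydb, the scale factor, and the duration as arbitrary placeholders:

    # create pgbench tables at scale factor 50, then run a 10-minute write-heavy load
    $ pgbench -i -s 50 mydb
    $ pgbench -c 10 -j 2 -T 600 mydb
    # ...pull the plug mid-run, power the machine back on, then check crash recovery
    $ pg_ctl -D /pgdata start
    $ psql -d mydb -c "SELECT count(*) FROM pgbench_accounts;"

The server should replay WAL and come up cleanly; any checkpoint or read errors at this stage point to unsafe hardware or configuration.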

Never Modify the data files manually

Everything in the PostgreSQL data directory is absolutely vital and should never be messed with directly. So, except for the configuration files, it is not wise to directly modify any file that is part of the data directory.

Do not delete postmaster.pid

Deleting the postmaster.pid file can allow multiple server instances to run on the same data directory, and this is the quickest recipe for getting a corrupted database.

Do not run Antivirus on the Data directory

Antivirus software can modify or delete files, and it is absolutely not acceptable for anything to delete or modify PostgreSQL data files. So it's best to avoid running antivirus software on the system that is hosting the PostgreSQL database.

Avoid kill -9 on any PostgreSQL process

kill -9 is not graceful and should be the last resort; first use pg_cancel_backend, then pg_terminate_backend, before even thinking about kill -9.
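
The escalation path from the shell, as a sketch where 12345 stands in for the offending backend's PID:

    # find the PID of the backend you need to stop
    $ psql -c "SELECT pid, state, query FROM pg_stat_activity WHERE state <> 'idle';"
    # first, cancel just the running query
    $ psql -c "SELECT pg_cancel_backend(12345);"
    # if that doesn't help, terminate the whole session cleanly
    $ psql -c "SELECT pg_terminate_backend(12345);"

Both functions let the backend clean up after itself, which kill -9 does not; a kill -9 on a backend forces the postmaster to reset all connections and go through crash recovery.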

Conclusion

This post listed some DOs and DON'Ts for avoiding data corruption and discussed the causes and symptoms of database corruption. In the second part, I will discuss the possible ways to recover from corruption and ways to salvage data in case of such unfortunate events.
