This is the second of the two-part series of blogs about PostgreSQL data corruption. Part-1 discussed identifying the data corruption in PostgreSQL and lists some DO’s and DON’Ts for avoiding that. In this part, I will attempt to walk you through possible ways to recover from the corruption and ways to salvage data in case of unfortunate events. One of the main objectives of this blog is to provide the readers with tips on how corruption can be avoided in the first place. In case corruption has happened, the blog will guide you on how to minimize the impact.
Every data corruption is unique in its own way and there is no one-size-fits-all kind of solution/process that exists for dealing with the corruption. More often than not it is not possible to reliably recover 100% of data once it has got corrupted. So the ideas discussed in this post are more of pointers to the direction to salvage as much data as possible in case of an unfortunate event.
Life is easier if you have backups
The only guaranteed way of recovering the maximum data in case of data corruption is to restore from a recent backup. So always design a disaster recovery strategy where you always have a recent copy of the database available that can be restored or failover if the primary database server becomes corrupted. In case of data corruption on the standby server, discarding and re-building the standby is the best solution.
Always work on a copy
Ok, you do not have a backup and all you have is a corrupted database, The very first thing you must do is to make a copy of the data directory. Because it is very likely that you end up making things worse and cause more damage in the process of trying to recover the data.
So the best approach is to always work on a copy and in case something goes wrong in the process of recovering the corrupted data you always have a backup to start fresh.
Done with hacking: do pg_dump and pg_restore
How will you know that you have taken care of all data corruption in your database? So your goal must be to use the pg_dump to backup the database contents and restore them into a new instance (Created using initdb). The reason being, it’s the best way to verify that corruption recovery was successful and there are no hidden structural or logical errors. Otherwise, there is a good chance that the database that you just repaired, appears okay, but contains issues that will cause you problems in the future.
What if corruption in database can’t let you start the DB?
Let’s admit it. It’s a bad situation. Since you wanted to do a pg_dump to recover data from a corrupted server, the database is so badly corrupted that it refuses to start. Well, few things can be done in this situation.
It depends on the type of corruption, But if the database server is refusing to start, most of the time it’s because of corrupted write-ahead-log (WAL) or control data (pg_control file contents). pg_resetwal utility is very handy for dealing with this kind of corruption and running this tool mostly allows the damaged database to start. Once the database gets up and running you can start with using the pg_dump step to fix/verify the corruption.
No hope? Try pg_filedump
If your database becomes so stubborn, it won’t start even after applying every tool and trying out all possible techniques. The last resort is to use pg_filedump utility. Well, it can’t fix your database but you can recover some of your data if not all by extracting the data from raw database files using this utility.
pg_dump is failing
You have given a lot of beating to your database files, ran a couple of tools and utilities, and finally got it running. But now pg_dump is in no mood to give you a dump. Well, that’s a headache and will test your patience. In that case, all you need to do is identify an object(s) and pinpoint the portion of data causing the issue and take an appropriate action depending upon the object/issue.
Here is a quick cheatsheet for the reference
- If it’s an index. Just go on and drop it or do an index.
- If the pg_dump on the whole database is failing. Do individual tables and schema dumps.
- If it’s failing on some particular object that can be dropped or recreated afterward, consider doing that.
- If it’s failing on a particular TABLE, try to locate the set of problematic rows and filter those out.
Finally pg_restore can fail too
Well, in bits and pieces we have managed to dump our database. We did lose a few rows and had to delete a couple of indexes and objects, but we have successfully extracted 99.9% of our corrupted database server data. We have made a promise to ourselves to do regular and frequent backups in the future. Now, we want to restore the dump we extracted, but add to our bad luck now restore fails. Well, No need to get panic. The most probable cause of this restore failure could be a constraint violation, either a foreign key or unique constraint, or some missing object that can be easily gotten around by a person who knows that schema and business logic.
As I said at the start of the post that each data corruption is unique in itself and sometimes it’s recoverable and other times it’s not. So the best way is to avoid getting into that situation where you need to scavenge the data from the corrupted database and you must always have a ready and latest backup handy to switch to in case the primary gets corrupted.
Muhammad Usama is a database architect / PostgreSQL consultant at HighGo Software and also Pgpool-II core committer. Usama has been involved with database development (PostgreSQL) since 2006, he is the core committer for open source middleware project Pgpool-II and has played a pivotal role in driving and enhancing the product. Prior to coming to open source development, Usama was doing software design and development with the main focus on system-level embedded development. After joining the EnterpriseDB, an Enterprise PostgreSQL’s company in 2006 he started his career in open source development specifically in PostgreSQL and Pgpool-II. He is a major contributor to the Pgpool-II project and has contributed to many performance and high availability related features.