The PostgreSQL database writes all changes made to the database to a log file before it writes them to the actual data files. These log files are called the WAL (Write-Ahead Log). Until the changes are flushed to disk, they are kept in memory and served from memory when requested. Writing to the data files directly is a costly operation for several reasons, so it is performed only periodically. Writing to a WAL file, on the other hand, is inexpensive because the log is a sequential record of all changes: records are only appended to it, never searched or updated.
This ensures that in the event of a server failure, otherwise-lost changes can be recovered from the log files. On the other hand, these log files can consume quite a lot of disk space, and re-applying them can become a time-consuming task if there are hundreds of log files to replay before the system reaches a consistent state.
To reduce this recovery time and the number of WAL files needed to reach a consistent state, PostgreSQL has a mechanism called checkpoints.
A checkpoint is a process that ensures that all information that has not yet made it to the data files is flushed to disk at regular intervals, while making sure the database system is not put under so much pressure that overall performance suffers.
A checkpoint also records a location in the log which guarantees that all information before it has been flushed to disk; if the system goes down after that point, the log files prior to it are no longer needed for recovery.
The checkpoint records this location in the log files as well. When the system crashes and is brought back up, it looks for the most recent checkpoint location in the logs and re-applies the log from that location onwards. This reduces the burden of maintaining a large number of log files. Since a checkpoint flushes all outstanding changes to disk, it also adds quite a lot of I/O load, so checkpoint creation is paced by the system, and users can change the relevant settings to suit their requirements.
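As a quick illustration (assuming PostgreSQL 9.6 or later, where the pg_control_checkpoint() function is available), the checkpoint location stored by the server can be inspected from any session:

```sql
-- Show the latest checkpoint's WAL location, its REDO point
-- (where crash recovery would begin re-applying the log),
-- and when the checkpoint was taken.
SELECT checkpoint_lsn, redo_lsn, checkpoint_time
FROM pg_control_checkpoint();
```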
What Does a Checkpoint Accomplish?
When a checkpoint is started, it generally performs the following tasks:
- A WAL record of the checkpoint is written to the log.
- All dirty data pages are written to disk.
- The checkpoint location is written to the pg_control file.
- All data files are synchronized to disk.
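To observe how often these tasks actually run on a given server, the cumulative checkpoint counters can be queried; this is a sketch assuming PostgreSQL 16 or earlier (in PostgreSQL 17 these counters moved to the pg_stat_checkpointer view):

```sql
-- checkpoints_timed:  checkpoints triggered by checkpoint_timeout
-- checkpoints_req:    checkpoints requested otherwise (WAL volume,
--                     the CHECKPOINT command, backups, ...)
-- buffers_checkpoint: dirty buffers written out by checkpoints
SELECT checkpoints_timed, checkpoints_req, buffers_checkpoint
FROM pg_stat_bgwriter;
```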
How Is a Checkpoint Started?
A checkpoint can be triggered in several ways:
- max_wal_size – when the accumulated WAL reaches this configured limit.
- checkpoint_timeout – when this configured time interval expires.
- When a backup is started, e.g. by issuing pg_start_backup.
- When the database server is shut down in fast or immediate mode.
- CHECKPOINT – when this administrative command is issued.
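The backup and CHECKPOINT triggers can be exercised directly from psql. The backup label below is just an illustrative name, and note that pg_start_backup was renamed to pg_backup_start in PostgreSQL 15:

```sql
-- Force an immediate checkpoint (requires superuser privileges).
CHECKPOINT;

-- Starting a base backup also performs a checkpoint first.
SELECT pg_start_backup('example-label');
```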
How Is a Checkpoint Controlled?
There are several configuration parameters that can be used to control checkpoint generation.
max_wal_size – This parameter sets how much WAL can accumulate before a checkpoint is triggered. The default is 1GB; when this limit is reached, the system will generally perform a checkpoint.
checkpoint_timeout – This is the maximum time interval between two checkpoints. When this interval elapses, a checkpoint is made. The default is 5 minutes, and it can be set anywhere between 30 seconds and one day.
checkpoint_completion_target – A checkpoint can put quite a burden on the I/O system, so to avoid flooding it, the disk writes performed during a checkpoint are spread over a period of time, expressed as a fraction of the total time between checkpoints. This fraction is controlled by this parameter. If it is set to 0.5, the checkpoint is expected to complete by the time half of max_wal_size has accumulated or half of checkpoint_timeout has elapsed. The default value is 0.5, and it can be set between 0 and 1.
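For example, with the default 5-minute checkpoint timeout and the fraction set to 0.5, the checkpoint writes are spread over roughly two and a half minutes; the arithmetic can be checked in any psql session:

```sql
-- Half of checkpoint_timeout: the window over which writes are spread.
SELECT interval '5 minutes' * 0.5 AS checkpoint_write_window;
```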
checkpoint_warning – If checkpoints triggered by filling the WAL files occur closer together than this amount of time, a warning is written to the server logs. This tells us that checkpoints are being generated too often and that max_wal_size ought to be increased.
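Putting these parameters together (max_wal_size, checkpoint_timeout, checkpoint_completion_target, and checkpoint_warning), a postgresql.conf fragment for a write-heavy server might look like the following; the specific values are illustrative, not recommendations:

```ini
max_wal_size = 2GB                  # allow more WAL to accumulate before a forced checkpoint
checkpoint_timeout = 15min          # checkpoint at least every 15 minutes
checkpoint_completion_target = 0.5  # spread checkpoint writes over half the interval
checkpoint_warning = 30s            # warn if WAL-triggered checkpoints come closer than this
```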
In general, a checkpoint is generated whenever max_wal_size or checkpoint_timeout is reached, whichever comes first. Increasing these parameters has a direct effect on the disk space used for storing log files and on the time it would take to recover from a crash: the higher the values, the higher the disk usage and the longer it takes to re-apply the logs. Lowering these settings, however, makes checkpoints more frequent and affects system performance. So care must be taken in choosing the values for these parameters.
Asif Rehman is a Senior Software Engineer at HighGo Software. He joined EnterpriseDB, an enterprise PostgreSQL company, in 2005 and started his career in open source development, particularly in PostgreSQL. Asif's contributions range from developing in-house features relating to Oracle compatibility to developing tools around PostgreSQL. He joined HighGo Software in September 2018.