Quite a few managers or clients expect DBAs to type a few commands and fix a sluggish PostgreSQL instance. If only things were that simple. Unfortunately, debugging a sluggish instance that was previously running optimally requires analysis at many levels. In this blog, I will attempt to help that debugging process on Linux by identifying tools, tips and tricks that you can use when digging into a performance issue. I thought I could cover all the relevant topics in a single blog, but there is far too much to write, so expect a series instead, with this blog being the first.
Remember, a lot of things can go wrong, and it’s not always the database at fault. Applications, networks, or other factors are at play too. However, our approach will help identify the issue or remove PostgreSQL’s name from the charge sheet.
The Suspects
If we assume that there is an issue with the database server, we must start looking at the possible suspects that might be the troublemakers. Is it the hardware, the operating system, or the PostgreSQL instance? The list of suspects is fairly short:
- Operating System/Hardware
- I/O or Disk
- Memory
- CPU
- Network
- PostgreSQL
- Query
- Locks
- Checkpoint
The Approach
When trying to debug a problem, I tend to reduce the search space first, and PostgreSQL is no different. We’ll approach this problem by first eliminating the suspects that are innocent, and then interrogating the ones that remain.
Fortunately, some very basic and common Linux commands can help us do that. You can immediately identify and rule out some culprits (a consolidated sketch of these checks follows the list):
- Access: issues can easily be identified with the “ping”, “ssh” or “psql” commands;
- CPU Over-Utilization: issues can be seen via the “top” command. Alternatively, the “uptime” command gives you the 1-, 5-, and 15-minute load averages.
9:53 up 1 day, 22:24, 11 users, load averages: 1.32 2.27 3.08
- Memory: usage can be seen via the “free” command.
- Out of Disk Space: can be checked via the “df” command.
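For a quick first pass, a few of these checks can be strung together. The following is only a minimal sketch; the hostname, port and database name are placeholders for your own environment:

# Basic reachability and capacity checks; db.example.com, the port and the database are hypothetical.
ping -c 3 db.example.com                                    # is the host reachable?
psql -h db.example.com -p 5432 -d postgres -c 'SELECT 1;'   # does PostgreSQL answer at all?
uptime                                                      # 1-, 5- and 15-minute load averages
free -h                                                     # memory and swap usage
df -h                                                       # free disk space per filesystem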
If all looks good, the problem must be elsewhere. Let’s jump to the next level now.
Checking Disk Performance
There are rarely any cases in production where the database fits entirely in memory. More often than not, a lot of I/O is done, both reads and writes, particularly in the case of an OLTP setup. On Linux, you can easily install additional tools to debug and analyse the disk load. The following packages may be installed via the “apt” or “yum” commands:
- dstat
- iotop
- sysstat
Depending on your distribution, the periodic statistics collection that sysstat relies on may not be enabled by default when you install the packages; check the documentation for your platform to enable it.
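As a rough sketch, installation and enabling the collection might look like the following; exact package names and the /etc/default/sysstat step vary by distribution and release:

sudo apt install dstat iotop sysstat       # Debian/Ubuntu
sudo yum install dstat iotop sysstat       # RHEL/CentOS

# On Debian-based systems, periodic collection is typically gated by a flag:
sudo sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat
sudo systemctl enable --now sysstat        # start the collection service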
iostat
Although iostat also provides details about CPU usage, it is most useful for understanding disk usage.
iostat -dmx sda 1
Device: rrqm/s wrqm/s  r/s   w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda       0.06   0.89 1.37 18.03  0.04  0.63    70.75     0.04  1.95    0.24    2.08  0.05  0.09
The extended statistics for the disk displayed above provide some valuable insights. While r/s, w/s, rMB/s and wMB/s tell the read/write story, of particular interest are the values in the “avgqu-sz”, wait and “%util” columns.
If the average queue size (“avgqu-sz”) has grown to a significantly high value, it is a clear indication that the storage device is struggling to serve requests. Similarly, the wait columns show the average time in milliseconds that a request spends waiting. Another indication that the storage device is overloaded is the “%util” value, which shows the percentage of time the device was busy doing at least one thing. This works well for magnetic drives; however, it does not accurately reflect saturation for storage that can handle many requests in parallel, such as SSDs and RAID arrays. It is still a valuable indicator of whether the storage device has reached, or is nearing, saturation.
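To focus iostat on the device that actually backs your database, one approach is to resolve the block device from the data directory. This is only a sketch and assumes a default PGDATA location:

PGDATA=/var/lib/pgsql/data                                      # assumed path; adjust to your setup
DEV=$(basename "$(df --output=source "$PGDATA" | tail -n 1)")   # e.g. sda1
iostat -dmx "$DEV" 5                                            # extended stats every 5 seconds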
sar
sar, short for System Activity Report, is a very useful tool. It is part of the sysstat package. It has tons of options that you can use to monitor resource load. Of particular interest, however, is the option to analyze a log generated by the sysstat package (or to see the load for a specific time period by specifying start and end times).
With logging enabled, from a database perspective, it gives you the power to understand how the system was behaving at a particular time by evaluating a log file. Say the system was misbehaving at night, from 2 am to 2:30 am on the 3rd of February.
$ sar -f /var/log/sa/sa03 -b -s 02:00:00 -e 02:30:00 -r -S
Linux 3.10.0-1160.11.1.el7.x86_64 (dr_x)    02/03/2021    _x86_64_    (32 CPU)

02:00:02 AM       tps      rtps      wtps   bread/s   bwrtn/s
02:10:01 AM      1.30      0.11      1.19      0.55     16.93
02:20:01 AM      1.09      0.06      1.04      0.06     14.13
Average:         1.20      0.08      1.11      0.30     15.53

02:00:02 AM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
02:10:01 AM 713981872  78383572      9.89      1084  30076480  33264684      4.18  17525664  22328076        32
02:20:01 AM 713978052  78387392      9.89      1084  30077248  33260784      4.18  17529048  22328088        36
Average:    713979962  78385482      9.89      1084  30076864  33262734      4.18  17527356  22328082        34

02:00:02 AM kbswpfree kbswpused  %swpused  kbswpcad   %swpcad
02:10:01 AM   4194300         0      0.00         0      0.00
02:20:01 AM   4194300         0      0.00         0      0.00
Average:      4194300         0      0.00         0      0.00
You can use sar to see the usage statistics for RAM, disk, CPUs, SWAP and network devices.
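Besides reading the collected logs, sar can also sample live. A few invocations I find handy are shown below; the 5-second interval and count of 3 samples are arbitrary choices:

sar -u 5 3        # CPU utilization
sar -r 5 3        # memory utilization
sar -S 5 3        # swap utilization
sar -d 5 3        # block device activity
sar -n DEV 5 3    # per-interface network statistics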
dstat
This is the best-looking tool of the lot, as you can see from the output:
$ dstat -cdngy
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  1   1  99   0   0   0|  43k  640k|   0     0 |   0     0 |1830  3988
  0   0 100   0   0   0|   0     0 |   0     0 |   0     0 | 720  2531
  0   0 100   0   0   0|   0   188k| 354B 1152B|   0     0 | 799  2120
  0   0 100   0   0   0|1024B   15k| 140B  338B|   0     0 | 799  2761
  0   0 100   0   0   0|   0  8192B|  70B  362B|   0     0 | 683  1974
  0   0 100   0   0   0|   0     0 |  70B  370B|   0     0 | 744  2069
  0   0 100   0   0   0|   0     0 |  70B  362B|   0     0 | 747  2669
I won’t dig more deeply into this tool here, but feel free to experiment with it.
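As a starting point for that experimentation, dstat ships plugins that show which processes generate the most I/O. A couple of examples follow; availability of these options may vary with your dstat version:

dstat --top-io --top-bio 5      # processes issuing the most I/O requests and block I/O
dstat -cdngy --disk-util 5      # default view plus per-disk utilization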
Understanding RAM
Generally, free is enough to tell us about memory usage. However, /proc/meminfo offers a little more detail. There are some subtle details that are good to know and can help in debugging a performance issue.
Here is a simple script that prints out useful memory information:
#!/bin/bash
grep -A3 "MemTotal" /proc/meminfo
grep "Swap" /proc/meminfo
grep -A1 "Dirty\|Active" /proc/meminfo
MemTotal:       792365444 kB
MemFree:        711360772 kB
MemAvailable:   743213224 kB
Buffers:             1084 kB
SwapCached:             0 kB
SwapTotal:        4194300 kB
SwapFree:         4194300 kB
Active:          18503648 kB
Inactive:        23735296 kB
Active(anon):    10553220 kB
Inactive(anon):    534376 kB
Active(file):     7950428 kB
Inactive(file):  23200920 kB
--
Dirty:                 56 kB
Writeback:              0 kB
Some of the key values to note are “Dirty”, “Writeback”, “Active(file)”, “Inactive(file)”, “Active(anon)”, and “Inactive(anon)”. “Dirty” shows the amount of memory waiting to be written to disk, whereas “Writeback” shows the memory actively being written to disk. “Active(file)” and “Inactive(file)” represent the OS file cache; you want these to be as high as possible, as that leads to reduced I/O and thereby significantly improves performance.
Another key set of values to note is the “anon” memory. In a PostgreSQL context, this will increase as the number of client connections increases. So the higher “Active(anon)” and “Inactive(anon)” are, the less memory is left for the OS to cache files. Therefore, keep a close watch on these, or alternatively, use a connection pooler.
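A simple way to keep that watch, sketched below, is to sample the relevant /proc/meminfo counters periodically; steadily growing “anon” lines alongside a growing connection count suggest backends are eating into the memory available for the file cache:

watch -n 10 'grep -E "Active|Inactive|Dirty|Writeback" /proc/meminfo'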
If you have enabled huge pages and configured them for PostgreSQL, you might want to add a grep statement for “HugePages” as well. Do remember to disable Transparent Huge Pages, though, as they create a memory allocation overhead that slows down PostgreSQL.
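A quick sketch of those checks follows; the sysfs paths are standard on most modern kernels, and note that the last command only disables THP until the next reboot:

grep -i "^HugePages" /proc/meminfo                                   # explicit huge pages, if configured
cat /sys/kernel/mm/transparent_hugepage/enabled                      # shows [always], [madvise] or [never]
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled    # disable THP until reboot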
For more details about /proc/meminfo, refer to the kernel’s proc filesystem documentation.
Conclusion
Although this blog has focused on disk and RAM issues, the tools mentioned here can also be used to evaluate CPU and network utilization. Performance degradation in any of these resources significantly impacts TPS, and the information provided here should enable you to debug capacity or configuration problems with the hardware or the OS.
The next part of this series will cover specific extensions and changes to the PostgreSQL configuration that help with troubleshooting from within PostgreSQL. Eventually, we’ll go through optimization and tuning of a PostgreSQL cluster.
Hamid has more than 19 years of professional software development experience and has been involved with PostgreSQL for more than 8 years now. He was part of EnterpriseDB and managed the official PostgreSQL installers for Windows, Linux and macOS. Hamid has recently joined HighGo Software Inc. as Senior DevOps and Database Architect.