In this chapter, you will find recipes for some common monitoring and diagnosis actions you will want to do inside your database. They are meant to answer specific questions that you often face when using PostgreSQL.
Databases are not isolated entities. They live on computer hardware using CPUs, RAM, and disk subsystems. Users access databases using networks. Depending on the setup, databases themselves may need network resources to function in any of the following ways: performing some authentication checks when users log in, using disks that are mounted over the network (not generally recommended), or making remote function calls to other databases.
This means that monitoring only the database is not enough. At a minimum, you should also monitor everything directly involved in using the database, which means knowing the following:
- Is the database host available? Does it accept connections?
- How much of the network bandwidth is in use? Have there been network interruptions and dropped connections?
- Is there enough RAM available for the most common tasks? How much of it is left?
- Is there enough disk space available? When will you run out of disk space?
- Is the disk subsystem keeping up? How much more load can it take?
- Can the CPU keep up with the load? How many spare idle cycles do the CPUs have?
- Are the other network services that the database depends on (if any) available? For example, if you use Kerberos for authentication, you need to monitor it as well.
- How many context switches are happening when the database is running?
For most of these things, you are interested in the history; that is, how have things evolved? Was everything mostly the same yesterday or last week? When did the disk usage start changing rapidly?

For any larger installation, you probably have something already in place to monitor the health of your hosts and network.
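The first check in the list above — whether the database host is available and accepts connections — can be sketched as a simple TCP probe. This is a minimal illustration, not a monitoring tool; the hostname is a placeholder, and 5432 is PostgreSQL's default port:

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False

# "db.example.com" is a hypothetical host used purely for illustration.
if can_connect("db.example.com", 5432):
    print("database host accepts TCP connections on 5432")
else:
    print("database host unreachable on port 5432")
```

Note that an open TCP port only shows that something is listening; to verify that the PostgreSQL server itself is accepting connections, use the `pg_isready` utility shipped with PostgreSQL, or have your monitoring plugin open a real database session.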
The two aspects of monitoring are collecting historical data to see how things have evolved, and getting alerts when things go seriously wrong. Tools based on the Round-Robin Database tool (RRDtool), such as Munin, or on time series databases, such as Graphite's Carbon and Prometheus, are quite popular for collecting historical information on all aspects of the servers; Grafana is a popular tool for presenting this information in an easy-to-follow graphical form. Seeing several statistics on the same timescale can really help when you are trying to figure out why the system is behaving the way it is.
The other aspect of monitoring is getting alerts when something goes really wrong and needs (immediate) attention.
For alerting, one of the most widely used tools is Icinga (a fork of Nagios), an established solution. The aforementioned trending tools can integrate with it. check_postgres is a rather popular Icinga plugin for monitoring many standard aspects of a PostgreSQL database server.
Icinga is a stable and mature solution, based on the long-standing approach in which each plugin decides whether a given measurement is cause for alarm; this makes it more complex to manage and maintain. A more recent tool is the aforementioned Prometheus, whose design separates data collection from the centralized alerting logic.
Should you need a solution for both the alerting and trending aspects of a monitoring tool, you might also want to look into Zabbix.
Furthermore, if you require integration with a system based on Simple Network Management Protocol (SNMP), the pgsnmpd project offers some basic support. However, as it is not actively maintained, we recommend switching to one of the other monitoring approaches if at all possible.
In this chapter, we will cover the following recipes:
- Providing PostgreSQL information to monitoring tools
- Real-time viewing using pgAdmin or OmniDB
- Checking whether a user is connected
- Checking whether a computer is connected
- Repeatedly executing a query in psql
- Checking which queries are running
- Checking which queries are active or blocked
- Knowing who is blocking a query
- Killing a specific session
- Detecting an in-doubt prepared transaction
- Knowing whether anybody is using a specific table
- Knowing when a table was last used
- Usage of disk space by temporary data
- Understanding why queries slow down
- Investigating and reporting a bug
- Producing a daily summary of log file errors
- Analyzing the real-time performance of your queries