pgcheck

Source: https://github.com/plockaby/pgcheck

The pgcheck tool monitors PostgreSQL and is based on Bucardo's check_postgres tool. You define the checks you want in one or more configuration files and pgcheck runs those checks and forwards the results to whatever monitoring system you have. We use this at the University of Washington to monitor all of our servers and send the results to our EventAPI.

Running

This tool is intended to run forever as a daemon. It runs in the foreground, so you might want to control it with something like supervisord or systemd. It requires one argument: --configuration.

The --configuration argument should be a path to the configuration files. It is evaluated as a glob: for example, --configuration=/etc/pgcheck/*.ini loads all ini files in that directory.
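
If you use systemd, a minimal unit might look like this (the install path /usr/local/bin/pgcheck and the configuration directory are assumptions for illustration):

[Unit]
Description=pgcheck PostgreSQL monitor
After=network.target

[Service]
ExecStart=/usr/local/bin/pgcheck --configuration=/etc/pgcheck/*.ini
Restart=always

[Install]
WantedBy=multi-user.target

Note that systemd passes the glob to pgcheck unexpanded, which is what you want; if you run the tool from a shell instead, you may need to quote the glob so the shell doesn't expand it first.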

Configuring

Configuration files are ini style files that look like this:

[backends]
contact = me@example.com
ok_severity = 5
warning_severity = 5
critical_severity = 7
warning = 16
critical = 32

The name of the section should match the name of an action in the check_postgres documentation. Some of the fields below are optional and some are required, depending on the action and on how you want the monitor to work.

  • action
    This is the name of the action in check_postgres that you want to call. If this is not present then the name of the ini file section is used. It’s handy to set this when you want to use the same monitor multiple times under different names, such as when you’re monitoring replication to multiple servers and want to monitor how each server is replicating.
  • contact
    This is passed to the _send_alarm and _clear_alarm functions described below. Nothing else is done with this field.
  • ok_severity
    This severity, if present, is passed to the _clear_alarm function when the monitor says that everything is ok. This is handy if you want to build a monitoring dashboard that says “everything is ok and here is how many free connections (for example) the server currently has available!”
  • warning_severity
    This severity is passed to the _send_alarm function when something is kinda sorta going wrong. This is the first threshold of monitoring for the system. It is kind of a “yellow light”. This is required.
  • critical_severity
    This severity is passed to the _send_alarm function when something is definitely going wrong. This is the second and final threshold of the monitoring system. It is a “red light”. This is required.
  • error_severity
    This severity is used when check_postgres encounters an error. By default it is the same as the critical_severity.
  • unknown_severity
    This severity is used when check_postgres returns an unknown result. By default it is the same as the critical_severity.

Any other fields encountered in the configuration file are prefixed with two hyphens and passed directly to check_postgres. So, for example, you could put warning = 5 into your configuration file and check_postgres would be called with --warning=5.
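
For example, a hypothetical section like this:

[backends]
warning = 5
critical = 10

would result in check_postgres being invoked with something like --action=backends --warning=5 --critical=10.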

Each check is run approximately once per minute and the results are sent to the functions described in the next section so that you can forward that information to your monitoring system.

Modifying

By default this program does nothing when it detects a problem. It has two functions in it called _send_alarm and _clear_alarm. The former will be called whenever an alarm is triggered and the latter whenever an alarm clears itself. Both functions are called regardless of the previous state. That is, if your system continues to fail then _send_alarm will be called over and over and over. If you don’t implement these functions then this tool really does nothing.
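
As a sketch, an implementation might look like the following. The function signatures here are an assumption (the source only tells us that the contact and a severity are passed in), and the event endpoint is hypothetical; adapt both to the actual stubs in the source and to your monitoring system.

import json
import urllib.request

# Hypothetical endpoint for your monitoring system; not part of pgcheck.
EVENT_URL = "https://events.example.com/api/v1/events"

def _post_event(payload):
    # Helper (not part of pgcheck): POST a JSON event to the monitoring system.
    request = urllib.request.Request(
        EVENT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request, timeout=10)

def _send_alarm(contact, severity, message):
    # Called every time a check is in a warning/critical/error state,
    # even if it was already failing on the previous run.
    _post_event({"state": "alarm", "contact": contact,
                 "severity": severity, "message": message})

def _clear_alarm(contact, severity, message):
    # Called every time a check comes back ok (severity comes from ok_severity).
    _post_event({"state": "clear", "contact": contact,
                 "severity": severity, "message": message})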

Common Monitors

Here are some common monitors that I have used.

[connection]
contact = me@example.com
warning_severity = 1
critical_severity = 1

This connection monitor will raise a severity 1 event when it is not able to connect to the database on localhost.

[backends]
contact = me@example.com
ok_severity = 5
warning_severity = 4
critical_severity = 2
warning = 50%
critical = 80%

This backend monitor will raise a severity 4 event when the database running on localhost has 50% of its available connections in use and a severity 2 event when it has 80% of them in use. It will also raise a severity 5 event when fewer than 50% of the connections are in use so that we can see on our dashboard how many connections are available at any given time.

[archive_ready]
contact = me@example.com
warning_severity = 4
critical_severity = 2
warning = 16
critical = 32

This archive monitor will raise a severity 4 event when there are 16 WAL files that haven’t yet been archived by the server’s archive_command. It will raise a severity 2 event when there are 32 WAL files. This is handy for ensuring that your backup system is receiving backups correctly, as WAL files will start to pile up if they can’t be sent off the host.

[hot_standby_delay_secondary]
action = hot_standby_delay
contact = me@example.com
ok_severity = 0
warning_severity = 4
critical_severity = 1
warning = 1
critical = 10 min
host = db01.infra.example.com,db02.infra.example.com

This hot standby delay monitor calls the hot_standby_delay action in check_postgres. It compares our primary (db01.infra.example.com) to our secondary (db02.infra.example.com) and raises a severity 4 event if the secondary is behind the primary at all. It will raise a severity 1 event if the secondary is more than ten minutes behind our primary. The primary and secondary are listed in the host option. If you have multiple replicas you can write this configuration multiple times with different names and simply change the host names in the host option, as shown below.
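
For example, to monitor a second, hypothetical replica (db03.infra.example.com is an assumed name), you could add another section that differs only in its name and hosts:

[hot_standby_delay_tertiary]
action = hot_standby_delay
contact = me@example.com
ok_severity = 0
warning_severity = 4
critical_severity = 1
warning = 1
critical = 10 min
host = db01.infra.example.com,db03.infra.example.com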