How to collect logs for pg_auto_failover issues in Postgres

Products

VMware Tanzu Greenplum

Issue/Introduction

Please refer to this checklist of artifacts to collect if there is a pg_auto_failover issue related to any of the following:

Secondary (replica/standby) going down.
Primary fails over to secondary.
Failed switchover or failover.
Failed secondary is unable to recover and rejoin the formation (cluster).

Assume the following about our environment:

[postgres@autofailover-1 ~]$ pg_autoctl show state
  Name |  Node |           Host:Port |        LSN | Reachable |       Current State |      Assigned State
-------+-------+---------------------+------------+-----------+---------------------+--------------------
node_1 |     1 | autofailover-1:5432 | 0/1E000110 |       yes |             primary |             primary
node_2 |     2 | autofailover-2:5432 | 0/1E000000 |       yes |           secondary |           secondary
node_5 |     5 | autofailover-3:5432 | 0/1E000000 |       yes |           secondary |           secondary

Where:

Postgres PGDATA is /var/lib/pgsql/data.
Monitor PGDATA is /var/lib/pgsql/monitor.

Environment

Product Version: 13.2

Resolution

pg_autoctl run

If pg_autoctl run is used the launch pg_auto_failover services, then the log that you need should be redirected to a file somewhere. The default is to print the log to stdout.

Assuming this command is run, you need to collect the following:

pg_autoctl run --pgdata data --pgport 5432 > autofailover.log 2>&1 &

Run this command to collect the following information:

pg_autoctl run --pgdata data --pgport 5432 > autofailover.log 2>&1 &

autofailover.log for each node involved in the failover failure.
autofailover.log for the monitor.
Postgres log for each node involved in the failover or failure.
pg_autoctl show state: pg_autoctl_state.out
pg_autoctl show events -n 1000: pg_autoctl_events.out
/var/log/messages for each node involved in the failure if a hardware issue is suspected. dmesg -T works as alternative.
pg_autoctl.cfg on all nodes. Default location is /var/lib/pgsql/.config/pg_autoctl/var/lib/pgsql/data (Please examine the file and remove any sensitive information before sharing)
If the issue is reproducible then try starting pg_auto_failover with verbose logging:

pg_autoctl run --pgdata data --pgport 5432 -vvv > autofailover_3.log 2>&1

pg_auto_failover managed by systemd

You can also leverage the systemd service manager to manage pg_auto_failover. Generally, you use this command to create the service:

pg_autoctl -q show systemd --pgdata ~postgres/data > pgautofailover.service

In this case, the log can be found using the journalctl command to filter out the system logs for all services.

To find each node involved in the failover failure and the monitor, use the following command to extract all system messages related to pgautofailover.service. You must be root or have sudo privileges to run journalctl:

[postgres@autofailover-1 ~]$ sudo journalctl -u pgautofailover.service
-- Logs begin at Fri 2021-06-11 21:14:19 UTC, end at Tue 2021-07-13 20:41:55 UTC. --
Jul 13 20:41:25 autofailover-1 systemd[1]: Started pg_auto_failover.
Jul 13 20:41:25 autofailover-1 systemd[1]: Starting pg_auto_failover...
Jul 13 20:41:25 autofailover-1 pg_autoctl[11773]: 20:41:25 11773 INFO  Started pg_autoctl postgres service with pid 11776
Jul 13 20:41:25 autofailover-1 pg_autoctl[11773]: 20:41:25 11773 INFO  Started pg_autoctl node-active service with pid 11777
Jul 13 20:41:25 autofailover-1 pg_autoctl[11773]: 20:41:25 11776 INFO   /usr/bin/pg_autoctl do service postgres --pgdata /var/lib/pgsql/data -v
...
Jul 13 20:41:25 autofailover-1 pg_autoctl[11773]: 20:41:25 11777 INFO  Transition complete: current state is now "catchingup"
Jul 13 20:41:25 autofailover-1 pg_autoctl[11773]: 20:41:25 11777 INFO  Updated the keeper's state from the local PostgreSQL instance, which is running
Jul 13 20:41:25 autofailover-1 pg_autoctl[11773]: 20:41:25 11777 INFO  pg_autoctl managed to ensure current state "catchingup": PostgreSQL is running
Jul 13 20:41:25 autofailover-1 pg_autoctl[11773]: 20:41:25 11776 INFO  Postgres is now serving PGDATA "/var/lib/pgsql/data" on port 5432 with pid 11786
Jul 13 20:41:26 autofailover-1 pg_autoctl[11773]: 20:41:26 11777 INFO  Monitor assigned new state "secondary"
Jul 13 20:41:26 autofailover-1 pg_autoctl[11773]: 20:41:26 11777 INFO  FSM transition from "catchingup" to "secondary": Convinced the monitor that I'm up and running, and eligible for promotion again
Jul 13 20:41:26 autofailover-1 pg_autoctl[11773]: 20:41:26 11777 INFO  Transition complete: current state is now "secondary"

Note: You can filter these messages by time or type. For example:

Filter by using a date timestamp to look at certain dates:

sudo journalctl -u pgautofailover.service --since "2021-07-12 00:30:00" > autofailover1.log

Filter to show only errors:

sudo journalctl -u pgautofailover.service --since "2021-07-12 00:30:00" -p err > autofailover1_error.log