pg_auto_failover: How to re-initialize an existing node

Article ID: 296390

Products

VMware Tanzu Greenplum

Issue/Introduction

If a node fails and cannot be recovered, you may need to re-create it. This KB describes the steps to do so. Let's use the following cluster as an example:
[postgres@postgres1-m ~]$ pg_autoctl show state
  Name |  Node |        Host:Port |       LSN | Reachable |       Current State |      Assigned State
-------+-------+------------------+-----------+-----------+---------------------+--------------------
node_1 |     1 | postgres1-1:5432 | 0/9000110 |       yes |             primary |             primary
node_2 |     2 | postgres1-2:5432 | 0/9000000 |       yes |           secondary |           secondary
node_3 |     3 | postgres1-3:5432 | 0/7000188 |        no |          catchingup |          catchingup

Ideally, before recreating anything, use the following to check why replication is not working and whether the node can be brought back up without having to recreate node_3 (a quick diagnostic sketch follows the list):
 
  • journalctl | grep pgautofailover (If you use systemd)
  • $PGDATA/logs/postgres<date>.log
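
For example, a quick diagnostic pass might look like the following. This is only a sketch: it assumes a systemd install with the unit name pgautofailover and the log path shown above, and <monitor PGDATA> is a placeholder for the monitor's data directory; adjust to your environment.

# On the failing node: keeper and Postgres logs
journalctl -u pgautofailover --since "1 hour ago"
tail -n 100 $PGDATA/logs/postgres<date>.log

# On the monitor: recent state transitions recorded for the formation
pg_autoctl show events --pgdata <monitor PGDATA>

# On the primary: is the standby still connected and replicating?
psql -c "SELECT application_name, state, sent_lsn, replay_lsn FROM pg_stat_replication;"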

If, for whatever reason, there is no way to recover, then you can try to recreate the node using this procedure.

Environment

Product Version: 12.5

Resolution

You can drop the node either from the monitor or from the Postgres node itself. Dropping from the monitor applies when the Postgres host is not accessible but the monitor still has a record of it.
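
As a sanity check before dropping anything, you can confirm what the monitor still has on record from any host that can reach it. A minimal sketch, reusing the monitor URI from this example:

pg_autoctl show state --monitor 'postgres://autoctl_node@postgres1-m/pg_auto_failover?sslmode=require'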


Drop from the monitor

On the monitor:
[postgres@postgres1-m ~]$ pg_autoctl drop node --hostname postgres1-3 --pgport 5432
[postgres@postgres1-m ~]$ pg_autoctl show state
  Name |  Node |        Host:Port |       LSN | Reachable |       Current State |      Assigned State
-------+-------+------------------+-----------+-----------+---------------------+--------------------
node_1 |     1 | postgres1-1:5432 | 0/90001C0 |       yes |             primary |             primary
node_2 |     2 | postgres1-2:5432 | 0/90001C0 |       yes |           secondary |           secondary

On node3 (postgres1-3):

1. Stop the pg_auto_failover daemon if it's still running:
[postgres@postgres1-3 ~]$ sudo systemctl stop pgautofailover
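
To double-check that nothing is left running before moving the data directory, a quick verification sketch (the unit name pgautofailover is taken from this setup; adjust if yours differs):

sudo systemctl status pgautofailover
ps -ef | grep pg_autoctl | grep -v grep     # should return nothing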

2. Move the old PGDATA aside. Don't delete it, in case a root cause analysis (RCA) is needed later:
[postgres@postgres1-3 ~]$ echo $PGDATA
/home/postgres/ha

[postgres@postgres1-3 ~]$ mv ha ha_2

If you try to create the node right away, it will fail because some state files from the previous registration still need to be cleaned up:
[postgres@postgres1-3 ~]$ pg_autoctl create postgres --pgdata ha --auth trust --ssl-self-signed --username ha-admin --dbname appdb --hostname postgres1-3 --pgctl /usr/pgsql-12/bin/pg_ctl --monitor 'postgres://autoctl_node@postgres1-m/pg_auto_failover?sslmode=require'
22:47:50 23493 INFO  Using default --ssl-mode "require"
22:47:50 23493 INFO  Using --ssl-self-signed: pg_autoctl will create self-signed certificates, allowing for encrypted network traffic
22:47:50 23493 WARN  Self-signed certificates provide protection against eavesdropping; this setup does NOT protect against Man-In-The-Middle attacks nor Impersonation attacks.
22:47:50 23493 WARN  See https://www.postgresql.org/docs/current/libpq-ssl.html for details
22:47:50 23493 INFO  Started pg_autoctl postgres service with pid 23496
22:47:50 23496 INFO   /usr/pgsql-12/bin/pg_autoctl do service postgres --pgdata ha -v
22:47:50 23493 INFO  Started pg_autoctl node-init service with pid 23497
22:47:50 23497 FATAL The state file "/home/postgres/.local/share/pg_autoctl/home/postgres/ha/pg_autoctl.state" exists and there's no init in progress
22:47:50 23497 INFO  HINT: use `pg_autoctl run` to start the service.
22:47:51 23493 WARN  pg_autoctl service node-init exited with exit status 0
22:47:51 23496 INFO  Postgres controller service received signal SIGTERM, terminating
22:47:51 23493 INFO  Stop pg_autoctl
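
The paths in the error above point at pg_autoctl's per-node state and configuration files, which live outside PGDATA (under ~/.local/share/pg_autoctl and ~/.config/pg_autoctl, keyed by the PGDATA path). A hedged sketch for locating them, assuming your pg_autoctl version supports the show file sub-command:

pg_autoctl show file --all --pgdata ha
ls -l ~/.local/share/pg_autoctl/home/postgres/ha/
ls -l ~/.config/pg_autoctl/home/postgres/ha/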

3. Run pg_autoctl drop node --destroy to clean up the .local directory:
[postgres@postgres1-3 ~]$ pg_autoctl drop node --destroy

4. Creating the node will now succeed:
[postgres@postgres1-3 ~]$ pg_autoctl create postgres --pgdata ha --auth trust --ssl-self-signed --username ha-admin --dbname appdb --hostname postgres1-3 --pgctl /usr/pgsql-12/bin/pg_ctl --monitor 'postgres://autoctl_node@postgres1-m/pg_auto_failover?sslmode=require'
22:57:42 23727 INFO  Using default --ssl-mode "require"
22:57:42 23727 INFO  Using --ssl-self-signed: pg_autoctl will create self-signed certificates, allowing for encrypted network traffic
22:57:42 23727 WARN  Self-signed certificates provide protection against eavesdropping; this setup does NOT protect against Man-In-The-Middle attacks nor Impersonation attacks.
22:57:42 23727 WARN  See https://www.postgresql.org/docs/current/libpq-ssl.html for details
22:57:42 23727 INFO  Started pg_autoctl postgres service with pid 23730
22:57:42 23727 INFO  Started pg_autoctl node-init service with pid 23731
22:57:42 23730 INFO   /usr/pgsql-12/bin/pg_autoctl do service postgres --pgdata ha -v
22:57:42 23731 INFO  Registered node 4 (postgres1-3:5432) with name "node_3" in formation "default", group 0, state "wait_standby"
22:57:42 23731 INFO  Writing keeper state file at "/home/postgres/.local/share/pg_autoctl/home/postgres/ha/pg_autoctl.state"
22:57:42 23731 INFO  Writing keeper init state file at "/home/postgres/.local/share/pg_autoctl/home/postgres/ha/pg_autoctl.init"
22:57:42 23731 INFO  Successfully registered as "wait_standby" to the monitor.
22:57:42 23731 INFO  FSM transition from "init" to "wait_standby": Start following a primary
22:57:42 23731 INFO  Transition complete: current state is now "wait_standby"
22:57:42 23731 INFO  New state for node 1 "node_1" (postgres1-1:5432): primary ➜ join_primary
22:57:42 23731 INFO  New state for node 1 "node_1" (postgres1-1:5432): join_primary ➜ join_primary
22:57:42 23731 INFO  Still waiting for the monitor to drive us to state "catchingup"
22:57:42 23731 WARN  Please make sure that the primary node is currently running `pg_autoctl run` and contacting the monitor.
22:57:42 23731 INFO  FSM transition from "wait_standby" to "catchingup": The primary is now ready to accept a standby
22:57:42 23731 INFO  Initialising PostgreSQL as a hot standby
22:57:42 23731 INFO   /usr/pgsql-12/bin/pg_basebackup -w -d application_name=pgautofailover_standby_4 host=postgres1-1 port=5432 user=pgautofailover_replicator sslmode=require --pgdata /home/postgres/backup/node_3 -U pgautofailover_replicator --verbose --progress --max-rate 100M --wal-method=stream --slot pgautofailover_standby_4
22:57:42 23731 INFO  pg_basebackup: initiating base backup, waiting for checkpoint to complete
22:57:42 23731 INFO  pg_basebackup: checkpoint completed
22:57:42 23731 INFO  pg_basebackup: write-ahead log start point: 0/A000028 on timeline 3
22:57:42 23731 INFO  pg_basebackup: starting background WAL receiver
22:57:42 23731 INFO      0/33550 kB (0%), 0/1 tablespace (...tgres/backup/node_3/backup_label)
22:57:43 23731 INFO  26835/33550 kB (79%), 0/1 tablespace (...ackup/node_3/base/16385/2616_fsm)
22:57:43 23731 INFO  33559/33559 kB (100%), 0/1 tablespace (.../backup/node_3/global/pg_control)
22:57:43 23731 INFO  33559/33559 kB (100%), 1/1 tablespace                                         
22:57:43 23731 INFO  pg_basebackup: write-ahead log end point: 0/A000138
22:57:43 23731 INFO  pg_basebackup: waiting for background process to finish streaming ...
22:57:43 23731 INFO  pg_basebackup: syncing data to disk ...
22:57:43 23731 INFO  pg_basebackup: base backup completed
22:57:43 23731 INFO  Creating the standby signal file at "/home/postgres/ha/standby.signal", and replication setup at "/home/postgres/ha/postgresql-auto-failover-standby.conf"
22:57:43 23731 INFO  Contents of "/home/postgres/ha/postgresql-auto-failover-standby.conf" have changed, overwriting
22:57:43 23731 INFO   /usr/bin/openssl req -new -x509 -days 365 -nodes -text -out /home/postgres/ha/server.crt -keyout /home/postgres/ha/server.key -subj "/CN=postgres1-3"
22:57:43 23739 INFO   /usr/pgsql-12/bin/postgres -D /home/postgres/ha -p 5432 -h *
22:57:43 23731 INFO  PostgreSQL started on port 5432
22:57:43 23731 INFO  Fetched current list of 2 other nodes from the monitor to update HBA rules, including 2 changes.
22:57:43 23731 INFO  Ensuring HBA rules for node 1 "node_1" (postgres1-1:5432)
22:57:43 23731 INFO  Ensuring HBA rules for node 2 "node_2" (postgres1-2:5432)
22:57:43 23731 INFO  Transition complete: current state is now "catchingup"
22:57:43 23730 INFO  Postgres is now serving PGDATA "/home/postgres/ha" on port 5432 with pid 23739
22:57:43 23731 INFO  keeper has been successfully initialized.
22:57:43 23727 WARN  pg_autoctl service node-init exited with exit status 0
22:57:43 23730 INFO  Postgres controller service received signal SIGTERM, terminating
22:57:43 23730 INFO  Stopping pg_autoctl postgres service
22:57:43 23730 INFO  /usr/pgsql-12/bin/pg_ctl --pgdata /home/postgres/ha --wait stop --mode fast
22:57:44 23727 INFO  Stop pg_autoctl

5. Run pg_autoctl run or, if using systemd:
sudo systemctl start pgautofailover
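
If you use systemd and need to (re)generate the unit file for this node, pg_autoctl can print one. This is a sketch, assuming your version supports the show systemd sub-command; review the generated file before installing it:

pg_autoctl -q show systemd --pgdata ha > pgautofailover.service
sudo mv pgautofailover.service /etc/systemd/system/pgautofailover.service
sudo systemctl daemon-reload
sudo systemctl enable pgautofailover
sudo systemctl start pgautofailover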

Notice that the node ID in pg_autoctl show state increments from 3 to 4:
[postgres@postgres1-m ~]$ pg_autoctl show state
  Name |  Node |        Host:Port |       LSN | Reachable |       Current State |      Assigned State
-------+-------+------------------+-----------+-----------+---------------------+--------------------
node_1 |     1 | postgres1-1:5432 | 0/B000148 |       yes |             primary |             primary
node_2 |     2 | postgres1-2:5432 | 0/B000148 |       yes |           secondary |           secondary
node_3 |     4 | postgres1-3:5432 | 0/B000148 |       yes |           secondary |           secondary
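
The new ID comes from the monitor: dropping node_3 removed registration 3, and re-creating the node registered it under the next available node ID (4, as seen in the "Registered node 4" line of the create output above). To review these registration and drop events, you can query the monitor's event log (a sketch; <monitor PGDATA> is a placeholder for the monitor's data directory):

pg_autoctl show events --pgdata <monitor PGDATA> --count 20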


Drop from the Postgres host

You can also drop/destroy the node from the postgres1-3 host. To preserve the data directory for RCA, drop the node in this sequence (a consolidated sketch follows the list):

1. pg_autoctl stop or sudo systemctl stop pgautofailover
2. pg_autoctl drop node
3. mv ha ha_back
4. pg_autoctl drop node --destroy
5. Follow from step 4 in the previous section
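
Put together, a minimal sketch of that sequence on postgres1-3, assuming $PGDATA points at /home/postgres/ha as shown earlier:

[postgres@postgres1-3 ~]$ pg_autoctl stop                  # or: sudo systemctl stop pgautofailover
[postgres@postgres1-3 ~]$ pg_autoctl drop node
[postgres@postgres1-3 ~]$ mv ha ha_back
[postgres@postgres1-3 ~]$ pg_autoctl drop node --destroy

Then re-create and start the node as described from step 4 of the previous section.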

This procedure works as of pg_auto_failover version 1.4.2.