How to reinitialise a Postgres replica with pg_autoctl

Products

VMware Tanzu SQL

Issue/Introduction

If the replica cannot sync with the primary DB, it may be necessary to reinitialise the replica.
This can happen if the replica needs a WAL file that has already been deleted from the primary.

This KB describes the steps to reinitialise the replica on Kubernetes.

Environment

Product Version: 1.6

Resolution

List the pods in the namespace

kubectl get pods -n <namespace>

Example:

# kubectl get pods -n ns-app01-postgres
NAME                                 READY   STATUS    RESTARTS   AGE
postgres-0                           5/5     Running   0          14d
postgres-1                           4/5     Running   0          2d20h
postgres-monitor-0                   4/4     Running   0          9d
postgres-operator-6cdf6bd9b7-qb8qq   1/1     Running   0          7d

Note: postgres-1 has 4 of 5 containers ready. This is the replica in the example that needs to be reinitialised.
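
In a namespace with many pods, a small filter can pick out the pods that are not fully ready. The following is a minimal sketch, assuming the default "kubectl get pods" column layout shown above; the function name not_ready_pods is hypothetical and not part of the product.

```shell
# List pods whose READY column reports fewer ready containers than total.
# Reads `kubectl get pods` output on stdin (assumes the default column layout).
not_ready_pods() {
  awk 'NR > 1 { split($2, r, "/"); if (r[1] + 0 < r[2] + 0) print $1 }'
}

# Usage (namespace is a placeholder):
#   kubectl get pods -n <namespace> | not_ready_pods
```

With the example output above, this would print only postgres-1.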

Confirm that the node is a replica

kubectl get pod <replica-pod-name> -n <namespace> --show-labels

Example:

# kubectl get pod postgres-1 -n ns-app01-postgres --show-labels
NAME                READY   STATUS    RESTARTS   AGE   LABELS
postgres-0          4/5     Running   0          27m   app=postgres,controller-revision-hash=postgres-sample-84574d54c8,headless-service=postgres-sample,postgres-instance=postgres-sample,role=read,statefulset.kubernetes.io/pod-name=postgres-sample-0,type=data

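Note: "role=read" means the pod is a replica. A primary pod would have "role=read-write". The pod and instance names in the sample output differ from the command; only the role label matters here.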
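
The role can also be extracted from the label string instead of being read by eye. This is a minimal sketch, assuming the comma-separated LABELS format printed by --show-labels; the function name role_from_labels is hypothetical.

```shell
# Print the value of the "role" label from a comma-separated label string,
# such as the LABELS column printed by --show-labels.
role_from_labels() {
  printf '%s\n' "$1" | tr ',' '\n' | sed -n 's/^role=//p'
}

# Usage:
#   role_from_labels "app=postgres,role=read,type=data"
# Alternatively, kubectl can print the label value directly:
#   kubectl get pod <replica-pod-name> -n <namespace> -o jsonpath='{.metadata.labels.role}'
```

A result of "read" indicates a replica; "read-write" indicates the primary.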

Exec into the replica node that needs to be reinitialised

kubectl exec -it <replica-pod-name> -n <namespace> -c postgres-sidecar -- bash

Example:

# kubectl exec -it -n ns-app01-postgres postgres-1 -c postgres-sidecar -- bash
postgres@postgres-1:/$

On the replica pod, suspend the pg_autoctl process if it is running

Check whether pg_autoctl is running, then suspend it by sending it the STOP signal with "kill -s STOP"

ps -aef | egrep pg_auto
kill -s STOP $(pgrep -f "pg_autoctl: start/stop postgres")
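
The STOP signal suspends the process rather than terminating it, so the process still appears in the "ps" listing afterwards. To confirm that it is suspended, check that its state is T (stopped). This is a minimal sketch, assuming the "ps" in the container supports "-o stat=" and that the pgrep pattern matches a single process; the function name is_suspended is hypothetical.

```shell
# Succeed if the process with the given PID is currently stopped (state T).
# Assumes a ps that supports `-o stat=` (no header) and `-p PID`.
is_suspended() {
  state=$(ps -o stat= -p "$1" | tr -d ' ' | cut -c1)
  [ "$state" = "T" ]
}

# Usage (same pgrep pattern as the kill command above):
#   is_suspended "$(pgrep -f "pg_autoctl: start/stop postgres")" && echo "pg_autoctl is suspended"
```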

On the replica pod, stop the Postgres process, if it is running

pg_ctl stop

On the replica pod, reinitialise the replica

PGPASSWORD=$(pg_autoctl config get replication.password) PG_AUTOCTL_DEBUG=true pg_autoctl do standby init <primary-pod-name>.$(hostname -d) 5432

Example:

# PGPASSWORD=$(pg_autoctl config get replication.password) PG_AUTOCTL_DEBUG=true pg_autoctl do standby init postgres-0.$(hostname -d) 5432
 
Defaulted container "pg-container" out of: pg-container, instance-logging, reconfigure-instance, postgres-metrics-exporter, postgres-sidecar
16:14:56 43453 INFO Initialising PostgreSQL as a hot standby
16:14:56 43453 INFO Target directory exists: "/pgsql/data", stopping PostgreSQL
16:14:57 43453 INFO /opt/vmware/postgres/15/bin/pg_basebackup -w -d 'application_name=pgautofailover_standby_0 host=postgres-1.postgres-agent.default.svc.cluster.local port=5432 user=pgautofailover_replicator ' --pgdata /pgsql/backup/ -U pgautofailover_replicator --verbose --progress --max-rate 100M --wal-method=stream
16:14:57 43453 INFO pg_basebackup: initiating base backup, waiting for checkpoint to complete
16:14:57 43453 INFO pg_basebackup: checkpoint completed
16:14:57 43453 INFO pg_basebackup: write-ahead log start point: 0/4000028 on timeline 2
16:14:57 43453 INFO pg_basebackup: starting background WAL receiver
16:14:57 43453 INFO pg_basebackup: created temporary replication slot "pg_basebackup_618400"
16:14:58 43453 INFO 38802/54859 kB (70%), 0/1 tablespace (/pgsql/backup//base/24627/14027_vm )
16:14:58 43453 INFO 54870/54870 kB (100%), 0/1 tablespace (/pgsql/backup//global/pg_control )
16:14:58 43453 INFO 54870/54870 kB (100%), 1/1 tablespace
16:14:58 43453 INFO pg_basebackup: write-ahead log end point: 0/4000100
16:14:58 43453 INFO pg_basebackup: waiting for background process to finish streaming ...
16:14:58 43453 INFO pg_basebackup: syncing data to disk ...
16:15:00 43453 INFO pg_basebackup: renaming backup_manifest.tmp to backup_manifest
16:15:00 43453 INFO pg_basebackup: base backup completed
16:15:01 43453 INFO Creating the standby signal file at "/pgsql/data/standby.signal", and replication setup at "/pgsql/data/postgresql-auto-failover-standby.conf"
16:15:01 43453 INFO Contents of "/pgsql/data/postgresql-auto-failover-standby.conf" have changed, overwriting
16:15:01 43453 INFO Contents of "/pgsql/data/postgresql-auto-failover.conf" have changed, overwriting
16:15:02 43453 WARN Failed to read Postgres "postmaster.pid" file
16:15:11 43453 ERROR Failed to open file "/pgsql/data/postmaster.pid": No such file or directory
16:15:11 43453 INFO Is PostgreSQL at "/pgsql/data" up and running?
16:15:11 43453 ERROR Failed to get Postgres pid, see above for details
16:15:11 43453 ERROR Failed to ensure that Postgres is running in "/pgsql/data"
16:15:11 43453 FATAL Failed to grant access to the standby by adding relevant lines to pg_hba.conf for the standby hostname and user, see above for details
command terminated with exit code 4

Note: Postgres fails to start at the end of this command (see the ERROR and FATAL lines). This is expected; it will be started properly in the next step.
Note: The pod specified in the command is the primary pod.
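
Because the command exits with an error even when it has done its job, it helps to check that the failure happened only at the final start-up stage. This is a minimal sketch, assuming the log messages match the example output above; the function name reinit_reached_startup and the file /tmp/reinit.log are hypothetical.

```shell
# Succeed if the captured output shows that the base backup finished and the
# standby signal file was written, i.e. only the final Postgres start-up failed.
# Reads the captured command output on stdin.
reinit_reached_startup() {
  log=$(cat)
  printf '%s\n' "$log" | grep -q 'pg_basebackup: base backup completed' &&
    printf '%s\n' "$log" | grep -q 'Creating the standby signal file'
}

# Usage: capture the output of the reinitialise command, then check it.
#   <reinitialise command from above> 2>&1 | tee /tmp/reinit.log
#   reinit_reached_startup < /tmp/reinit.log && echo "base backup OK, continue with the next step"
```

If the check fails, the base backup itself did not complete and the earlier lines of the output should be reviewed before deleting the pod.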

Delete the replica pod

kubectl delete pod -n <namespace> <replica-pod-name>

Example:

# kubectl delete pod -n ns-app01-postgres postgres-1 --grace-period=0
pod "postgres-1" deleted

Note: It may be necessary to add the flag "--force=true" if the pod does not terminate.
Note: The pod will automatically be recreated and Postgres should start as a replica.
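
Rather than polling by hand, "kubectl wait" can block until the recreated pod reports Ready. This is a minimal sketch, assuming the pod is recreated under the same name; the function name wait_for_replica, the KUBECTL variable, and the 300s default timeout are hypothetical choices.

```shell
# Allow the kubectl binary to be overridden; defaults to "kubectl".
KUBECTL="${KUBECTL:-kubectl}"

# Block until the given pod reports the Ready condition or the timeout expires.
# Arguments: <namespace> <pod-name> [timeout, default 300s]
wait_for_replica() {
  "$KUBECTL" wait --for=condition=Ready "pod/$2" -n "$1" --timeout="${3:-300s}"
}

# Usage:
#   wait_for_replica ns-app01-postgres postgres-1
```

If the command is run before the replacement pod object exists, it may report that the pod was not found; in that case, run it again after a few seconds.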

Check status of the pods

kubectl get pods -n <namespace>

Example:

# kubectl get pods -n ns-app01-postgres
NAME                                 READY   STATUS    RESTARTS   AGE
postgres-0                           5/5     Running   0          14d
postgres-1                           5/5     Running   0          10m
postgres-monitor-0                   4/4     Running   0          9d
postgres-operator-6cdf6bd9b7-qb8qq   1/1     Running   0          7d

Note: postgres-1 has 5 of 5 containers ready now.