In some cases, when restoring a backup in a Postgres for Kubernetes environment, one of the pods (a Postgres instance) enters the Init:CrashLoopBackOff state and repeatedly tries to heal itself, but remains stuck in a loop with the following error. The pod log shows the error repeating:
15:17:03 10481 ERROR pg_autoctl service is not running, changes will only apply at next start of pg_autoctl
pg-instance-0 : creating postgres and/or pg_auto_failover state (/pgsql/data)...
15:17:03 10484 INFO Started pg_autoctl postgres service with pid 10486
15:17:03 10486 INFO /opt/vmware/postgres/14/bin/pg_autoctl do service postgres --pgdata /pgsql/data -v
15:17:03 10484 INFO Started pg_autoctl node-init service with pid 10487
15:17:03 10487 INFO Continuing from a previous `pg_autoctl create` failed attempt
15:17:03 10487 INFO PostgreSQL state at registration time was: PGDATA does not exist
15:17:03 10487 INFO FSM transition from "wait_standby" to "catchingup": The primary is now ready to accept a standby
15:17:03 10487 INFO Initialising PostgreSQL as a hot standby
15:17:04 10487 ERROR Failed to request TIMELINE_HISTORY: ERROR: could not open file "pg_wal/00000002.history": No such file or directory
15:17:04 10487 ERROR Failed to connect to the primary with a replication connection string. See above for details
15:17:04 10487 ERROR Failed to initialize standby server, see above for details
15:17:04 10487 ERROR Failed to transition from state "wait_standby" to state "catchingup", see above.
15:17:04 10484 ERROR pg_autoctl service node-init exited with exit status 12
15:17:04 10484 INFO Restarting service node-init
15:17:04 10489 INFO Continuing from a previous `pg_autoctl create` failed attempt
15:17:04 10489 INFO PostgreSQL state at registration time was: PGDATA does not exist
15:17:04 10489 INFO FSM transition from "wait_standby" to "catchingup": The primary is now ready to accept a standby
15:17:04 10489 INFO Initialising PostgreSQL as a hot standby
15:17:04 10489 ERROR Failed to request TIMELINE_HISTORY: ERROR: could not open file "pg_wal/00000002.history": No such file or directory
15:17:04 10489 ERROR Failed to connect to the primary with a replication connection string. See above for details
15:17:04 10489 ERROR Failed to initialize standby server, see above for details
15:17:04 10489 ERROR Failed to transition from state "wait_standby" to state "catchingup", see above.
15:17:04 10484 ERROR pg_autoctl service node-init exited with exit status 12
15:17:04 10484 INFO Restarting service node-init
15:17:04 10490 INFO Continuing from a previous `pg_autoctl create` failed attempt
The output of kubectl get all -n <namespace> will look similar to this:
[postgres@postgres-operator ~]$ kubectl get all
NAME                                 READY   STATUS                  RESTARTS   AGE
pod/backup-incremental-<timestamp>   0/1     Completed               0          16m
pod/backup-incremental-<timestamp>   0/1     Completed               0          90s
pod/postgres-sample-0                0/5     Init:CrashLoopBackOff   178        19h
pod/postgres-sample-1                5/5     Running                 0          19h
pod/postgres-sample-monitor-0        4/4     Running                 0          19h

NAME                                           STATUS       SOURCE BACKUP        TARGET INSTANCE   TIME STARTED          TIME COMPLETED
postgresrestore.sql.tanzu.vmware.com/restore   Finalizing   pg-instance-backup   pg-instance       2021-12-06T23:14:48
This issue normally occurs only when the instance's High Availability setting is set to true.
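For reference, High Availability is typically enabled in the Postgres custom resource spec. The following is a minimal sketch only; the field names and apiVersion assume the VMware Postgres operator schema (suggested by the sql.tanzu.vmware.com resource names in the output above) and should be verified against the CRD shipped with your operator version:

```yaml
# Hypothetical Postgres custom resource excerpt -- field names are an
# assumption based on the VMware Postgres operator; check your CRD first.
apiVersion: sql.tanzu.vmware.com/v1
kind: Postgres
metadata:
  name: postgres-sample
spec:
  highAvailability:
    enabled: true   # the crash loop described above has only been observed with HA enabled
```

With HA enabled, the instance runs multiple pods (as in the postgres-sample-0/1 output above), and it is one of these replicas that can become stuck re-initializing as a standby after a restore.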