Restore Failing after Upgrade - Postgres instance stuck in Init:CrashLoopBackOff phase

Article ID: 293322


Updated On:

Products

VMware Tanzu SQL

Issue/Introduction

In some cases, when restoring a backup in a Postgres for Kubernetes environment, one of the Postgres instance pods goes into the Init:CrashLoopBackOff state and repeatedly tries to recover, but remains stuck in a loop. The pod log shows the following error and continues to repeat it:
15:17:03 10481 ERROR   pg_autoctl service is not running, changes will only apply at next start of pg_autoctl
pg-instance-0 : creating postgres and/or pg_auto_failover state (/pgsql/data)...
15:17:03 10484 INFO  Started pg_autoctl postgres service with pid 10486
15:17:03 10486 INFO   /opt/vmware/postgres/14/bin/pg_autoctl do service postgres --pgdata /pgsql/data -v
15:17:03 10484 INFO  Started pg_autoctl node-init service with pid 10487
15:17:03 10487 INFO  Continuing from a previous `pg_autoctl create` failed attempt
15:17:03 10487 INFO  PostgreSQL state at registration time was: PGDATA does not exist
15:17:03 10487 INFO  FSM transition from "wait_standby" to "catchingup": The primary is now ready to accept a standby
15:17:03 10487 INFO  Initialising PostgreSQL as a hot standby
15:17:04 10487 ERROR Failed to request TIMELINE_HISTORY: ERROR:  could not open file "pg_wal/00000002.history": No such file or directory

15:17:04 10487 ERROR Failed to connect to the primary with a replication connection string. See above for details
15:17:04 10487 ERROR Failed to initialize standby server, see above for details
15:17:04 10487 ERROR Failed to transition from state "wait_standby" to state "catchingup", see above.
15:17:04 10484 ERROR pg_autoctl service node-init exited with exit status 12
15:17:04 10484 INFO  Restarting service node-init
15:17:04 10489 INFO  Continuing from a previous `pg_autoctl create` failed attempt
15:17:04 10489 INFO  PostgreSQL state at registration time was: PGDATA does not exist
15:17:04 10489 INFO  FSM transition from "wait_standby" to "catchingup": The primary is now ready to accept a standby
15:17:04 10489 INFO  Initialising PostgreSQL as a hot standby
15:17:04 10489 ERROR Failed to request TIMELINE_HISTORY: ERROR:  could not open file "pg_wal/00000002.history": No such file or directory

15:17:04 10489 ERROR Failed to connect to the primary with a replication connection string. See above for details
15:17:04 10489 ERROR Failed to initialize standby server, see above for details
15:17:04 10489 ERROR Failed to transition from state "wait_standby" to state "catchingup", see above.
15:17:04 10484 ERROR pg_autoctl service node-init exited with exit status 12
15:17:04 10484 INFO  Restarting service node-init
15:17:04 10490 INFO  Continuing from a previous `pg_autoctl create` failed attempt 
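If needed, the messages above can be pulled directly from the failing init container with kubectl logs. The container name below is a placeholder; list the pod's init containers first with kubectl describe:

kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -c <init-container-name> -n <namespace> --previous

The --previous flag returns the log from the last crashed attempt of that container.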
The output of kubectl get all -n <namespace> will look similar to this:
[postgres@postgres-operator ~]$ kubectl get all
NAME                                 READY   STATUS                  RESTARTS   AGE
pod/backup-incremental-<timestamp>   0/1     Completed               0          16m
pod/backup-incremental-<timestamp>   0/1     Completed               0          90s
pod/postgres-sample-0                0/5     Init:CrashLoopBackOff   178        19h
pod/postgres-sample-1                5/5     Running                 0          19h
pod/postgres-sample-monitor-0        4/4     Running                 0          19h

NAME                                           STATUS       SOURCE BACKUP        TARGET INSTANCE   TIME STARTED          TIME COMPLETED
postgresrestore.sql.tanzu.vmware.com/restore   Finalizing   pg-instance-backup   pg-instance       2021-12-06T23:14:48
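The restore resource shown above can be inspected in more detail for events and status conditions; the resource name restore matches the output above, so adjust it for your environment:

kubectl get postgresrestore -n <namespace>
kubectl describe postgresrestore restore -n <namespace>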
Typically, this issue only occurs when the High Availability setting on the Postgres instance is set to true (see the check below).
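A quick way to confirm that HA is enabled on the affected instance is to check its spec. The spec.highAvailability.enabled field path below is an assumption based on a typical Postgres custom resource manifest, so verify it against your own instance YAML:

kubectl get postgres <instance-name> -n <namespace> -o jsonpath='{.spec.highAvailability.enabled}'
kubectl get postgres <instance-name> -n <namespace> -o yaml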


Environment

Product Version: Other

Resolution

This issue will be fixed in a future version of VMware Tanzu SQL with Postgres for Kubernetes.