The postgresql-init job fails during the hubsm-install errand, returning the following errors in the Opsman GUI:

10:20:29PM: ^ Reconcile failed: message: kapp: Error: waiting on reconcile packageinstall/postgresql-init (packaging.carvel.dev/v1alpha1) namespace: tanzusm: Finished waiting unsuccessfully: Reconcile failed: message: kapp: Error: waiting on reconcile job/postgresql-init-job (batch/v1) namespace: tanzusm: Finished waiting unsuccessfully: Failed with reason BackoffLimitExceeded: Job has reached the specified backoff limit

From an SSH to the Registry VM, when checking pods in the tanzusm namespace, you see that all 3 postgresql pods are up and running and the package install completed successfully, but postgresql-init and postgresql-migration job pods are in Error state:

registry/########-####-####-####-############:~$ kubectl get pods -n tanzusm | grep post
postgres-operator-<REPLICASET>-<POD_ID>    1/1   Running     0   60m
postgresql-0                               4/4   Running     0   54m
postgresql-1                               4/4   Running     0   54m
postgresql-2                               4/4   Running     0   54m
postgresql-backup-0                        1/1   Running     0   54m
postgresql-backup-cleanup-<BACKUP_ID>      0/1   Completed   0   12m
postgresql-backup-cleanup-<BACKUP_ID>      0/1   Completed   0   2m8s
postgresql-init-job-<POD_ID>               0/1   Error       0   52m
postgresql-init-job-<POD_ID>               0/1   Error       0   52m
postgresql-init-job-<POD_ID>               0/1   Error       0   51m
postgresql-init-job-<POD_ID>               0/1   Error       0   47m
postgresql-init-job-<POD_ID>               0/1   Error       0   41m
postgresql-init-job-<POD_ID>               0/1   Error       0   49m
postgresql-init-job-<POD_ID>               0/1   Error       0   51m
postgresql-migration-job-<POD_ID>          0/1   Error       0   51m
postgresql-migration-job-<POD_ID>          0/1   Error       0   41m
postgresql-migration-job-<POD_ID>          0/1   Error       0   47m
postgresql-migration-job-<POD_ID>          0/1   Error       0   52m
postgresql-migration-job-<POD_ID>          0/1   Error       0   52m
postgresql-migration-job-<POD_ID>          0/1   Error       0   51m
postgresql-migration-job-<POD_ID>          0/1   Error       0   49m
Checking the logs of the postgresql-init pods shows authentication failures:

registry/########-####-####-####-############:~$ kubectl logs -n tanzusm postgresql-init-job-hw4dt
Running postgres init script...
Checking app user...
psql: error: connection to server at "postgresql" (<POD_IP_ADDRESS>), port 5432 failed: FATAL: password authentication failed for user "pgadmin"
In addition to the failed postgresql-init and postgresql-migration pods, it is likely you will see secondary job failures resulting from this condition:

registry/########-####-####-####-############:~$ kubectl get jobs -n tanzusm
NAME                                             STATUS     COMPLETIONS   DURATION   AGE
clickhouse-readonly-recovery-job                 Complete   1/1           4m         82m
contour-contour-certgen                          Complete   1/1           9s         83m
ensemble-application-metadata-lemans-hook-job    Failed     0/1           62m        62m
ensemble-application-metadata-topic-hook-job     Complete   1/1           49s        62m
ensemble-observability-store-lemans-hook-job     Failed     0/1           62m        62m
ensemble-observability-store-topic-hook-job      Complete   1/1           62s        62m
graphql-rest-provider-service-lemans-hook-job    Failed     0/1           62m        62m
inventory-service-kafka-hook-job                 Complete   1/1           29s        64m
onboard-cas                                      Running    0/1           49m        49m
onboard-partitions                               Running    0/1           49m        49m
onboard-scheduler                                Running    0/1           49m        49m
onboard-system                                   Running    0/1           49m        49m
onboard-tpsm-org-inventory                       Running    0/1           49m        49m
postgresql-backup-cleanup-########               Complete   1/1           3s         15m
postgresql-backup-cleanup-########               Complete   1/1           3s         5m27s
postgresql-init-job                              Failed     0/1           76m        76m
postgresql-migration-job                         Failed     0/1           76m        76m
scheduled-backup-########                        Complete   1/1           3s         55m
spring-ingestion-service-lemans-hook-job         Failed     0/1           62m        62m
tas-ingestion-service-lemans-hook-job            Failed     0/1           61m        61m
temporal-setup-db                                Failed     0/1           64m        64m
vss-cloud-accounts-service-lemans-hook-job       Failed     0/1           64m        64m

Tanzu Hub 10.4
The postgresql-init and postgresql-migration jobs do not wait for the postgresql cluster to become functional before they run. If the cluster takes longer than expected to start, this can lead to a race condition: the Secrets and ConfigMaps created for the init and migration jobs do not match the credentials in the Postgres cluster, which produces the invalid credential errors.
In Kubernetes, Jobs are meant to run to completion. If they fail due to an external factor (like the DB not being ready yet), the failed pods are retried with a back-off delay until the job's backoff limit is reached. However, if the failure is due to a specific credential Secret that was updated after the job started, the job might keep retrying with the old environment variables.
The following procedure can be run from an SSH session on the Registry VM in the Tanzu Hub BOSH deployment.
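Because the failure comes from the jobs running before the postgresql cluster was functional, it may be worth confirming that the cluster pods are fully ready before rerunning anything. The sketch below is one way to do that check; the helper name not_ready_pods is hypothetical, and it assumes the usual column order of `kubectl get pods` (NAME, READY, STATUS).

```shell
# not_ready_pods (hypothetical helper): reads `kubectl get pods --no-headers`
# rows on stdin and prints the name of every pod whose READY column is not
# N/N or whose STATUS column is not Running.
not_ready_pods() {
  awk '{
    split($2, r, "/")                       # READY looks like "4/4"
    if (r[1] != r[2] || $3 != "Running") print $1
  }'
}

# Assumed usage on the Registry VM, limited to the postgresql-0/1/2 pods:
#   kubectl get pods -n tanzusm --no-headers | grep -E '^postgresql-[0-9]+ ' | not_ready_pods
# No output means every postgresql cluster pod reports ready.
```

If the helper prints any pod names, waiting for those pods to settle first is probably safer than deleting the jobs right away.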
The following command filters for all jobs in the tanzusm namespace with a status of Failed and deletes them. Because the deleted jobs are still defined in the desired state, kapp-controller recreates them, this time pulling the correct, synchronized credentials.
# kubectl get jobs -n tanzusm | grep Failed | awk '{print $1}' | xargs kubectl delete jobs -n tanzusm
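A slightly more defensive variant of this step can match on the STATUS column itself and print the job names for review before anything is deleted. This is only a sketch: the function names failed_job_names and delete_failed_jobs are hypothetical, and it assumes STATUS is the second column, as in the `kubectl get jobs` output shown earlier.

```shell
# failed_job_names (hypothetical helper): reads `kubectl get jobs --no-headers`
# rows on stdin and prints the NAME of each row whose STATUS column (field 2)
# is exactly "Failed".
failed_job_names() {
  awk '$2 == "Failed" { print $1 }'
}

# delete_failed_jobs (hypothetical wrapper): lists, then deletes, the Failed
# jobs in the given namespace (defaults to tanzusm).
delete_failed_jobs() {
  ns="${1:-tanzusm}"
  names="$(kubectl get jobs -n "$ns" --no-headers | failed_job_names)"
  if [ -z "$names" ]; then
    echo "no Failed jobs in $ns"
    return 0
  fi
  echo "$names"                                   # show what will be removed
  echo "$names" | xargs kubectl delete jobs -n "$ns"
}
```

Running `kubectl get jobs -n tanzusm --no-headers | failed_job_names` on its own gives a preview without deleting anything.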
Once the jobs are deleted, kapp-controller will notice the discrepancy. Watch the jobs being recreated in real time:
# kubectl get jobs -n tanzusm -w
After the new jobs appear, check the status of the postgresql-init-job. It should transition from Running to Complete. You can also verify the logs of the new pod to ensure the authentication error is gone:
# kubectl get pods -n tanzusm | grep postgresql-init
# kubectl logs -n tanzusm <new-pod-name>
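To make the log check less dependent on reading the output by eye, the log text can be scanned for the specific message seen in the original failure. The sketch below assumes the error wording is unchanged from the log shown earlier; has_auth_failure is a hypothetical helper name.

```shell
# has_auth_failure (hypothetical helper): reads pod log text on stdin and
# exits 0 if the "password authentication failed" message is present,
# non-zero otherwise.
has_auth_failure() {
  grep -q 'password authentication failed for user'
}

# Assumed usage (replace <new-pod-name> with the recreated pod):
#   if kubectl logs -n tanzusm <new-pod-name> | has_auth_failure; then
#     echo "still failing authentication"
#   else
#     echo "no authentication error in log"
#   fi
```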
As you saw in the get jobs output, several dependent jobs (such as the ensemble, graphql-rest-provider, and tas-ingestion lemans hook jobs, and temporal-setup-db) were also failing. These depend on the database being initialized. Once the postgresql-init-job succeeds, these dependent jobs should either automatically succeed on their next retry or can be deleted using the same command from the first step to force a fresh start.
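A quick per-status count can help judge whether the dependent jobs are recovering on their own or still need to be deleted. This is a minimal sketch, again assuming STATUS is the second column of `kubectl get jobs`; job_status_summary is a hypothetical helper name.

```shell
# job_status_summary (hypothetical helper): reads `kubectl get jobs --no-headers`
# rows on stdin and prints one "STATUS count" line per status, sorted by
# status name.
job_status_summary() {
  awk '{ count[$2]++ } END { for (s in count) print s, count[s] }' | sort
}

# Assumed usage on the Registry VM:
#   kubectl get jobs -n tanzusm --no-headers | job_status_summary
```

A Failed count that stays above zero after the postgresql-init-job completes suggests the remaining jobs need the delete step as well.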
Deleting the job is the cleanest way to force Kubernetes to re-read the latest Secrets and ConfigMaps.