Tanzu Hub postgresql-init job fails to complete during the hubsm-install errand at installation

Products

VMware Tanzu Platform - Hub

Issue/Introduction

When installing Tanzu Hub 10.4 with three PostgreSQL VMs, the postgresql-init job fails during the hubsm-install errand, and the Ops Manager GUI reports the following errors:

10:20:29PM: ^ Reconcile failed: message: kapp: Error: waiting on reconcile packageinstall/postgresql-init (packaging.carvel.dev/v1alpha1) namespace: tanzusm:
Finished waiting unsuccessfully:
              Reconcile failed: message: kapp:
               Error: waiting on reconcile job/postgresql-init-job (batch/v1) namespace: tanzusm:
            Finished waiting unsuccessfully:
              Failed with reason BackoffLimitExceeded:
                Job has reached the specified backoff limit

From an SSH session on the Registry VM, listing the pods in the tanzusm namespace shows that all three postgresql pods are up and running and the package install completed successfully, but the postgresql-init and postgresql-migration job pods are in the Error state:

registry/########-####-####-####-############:~$ kubectl get pods -n tanzusm | grep post
postgres-operator-<REPLICASET>-<POD_ID> 1/1 Running 0 60m
postgresql-0 4/4 Running 0 54m
postgresql-1 4/4 Running 0 54m
postgresql-2 4/4 Running 0 54m
postgresql-backup-0 1/1 Running 0 54m
postgresql-backup-cleanup-<BACKUP_ID> 0/1 Completed 0 12m
postgresql-backup-cleanup-<BACKUP_ID> 0/1 Completed 0 2m8s
postgresql-init-job-<POD_ID> 0/1 Error 0 52m
postgresql-init-job-<POD_ID> 0/1 Error 0 52m
postgresql-init-job-<POD_ID> 0/1 Error 0 51m
postgresql-init-job-<POD_ID> 0/1 Error 0 47m
postgresql-init-job-<POD_ID> 0/1 Error 0 41m
postgresql-init-job-<POD_ID> 0/1 Error 0 49m
postgresql-init-job-<POD_ID> 0/1 Error 0 51m
postgresql-migration-job-<POD_ID> 0/1 Error 0 51m
postgresql-migration-job-<POD_ID> 0/1 Error 0 41m
postgresql-migration-job-<POD_ID> 0/1 Error 0 47m
postgresql-migration-job-<POD_ID> 0/1 Error 0 52m
postgresql-migration-job-<POD_ID> 0/1 Error 0 52m
postgresql-migration-job-<POD_ID> 0/1 Error 0 51m
postgresql-migration-job-<POD_ID> 0/1 Error 0 49m

The logs of the postgresql-init pods show authentication failures:

registry/########-####-####-####-############:~$ kubectl logs -n tanzusm postgresql-init-job-hw4dt
Running postgres init script...
Checking app user...
psql: error: connection to server at "postgresql" (<POD_IP_ADDRESS>), port 5432 failed: FATAL: password authentication failed for user "pgadmin"

Because many other pods depend on the postgresql-init and postgresql-migration jobs completing, secondary job failures are likely as a result of this condition:

registry/########-####-####-####-############:~$ kubectl get jobs -n tanzusm
NAME STATUS COMPLETIONS DURATION AGE
clickhouse-readonly-recovery-job Complete 1/1 4m 82m
contour-contour-certgen Complete 1/1 9s 83m
ensemble-application-metadata-lemans-hook-job Failed 0/1 62m 62m
ensemble-application-metadata-topic-hook-job Complete 1/1 49s 62m
ensemble-observability-store-lemans-hook-job Failed 0/1 62m 62m
ensemble-observability-store-topic-hook-job Complete 1/1 62s 62m
graphql-rest-provider-service-lemans-hook-job Failed 0/1 62m 62m
inventory-service-kafka-hook-job Complete 1/1 29s 64m
onboard-cas Running 0/1 49m 49m
onboard-partitions Running 0/1 49m 49m
onboard-scheduler Running 0/1 49m 49m
onboard-system Running 0/1 49m 49m
onboard-tpsm-org-inventory Running 0/1 49m 49m
postgresql-backup-cleanup-######## Complete 1/1 3s 15m
postgresql-backup-cleanup-######## Complete 1/1 3s 5m27s
postgresql-init-job Failed 0/1 76m 76m
postgresql-migration-job Failed 0/1 76m 76m
scheduled-backup-######## Complete 1/1 3s 55m
spring-ingestion-service-lemans-hook-job Failed 0/1 62m 62m
tas-ingestion-service-lemans-hook-job Failed 0/1 61m 61m
temporal-setup-db Failed 0/1 64m 64m
vss-cloud-accounts-service-lemans-hook-job Failed 0/1 64m 64m

Environment

Tanzu Hub 10.4

Cause

The postgresql-init and postgresql-migration jobs do not wait for the postgresql cluster to become functional. If the cluster takes longer than expected to start, a race condition can occur: the Secrets and ConfigMaps created for the init and migration jobs contain credentials that do not match those in the Postgres cluster, which leads to the invalid credential errors.

In Kubernetes, Jobs are meant to run to completion. If their pods fail because of an external factor (such as the database not being ready yet), Kubernetes retries them with an increasing back-off delay until the backoff limit is reached, which is the BackoffLimitExceeded reason reported above. However, if the failure is caused by a credential Secret that was updated after the job started, the job might keep retrying with the old environment variables.
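The missing wait described above can be pictured as a small retry loop placed in front of the real work. The following is a minimal sketch for illustration only. It is not the change shipped in the product, and the function name retry and its arguments are hypothetical.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND repeatedly until it succeeds,
# giving up after MAX_ATTEMPTS tries. Returns 0 on success, 1 otherwise.
retry() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 0
}
```

For example, `retry 30 10 pg_isready -h postgresql -p 5432` would poll the server up to 30 times at 10-second intervals, assuming pg_isready is available where the command runs. Note that pg_isready only confirms that the server accepts connections. It does not validate credentials, so a gate like this addresses readiness but not credential synchronization.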

Resolution

This issue is resolved in Tanzu Hub 10.4.1, the first patch release.

Workaround:

Run the following procedure from an SSH session on the Registry VM in the Tanzu Hub BOSH deployment.

Clear the Failed Jobs

The following command lists all jobs in the tanzusm namespace with a status of Failed and deletes them. Because the jobs are then missing but still defined in the desired state, kapp-controller recreates them, this time pulling the correct, synchronized credentials.

# kubectl get jobs -n tanzusm | grep Failed | awk '{print $1}' | xargs kubectl delete jobs -n tanzusm
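Before deleting anything, it can help to preview which jobs the filter would match. The following is a minimal sketch, assuming the column layout shown in the get jobs output earlier (STATUS in the second column). failed_jobs is a hypothetical helper name, and --no-headers is the standard kubectl flag that suppresses the header row.

```shell
# Hypothetical helper: print the names of jobs whose STATUS column is "Failed".
# Expects `kubectl get jobs --no-headers` output on stdin, laid out as
# NAME STATUS COMPLETIONS DURATION AGE (an assumption based on the output above).
failed_jobs() {
  awk '$2 == "Failed" { print $1 }'
}

# Preview only; nothing is deleted by this pipeline:
#   kubectl get jobs -n tanzusm --no-headers | failed_jobs
```

Matching on the second column exactly, rather than grepping the whole line, avoids picking up a job whose name happens to contain the word Failed.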

Monitor for Re-creation

Once the jobs are deleted, kapp-controller notices the discrepancy with the desired state. Watch the jobs being recreated in real time:

# kubectl get jobs -n tanzusm -w
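As an alternative to watching interactively, kubectl wait can block until a job reports the Complete condition. The following is a sketch under stated assumptions: wait_for_job is a hypothetical wrapper name, and the 5m timeout in the usage line is an arbitrary example value.

```shell
# Hypothetical wrapper around `kubectl wait`.
# Usage: wait_for_job NAMESPACE JOB_NAME TIMEOUT
# Blocks until the job has the Complete condition or the timeout expires.
wait_for_job() {
  kubectl wait --for=condition=complete --timeout="$3" -n "$1" "job/$2"
}

# Example:
#   wait_for_job tanzusm postgresql-init-job 5m
```

kubectl wait returns an error immediately if the job does not exist yet, so run it only after the recreated job appears in the listing. If the job fails again, the command does not return early; it exits non-zero once the timeout expires.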

Verify the Fix

After the new jobs appear, check the status of the postgresql-init-job. It should transition from Running to Complete. You can also check the logs of the new pod to confirm that the authentication error is gone:

- Get the name of the new init pod:
  
  # kubectl get pods -n tanzusm | grep postgresql-init
- Check the logs:
  
  # kubectl logs -n tanzusm <new-pod-name>
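The log check can also be scripted by searching for the error text shown in the Issue section. The following is a minimal sketch; has_auth_failure is a hypothetical helper name, and the match string is taken from the psql error above.

```shell
# Hypothetical helper: succeeds (exit 0) if the log text on stdin contains
# the PostgreSQL password authentication error, and fails otherwise.
has_auth_failure() {
  grep -q 'password authentication failed'
}

# Example:
#   if kubectl logs -n tanzusm <new-pod-name> | has_auth_failure; then
#     echo "authentication error still present"
#   fi
```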

Cascade Recovery

As shown in the get jobs output above, several hook jobs (such as the ensemble, graphql-rest-provider-service, and tas-ingestion-service lemans-hook-jobs) were also failing. These depend on the database being initialized. Once the postgresql-init-job succeeds, these dependent jobs should either succeed automatically on their next retry or can be deleted with the command from the Clear the Failed Jobs step to force a fresh start.

Deleting the job is the cleanest way to force Kubernetes to re-read the latest Secrets and ConfigMaps.
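To track the cascade recovery at a glance, the job listing can be reduced to a count per status. The following is a minimal sketch, again assuming STATUS is the second column of the get jobs output. job_status_summary is a hypothetical helper name.

```shell
# Hypothetical helper: count jobs per STATUS value.
# Expects `kubectl get jobs --no-headers` output on stdin, with STATUS in
# the second column (an assumption based on the output shown earlier).
# Prints one "STATUS COUNT" line per status, sorted by status name.
job_status_summary() {
  awk 'NF >= 2 { count[$2]++ } END { for (s in count) print s, count[s] }' | sort
}

# Example:
#   kubectl get jobs -n tanzusm --no-headers | job_status_summary
```

When no Failed line remains in the summary and the Running count drops to zero, the dependent jobs have finished recovering.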