Tanzu Hub postgres-init job failed to complete during hubsm-install errand while installing

Article ID: 439720

Products

VMware Tanzu Platform - Hub

Issue/Introduction

  • When installing Tanzu Hub 10.4 with three PostgreSQL VMs, the postgresql-init job fails during the hubsm-install errand, and the Ops Manager GUI reports the following errors:

    10:20:29PM:  ^ Reconcile failed: message: kapp: Error: waiting on reconcile packageinstall/postgresql-init (packaging.carvel.dev/v1alpha1) namespace: tanzusm:
               Finished waiting unsuccessfully:  
                  Reconcile failed: message: kapp:  
                   Error: waiting on reconcile job/postgresql-init-job (batch/v1) namespace: tanzusm:  
                Finished waiting unsuccessfully:  
                  Failed with reason BackoffLimitExceeded:  
                    Job has reached the specified backoff limit

  • From an SSH session on the Registry VM, listing the pods in the tanzusm namespace shows that all three postgresql pods are up and running and the package install completed successfully, but the postgresql-init and postgresql-migration job pods are in the Error state:

    registry/########-####-####-####-############:~$ kubectl get pods -n tanzusm | grep post
    postgres-operator-<REPLICASET>-<POD_ID>      1/1     Running     0          60m
    postgresql-0                                 4/4     Running     0          54m
    postgresql-1                                 4/4     Running     0          54m
    postgresql-2                                 4/4     Running     0          54m
    postgresql-backup-0                          1/1     Running     0          54m
    postgresql-backup-cleanup-<BACKUP_ID>        0/1     Completed   0          12m
    postgresql-backup-cleanup-<BACKUP_ID>        0/1     Completed   0          2m8s
    postgresql-init-job-<POD_ID>                 0/1     Error       0          52m
    postgresql-init-job-<POD_ID>                 0/1     Error       0          52m
    postgresql-init-job-<POD_ID>                 0/1     Error       0          51m
    postgresql-init-job-<POD_ID>                 0/1     Error       0          47m
    postgresql-init-job-<POD_ID>                 0/1     Error       0          41m
    postgresql-init-job-<POD_ID>                 0/1     Error       0          49m
    postgresql-init-job-<POD_ID>                 0/1     Error       0          51m
    postgresql-migration-job-<POD_ID>            0/1     Error       0          51m
    postgresql-migration-job-<POD_ID>            0/1     Error       0          41m
    postgresql-migration-job-<POD_ID>            0/1     Error       0          47m
    postgresql-migration-job-<POD_ID>            0/1     Error       0          52m
    postgresql-migration-job-<POD_ID>            0/1     Error       0          52m
    postgresql-migration-job-<POD_ID>            0/1     Error       0          51m
    postgresql-migration-job-<POD_ID>            0/1     Error       0          49m


  • The logs of the postgresql-init pods show authentication failures:

    registry/########-####-####-####-############:~$ kubectl logs -n tanzusm postgresql-init-job-hw4dt
    Running postgres init script...
    Checking app user...
    psql: error: connection to server at "postgresql" (<POD_IP_ADDRESS>), port 5432 failed: FATAL:  password authentication failed for user "pgadmin"

  • Because many other jobs depend on the postgresql-init and postgresql-migration jobs, you will likely see secondary job failures as a result of this condition:

    registry/########-####-####-####-############:~$ kubectl get jobs -n tanzusm
    NAME                                            STATUS      COMPLETIONS   DURATION   AGE
    clickhouse-readonly-recovery-job                Complete    1/1           4m         82m
    contour-contour-certgen                         Complete    1/1           9s         83m
    ensemble-application-metadata-lemans-hook-job   Failed      0/1           62m        62m
    ensemble-application-metadata-topic-hook-job    Complete    1/1           49s        62m
    ensemble-observability-store-lemans-hook-job    Failed      0/1           62m        62m
    ensemble-observability-store-topic-hook-job     Complete    1/1           62s        62m
    graphql-rest-provider-service-lemans-hook-job   Failed      0/1           62m        62m
    inventory-service-kafka-hook-job                Complete    1/1           29s        64m
    onboard-cas                                     Running     0/1           49m        49m
    onboard-partitions                              Running     0/1           49m        49m
    onboard-scheduler                               Running     0/1           49m        49m
    onboard-system                                  Running     0/1           49m        49m
    onboard-tpsm-org-inventory                      Running     0/1           49m        49m
    postgresql-backup-cleanup-########              Complete    1/1           3s         15m
    postgresql-backup-cleanup-########              Complete    1/1           3s         5m27s
    postgresql-init-job                             Failed      0/1           76m        76m
    postgresql-migration-job                        Failed      0/1           76m        76m
    scheduled-backup-########                       Complete    1/1           3s         55m
    spring-ingestion-service-lemans-hook-job        Failed      0/1           62m        62m
    tas-ingestion-service-lemans-hook-job           Failed      0/1           61m        61m
    temporal-setup-db                               Failed      0/1           64m        64m
    vss-cloud-accounts-service-lemans-hook-job      Failed      0/1           64m        64m
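
The checks in the list above can be scripted so that every job and every init pod is examined in one pass. The following is a minimal sketch, not part of the official procedure: `summarize_job_status` and `has_auth_failure` are hypothetical helper names, the job summary assumes that STATUS is the second column of the `kubectl get jobs` output (as in the listing above), and the match string is taken from the log line shown earlier.

```shell
# Hypothetical helper: reads `kubectl get jobs` output on stdin and prints
# one "STATUS COUNT" line per status (assumes STATUS is the second column).
summarize_job_status() {
  awk 'NR > 1 { count[$2]++ } END { for (s in count) print s, count[s] }' | sort
}

# Hypothetical helper: reads pod log text on stdin and succeeds (exit 0) if it
# contains the PostgreSQL authentication failure shown above.
has_auth_failure() {
  grep -q 'password authentication failed for user'
}

# Example use on the Registry VM (commented out; requires kubectl access):
#   kubectl get jobs -n tanzusm | summarize_job_status
#   for pod in $(kubectl get pods -n tanzusm -o name | grep postgresql-init-job); do
#     kubectl logs -n tanzusm "$pod" | has_auth_failure && echo "$pod: auth failure"
#   done
```

A large Failed count combined with authentication failures in the init pod logs matches the condition described in this article.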

Environment

Tanzu Hub 10.4

Cause

The postgresql-init and postgresql-migration jobs do not wait for the PostgreSQL cluster to become functional. If the cluster takes longer than expected to start, a race condition can occur: the Secrets and ConfigMaps created for the init and migration jobs do not match the credentials in the PostgreSQL cluster, which leads to the invalid credential errors.

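In Kubernetes, Jobs are meant to run to completion. When a Job's pod fails because of an external factor (such as the database not being ready yet), Kubernetes retries it with an increasing back-off delay until the Job's backoff limit is reached, at which point the Job is marked Failed with reason BackoffLimitExceeded. However, if the failure is caused by a credential Secret that was updated after the job started, the job might keep retrying with the old environment variables.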
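
To illustrate the kind of wait that is missing, the sketch below shows a generic retry loop in shell. It is only an illustration of the wait-before-run pattern: this article does not describe how 10.4.1 implements the fix, `retry_until` is a hypothetical helper, and the `pg_isready` example in the comment assumes that tool is available where the loop runs.

```shell
# Hypothetical helper: run a command up to MAX times, pausing DELAY seconds
# between attempts. Returns 0 on the first success, 1 if every attempt fails.
retry_until() {
  _max=$1
  _delay=$2
  shift 2
  _attempt=1
  while [ "$_attempt" -le "$_max" ]; do
    if "$@"; then
      return 0
    fi
    # Only pause when another attempt remains.
    if [ "$_attempt" -lt "$_max" ]; then
      sleep "$_delay"
    fi
    _attempt=$((_attempt + 1))
  done
  return 1
}

# Example (commented out): wait up to 30 x 10 seconds for the server to accept
# connections before running an init step.
#   retry_until 30 10 pg_isready -h postgresql -p 5432
```

Note that `pg_isready` only reports whether the server accepts connections; it does not validate credentials, so a readiness wait of this kind addresses the timing problem but not a Secret that is already out of sync.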

Resolution

This issue is resolved in Tanzu Hub 10.4.1, the first patch release of Tanzu Hub 10.4.

 

Workaround:

The following procedure can be run from an SSH session on the Registry VM in the Tanzu Hub BOSH deployment.

 

Clear the Failed Jobs

The following command finds all jobs in the tanzusm namespace with a status of Failed and deletes them. Because the jobs are then missing from the cluster but still defined in the desired state, kapp-controller recreates them, this time with the correct, synchronized credentials.

 

# kubectl get jobs -n tanzusm | grep Failed | awk '{print $1}' | xargs kubectl delete jobs -n tanzusm
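
If you prefer to preview the list before deleting anything, a variant that matches on the STATUS column only (rather than anywhere in the line) can be sketched as follows. `failed_job_names` is a hypothetical helper name, and the sketch assumes STATUS is the second column of the `kubectl get jobs` output, as in the listing earlier in this article.

```shell
# Hypothetical helper: reads `kubectl get jobs` output on stdin and prints the
# NAME of every job whose STATUS column is exactly "Failed" (header skipped).
failed_job_names() {
  awk 'NR > 1 && $2 == "Failed" { print $1 }'
}

# Example use on the Registry VM (commented out; requires kubectl access):
#   kubectl get jobs -n tanzusm | failed_job_names          # preview only
#   kubectl get jobs -n tanzusm | failed_job_names | xargs -r kubectl delete jobs -n tanzusm
```

The `-r` option of GNU xargs skips the delete entirely when no failed jobs are found.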

 

Monitor for Re-creation

Once the jobs are deleted, kapp-controller notices the discrepancy and recreates them. Watch the jobs being recreated in real time:

 

# kubectl get jobs -n tanzusm -w

 

Verify the Fix

After the new jobs appear, check the status of the postgresql-init-job. It should transition from Running to Complete. You can also verify the logs of the new pod to ensure the authentication error is gone:

 

    • Get the name of the new init pod:

      # kubectl get pods -n tanzusm | grep postgresql-init
       
    • Check the logs:

      # kubectl logs -n tanzusm <new-pod-name>
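
The status check can also be reduced to a single value, which is convenient when repeating it. This is a sketch under the same assumption as before (STATUS is the second column of the `kubectl get jobs` output); `job_status` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `kubectl get jobs` output on stdin and prints the
# STATUS column for the job named in $1; exits non-zero if the job is not listed.
job_status() {
  awk -v name="$1" '$1 == name { print $2; found = 1 } END { exit !found }'
}

# Example use on the Registry VM (commented out; requires kubectl access):
#   kubectl get jobs -n tanzusm | job_status postgresql-init-job
# kubectl can also block until the job completes or the timeout expires:
#   kubectl wait --for=condition=complete job/postgresql-init-job -n tanzusm --timeout=10m
```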

 

Cascade Recovery

As shown in the kubectl get jobs output above, several hook jobs (such as the ensemble, graphql-rest-provider-service, and tas-ingestion-service lemans-hook jobs) were also failing. These depend on the database being initialized. Once the postgresql-init-job succeeds, these dependent jobs should either succeed automatically on their next retry or can be deleted with the command from the first step to force a fresh start.

 

Deleting the job is the cleanest way to force Kubernetes to re-read the latest Secrets and ConfigMaps.