kubectl -n prelude get pods
NAME READY STATUS RESTARTS AGE
tenant-manager-0 0/1 CrashLoopBackOff 30 (##s ago) 11d
tenant-manager-1 0/1 CrashLoopBackOff 30 (##s ago) 11d
tenant-manager-2 0/1 CrashLoopBackOff 30 (##s ago) 11d
kubectl -n prelude describe pod tenant-manager-0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled #m#s (x# over 175m) kubelet Container image "registry.vmsp-platform.svc.cluster.local:5000/images/tenant-manager:9.0.1-0-24965341" already present on machine
Normal Created #m#s (x# over 175m) kubelet Created container: app
Warning BackOff #m#s (x# over 172m) kubelet Back-off restarting failed container app in pod tenant-manager-0_prelude(########-####-####-####-############)
kubectl -n prelude logs tenant-manager-0
{"level": "INFO","message": "Error starting application: Error connecting to the database: jdbc:postgresql://vcfapostgres.prelude.svc.cluster.local:5432/tenantmanager?socketTimeout=90&ssl=verify-full&sslrootcert=/vmsp-platform-trust/bundle.pem","logger": "cell.log","time": "####-##-##T##:##:##.###Z"}
They then check the vcfapostgres-1 pod logs, which show a process stuck waiting on a lock in the provisioning-db database:
kubectl -n prelude logs vcfapostgres-1
########.###### 0 provisioning_db provisioning_db_owner_user ###.###.###.###(#####) ####### 108
LOG: process ####### still waiting for ExclusiveLock on tuple (1,1) of relation 26008 of database 17046 after 1000.063 ms
########.###### 0 provisioning_db provisioning_db_owner_user ###.###.###.###(#####) ####### 109
DETAIL: Process holding the lock: #######. Wait queue: #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######.
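To identify the session holding the lock from inside the database, a generic PostgreSQL diagnostic can be run via psql (a sketch only; the exec target and database user are taken from the log lines above and may differ in your deployment):
kubectl -n prelude exec -it vcfapostgres-1 -- \
  psql -U provisioning_db_owner_user -d provisioning_db -c \
  "SELECT pid, pg_blocking_pids(pid) AS blocked_by, state, left(query, 60) AS query
     FROM pg_stat_activity
    WHERE cardinality(pg_blocking_pids(pid)) > 0;"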
During vCenter service account rotation, the vsphere-csi-controller can cache the old account credentials, which may no longer be valid. This causes problems whenever a service pod needs to attach or detach a PVC in vCenter's Cloud Native Storage system, and it can also cause other problems with the vCenter connectivity required for VMSP cluster management.
Operations such as PVC attach/detach and node resize can fail.
Logs from the vSphere CSI controller can show entries like this:
kubectl logs deployments/vsphere-csi-controller -n kube-system
I1101 06:17:15.665182 1 controller.go:146] Failed to reconcile volume attachments: failed to list volumes: failed to list volumes:
This issue is fixed in VCF Automation 9.0.2.
Workaround:
To work around this issue in VCF Automation 9.0.0 and 9.0.1, use the following process:
Make sure that the credentials the cluster is using are valid.
Get the vcenter-main-secret name and extract the username:
kubectl get secrets -A | grep vcenter-main-secret
vmsp-platform management-vcenter-main-secret Opaque # #d##h
kubectl get secret -n vmsp-platform <secret name> -oyaml
Copy the base64-encoded username (data.vCenterUsername) and decode it:
echo "############################################" | base64 -d
Output:
[email protected]
Copy the base64-encoded password (data.vCenterPassword) and decode it as well:
echo "############################" | base64 -d && echo
Output:
####################
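Alternatively, both fields can be decoded directly from the secret without copying the base64 strings by hand, using the secret name from the listing above:
kubectl get secret -n vmsp-platform management-vcenter-main-secret -o jsonpath='{.data.vCenterUsername}' | base64 -d && echo
kubectl get secret -n vmsp-platform management-vcenter-main-secret -o jsonpath='{.data.vCenterPassword}' | base64 -d && echo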
Check in vCenter that the account found in the previous step exists, then log in to the vSphere Client with the service account to confirm that the credentials are correct and valid.
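If UI access is inconvenient, the credentials can also be validated non-interactively against the vCenter REST API session endpoint (a sketch; substitute your vCenter FQDN and the decoded credentials; a successful login returns an HTTP 201 with a session token):
curl -sk -u '<username>:<password>' -X POST https://<vcenter-fqdn>/api/session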
If the account is missing or expired, remediate it using Appendix 6 in the KB article:
VCF Services Platform Cluster Health Checks
Restart the vSphere CSI controller deployment
kubectl rollout restart deployments/vsphere-csi-controller -n kube-system
Output:
deployment.apps/vsphere-csi-controller restarted
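Wait for the restart to finish rolling out before proceeding:
kubectl rollout status deployments/vsphere-csi-controller -n kube-system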
Restart the nodes one by one from vCenter.
It is best to restart all of the nodes to ensure that every system that was down comes back up properly.
Open SSH sessions to each of the VCF Automation nodes and run the following:
export KUBECONFIG=/etc/kubernetes/admin.conf
kubectl get nodes -owide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
management-#####-##### Ready control-plane #d#h v1.34.1+vmware.3 ###.###.###.### ###.###.###.### VMware Photon OS/Linux 6.1.158-2.ph5 containerd://2.1.4+vmware.8-fips
For each of the VCF Automation nodes, log in via SSH and issue the reboot command:
sudo reboot
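Before rebooting the next node, wait for the current one to rejoin the cluster (run this from a node that is still up; the node name is a placeholder, use the names from the kubectl get nodes output):
kubectl wait --for=condition=Ready node/<node-name> --timeout=15m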
Once the appliances come back up in vCenter, the status of the VCF Automation nodes can be monitored with:
export KUBECONFIG=/etc/kubernetes/admin.conf
kubectl get nodes -owide --watch
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
management-#####-##### NotReady control-plane #d#h v1.34.1+vmware.3 ###.###.###.### ###.###.###.### VMware Photon OS/Linux 6.1.158-2.ph5 containerd://2.1.4+vmware.8-fips
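Once all nodes report Ready, verify that the tenant-manager pods from the original symptom recover:
kubectl -n prelude get pods -w
or wait for them explicitly:
kubectl -n prelude wait --for=condition=Ready pod/tenant-manager-0 pod/tenant-manager-1 pod/tenant-manager-2 --timeout=10m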