kubectl -n prelude get pods
NAME READY STATUS RESTARTS AGE
tenant-manager-0 0/1 CrashLoopBackOff 30 (##s ago) 11d
tenant-manager-1 0/1 CrashLoopBackOff 30 (##s ago) 11d
tenant-manager-2 0/1 CrashLoopBackOff 30 (##s ago) 11d
kubectl -n prelude describe pod tenant-manager-0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled #m#s (x# over 175m) kubelet Container image "registry.vmsp-platform.svc.cluster.local:5000/images/tenant-manager:9.0.1-0-24965341" already present on machine
Normal Created #m#s (x# over 175m) kubelet Created container: app
Warning BackOff #m#s (x# over 172m) kubelet Back-off restarting failed container app in pod tenant-manager-0_prelude(########-####-####-####-############)
kubectl -n prelude logs tenant-manager-0
{"level": "INFO","message": "Error starting application: Error connecting to the database: jdbc:postgresql://vcfapostgres.prelude.svc.cluster.local:5432/tenantmanager?socketTimeout=90&ssl=verify-full&sslrootcert=/vmsp-platform-trust/bundle.pem","logger": "cell.log","time": "####-##-##T##:##:##.###Z"}
They then check the vcfapostgres-1 pod logs, which show a process stuck waiting on a lock in the provisioning-db database:
kubectl -n prelude logs vcfapostgres-1
########.###### 0 provisioning_db provisioning_db_owner_user ###.###.###.###(#####) ####### 108
LOG: process ####### still waiting for ExclusiveLock on tuple (1,1) of relation 26008 of database 17046 after 1000.063 ms
########.###### 0 provisioning_db provisioning_db_owner_user ###.###.###.###(#####) ####### 109
DETAIL: Process holding the lock: #######. Wait queue: #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######, #######.
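To identify the session holding the lock from inside the database, a generic PostgreSQL diagnostic can be run via psql (a sketch only; the exec target and database user are taken from the log lines above and may differ in your deployment):
kubectl -n prelude exec -it vcfapostgres-1 -- \
  psql -U provisioning_db_owner_user -d provisioning_db -c \
  "SELECT pid, pg_blocking_pids(pid) AS blocked_by, state, left(query, 60) AS query
     FROM pg_stat_activity
    WHERE cardinality(pg_blocking_pids(pid)) > 0;"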
During vCenter service account rotation, the vsphere-csi-controller can cache the old account credentials, which may no longer be valid. This causes problems whenever a service pod needs to attach or detach a PVC in vCenter's Cloud Native Storage system, and it can also cause other problems with the vCenter connectivity required for VMSP cluster management.
Operations such as PVC attach/detach and node resize can fail.
Logs from the vSphere CSI controller can show entries like this:
kubectl logs deployments/vsphere-csi-controller -n kube-system
I1101 06:17:15.665182 1 controller.go:146] Failed to reconcile volume attachments: failed to list volumes: failed to list volumes:
This issue is fixed in VCF Automation 9.0.2.
Workaround:
To work around this issue in VCF Automation 9.0.0 and 9.0.1, use the following process:
Make sure that the credentials the cluster is using are valid.
Get the vcenter-main-secret name and extract the username:
kubectl get secrets -A | grep vcenter-main-secret
vmsp-platform management-vcenter-main-secret Opaque # #d##h
kubectl get secret -n vmsp-platform <secret name> -oyaml
Copy the base64-encoded username (data.vCenterUsername) and decode it:
echo "############################################" | base64 -d
Output:
[email protected]
Copy the base64-encoded password (data.vCenterPassword) and decode it as well:
echo "############################" | base64 -d && echo
Output:
####################
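Alternatively, both fields can be decoded directly from the secret without copying the base64 strings by hand, using the secret name from the listing above:
kubectl get secret -n vmsp-platform management-vcenter-main-secret -o jsonpath='{.data.vCenterUsername}' | base64 -d && echo
kubectl get secret -n vmsp-platform management-vcenter-main-secret -o jsonpath='{.data.vCenterPassword}' | base64 -d && echo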
Check in vCenter that the account found in the previous step exists, then log in to the vSphere Client with the service account to confirm that the credentials are correct and valid.
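If UI access is inconvenient, the credentials can also be validated non-interactively against the vCenter REST API session endpoint (a sketch; substitute your vCenter FQDN and the decoded credentials; a successful login returns an HTTP 201 with a session token):
curl -sk -u '<username>:<password>' -X POST https://<vcenter-fqdn>/api/session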
If the account is missing or expired, remediate it using Appendix 6 in the KB article:
VCF Services Platform Cluster Health Checks
Restart the vSphere CSI controller deployment
kubectl rollout restart deployments/vsphere-csi-controller -n kube-system
Output:
deployment.apps/vsphere-csi-controller restarted
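Wait for the restart to finish rolling out before proceeding:
kubectl rollout status deployments/vsphere-csi-controller -n kube-system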
Restart the nodes one by one from vCenter.
It is best to restart all of the nodes to ensure that every system that was down comes back up properly.
Open SSH sessions to each of the VCF Automation nodes and run the following:
export KUBECONFIG=/etc/kubernetes/admin.conf
kubectl get nodes -owide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
management-#####-##### Ready control-plane #d#h v1.34.1+vmware.3 ###.###.###.### ###.###.###.### VMware Photon OS/Linux 6.1.158-2.ph5 containerd://2.1.4+vmware.8-fips
For each of the VCF Automation nodes, log in via SSH and issue the reboot command:
sudo reboot
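Before rebooting the next node, wait for the current one to rejoin the cluster (run this from a node that is still up; the node name is a placeholder, use the names from the kubectl get nodes output):
kubectl wait --for=condition=Ready node/<node-name> --timeout=15m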
Once the appliances come back up in vCenter, the status of the VCF Automation nodes can be monitored with:
export KUBECONFIG=/etc/kubernetes/admin.conf
kubectl get nodes -owide --watch
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
management-#####-##### NotReady control-plane #d#h v1.34.1+vmware.3 ###.###.###.### ###.###.###.### VMware Photon OS/Linux 6.1.158-2.ph5 containerd://2.1.4+vmware.8-fips
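Once all nodes report Ready, verify that the tenant-manager pods from the original symptom recover:
kubectl -n prelude get pods -w
or wait for them explicitly:
kubectl -n prelude wait --for=condition=Ready pod/tenant-manager-0 pod/tenant-manager-1 pod/tenant-manager-2 --timeout=10m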