TKGM Workload cluster stuck in updating state, new nodes fail to update providerID from vSphere
Article ID: 374815
Products
VMware Tanzu Kubernetes Grid Management
Issue/Introduction
After performing an update to a TKGm workload cluster's control plane that requires a rollout operation, users might see the newly deployed Machine objects stuck in the Provisioning state.
The new nodes will be recreated at 20-minute intervals.
Users will see machine, vspherevm, and vspheremachine objects created for the new ControlPlane nodes when running the following command:
kubectl get machine,vspheremachine,vspherevm -A | grep <CLUSTER_NAME>
The cluster is accessible via kubectl commands and shows all original nodes present and healthy. This problem is not related to failures on the workload cluster itself.
CAPI controller manager logs report: Waiting for control plane to pass preflight checks
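The relevant CAPI controller logs can be reviewed from the management cluster context. The namespace and deployment name below are the TKGm defaults and may differ in some environments:

kubectl logs -n capi-system deployment/capi-controller-manager | grep -i "preflight"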
Describing the KCP (KubeadmControlPlane) object managing this new Machine, users will see failure conditions such as: "WaitingForBootstrapData @ Machine/<MACHINE_NAME>"
Describing the KCP will also show the existing ControlPlane nodes reporting: "machine <MACHINE_NAME> does not have APIServerPodHealthy condition"
This will be reported even though the API server is responding as functional and healthy on all ControlPlane nodes in the cluster.
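The KCP object can be located and described from the management cluster context as follows (the KCP name is illustrative; it normally contains the workload cluster name):

kubectl get kubeadmcontrolplane -A | grep <CLUSTER_NAME>
kubectl describe kubeadmcontrolplane <KCP_NAME> -n <NAMESPACE>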
When checking the vSphere web client, the VM for the new vspherevm object does not appear in the inventory tree, and no new VMs are being created in vSphere.
The vsphere-cloud-controller-manager pod log will report consistent failures like:
WhichVCandDCByNodeID nodeID: <NODE_NAME> Failed to create govmomi client. err: ServerFaultCode: Cannot complete login due to an incorrect user name or password.
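These log entries can be gathered from the workload cluster context. In TKGm the pod runs in the kube-system namespace; the label selector below is the upstream default and may vary by version:

kubectl logs -n kube-system -l k8s-app=vsphere-cloud-controller-manager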
Environment
TKGm all versions
Cause
This is caused by one of the following conditions on the username and password used by the vSphere Cloud Controller Manager pod:
Incorrect Username
Incorrect Password
The user account in vSphere has expired
Resolution
Identify the username and password in use by the vSphere Cloud Controller Manager pod with the following command:
kubectl get secret -n tkg-system <MANAGEMENT_CLUSTER_NAME>-vsphere-cpi-data-values -ojsonpath='{.data.values\.yaml}' | base64 -d
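To narrow the decoded output to the relevant fields, the same command can be filtered; the key names shown are those used in the vsphere-cpi data values and may differ slightly between TKGm versions:

kubectl get secret -n tkg-system <MANAGEMENT_CLUSTER_NAME>-vsphere-cpi-data-values -ojsonpath='{.data.values\.yaml}' | base64 -d | grep -E 'server|username|password'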
Gather the "server", "username", and "password" values from the decoded output, and use them to attempt a login to the vSphere web client.
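As an alternative to the web client, the credentials can be tested from any workstation with the govc CLI installed (govc is not shipped with TKGm; the environment variables below are standard govc settings):

export GOVC_URL=<SERVER>
export GOVC_USERNAME='<USERNAME>'
export GOVC_PASSWORD='<PASSWORD>'
export GOVC_INSECURE=true   # only if the vCenter certificate is not trusted by the workstation
govc about

A successful "govc about" response confirms the credentials are valid; the same "incorrect user name or password" fault indicates the credentials or the account are the problem.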
If the login fails, log in as administrator@vsphere.local (or as another user with permissions to update vCenter SSO user credentials).
Find the user in vCenter -> Menu -> Single Sign On -> Users and Groups.
Change the Domain dropdown to the local SSO domain (default is vsphere.local).
Search for the problem username and select it, then click the Edit button.
Update the Current Password and Password fields with the password gathered from the vsphere-cpi-data-values secret.
This should allow the vSphere Cloud Controller Manager to log in as the expected user and gather workload cluster node health.
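Once the credentials work, the vsphere-cloud-controller-manager will set the providerID on the new nodes at its next retry; restarting its pods can speed this up. The daemonset name below is the TKGm default and may vary by version, and the second command verifies the providerID on each node:

kubectl -n kube-system rollout restart daemonset vsphere-cloud-controller-manager
kubectl get nodes -o custom-columns=NAME:.metadata.name,PROVIDERID:.spec.providerID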