ControlPlane node rollout stuck in Provisioning state on TKGm clusters
Article ID: 374734
Products
VMware Tanzu Kubernetes Grid Management
Issue/Introduction
After an update to a TKGm workload cluster's control plane that requires a rollout operation, users might see the newly deployed Machine objects stuck in the Provisioning state.
The new nodes are deleted and recreated at 20-minute intervals.
Machine, VSphereMachine, and VSphereVM objects for the new control plane nodes can be seen when running the following command:
kubectl get machine,vspheremachine,vspherevm -A | grep <CLUSTER_NAME>
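The phase of each Machine can also be printed directly (an illustrative example; the namespace is a placeholder and names will vary by environment):
kubectl get machines -n <NAMESPACE> -o custom-columns=NAME:.metadata.name,PHASE:.status.phase
The affected control plane Machines stay in the Provisioning phase instead of progressing to Provisioned and Running.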
The workload cluster remains accessible via kubectl and shows all original nodes present and healthy. This problem is not related to a failure on the workload cluster itself.
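If needed, the workload cluster's node health can be confirmed from its own kubeconfig (a sketch assuming the tanzu CLI is available and the default TKGm admin context naming):
tanzu cluster kubeconfig get <CLUSTER_NAME> --admin
kubectl get nodes --context <CLUSTER_NAME>-admin@<CLUSTER_NAME>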
The CAPI controller manager logs report: "Waiting for control plane to pass preflight checks"
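These logs can be pulled from the management cluster (assuming the default TKGm namespace and container names for the CAPI controller):
kubectl logs -n capi-system deployment/capi-controller-manager -c manager | grep -i preflight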
Describing the KCP (KubeadmControlPlane) object that manages the new Machine shows failure conditions such as: "WaitingForBootstrapData @ Machine/<MACHINE_NAME>"
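For example, the KCP object and its conditions can be inspected as follows (the object name and namespace are placeholders):
kubectl get kubeadmcontrolplane -A
kubectl describe kubeadmcontrolplane <KCP_NAME> -n <NAMESPACE>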
When checking the vSphere web client, the VM for the new VSphereVM object does not appear in the inventory tree, and no new VMs are being created in vSphere.
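The same check can be made from the CLI with govc, if it is installed and configured against the vCenter (an illustrative example; <MACHINE_NAME> is a placeholder):
govc find / -type m -name '<MACHINE_NAME>*'
No matching VM is returned for the stuck Machine.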
The CAPV controller manager logs might or might not report the following error: "error fetching compute cluster resource" err="resource pool"
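The CAPV logs can be checked with (assuming the default TKGm namespace and container names for the CAPV controller):
kubectl logs -n capv-system deployment/capv-controller-manager -c manager | grep -i "resource pool"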
NOTE: The CAPV controller might not report this error until its pod is restarted. The missing resource pool is the actual cause of the failure, and the absence of this log message can be misleading.
Environment
TKGm, all versions
Cause
This failure occurs because the resource pool configured during cluster creation no longer exists in vSphere: it has either been manually deleted, or DRS has been disabled and re-enabled on the vSphere cluster, which removes the original resource pool hierarchy without restoring it. CAPV therefore cannot resolve the configured resource pool when cloning the new control plane VMs.
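To confirm this cause, compare the resource pool path referenced by the cluster's vSphere machine template against what currently exists in vSphere (a hedged sketch; the template name, namespace, and resource pool path are placeholders, and govc is assumed to be configured against the vCenter):
kubectl get vspheremachinetemplate <TEMPLATE_NAME> -n <NAMESPACE> -o jsonpath='{.spec.template.spec.resourcePool}'
govc pool.info '<RESOURCE_POOL_PATH>'
If govc reports that the resource pool cannot be found, the configured path no longer resolves and CAPV cannot clone the new control plane VMs.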