ControlPlane node rollout stuck in Provisioning state on TKGm clusters
Article ID: 374734
Products
VMware Tanzu Kubernetes Grid Management
Issue/Introduction
After an update to a TKGm workload cluster's control plane that requires a rollout operation, users might see the newly deployed Machine objects stuck in the Provisioning state.
The new nodes are deleted and recreated at 20-minute intervals.
Machine, VSphereMachine, and VSphereVM objects for the new control plane nodes can be seen when running the following command:
kubectl get machine,vspheremachine,vspherevm -A | grep <CLUSTER_NAME>
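The phase of each Machine can also be printed directly (an illustrative example; the namespace is a placeholder and names will vary by environment):
kubectl get machines -n <NAMESPACE> -o custom-columns=NAME:.metadata.name,PHASE:.status.phase
The affected control plane Machines stay in the Provisioning phase instead of progressing to Provisioned and Running.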
The workload cluster remains accessible via kubectl and shows all original nodes present and healthy. This problem is not related to a failure on the workload cluster itself.
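If needed, the workload cluster's node health can be confirmed from its own kubeconfig (a sketch assuming the tanzu CLI is available and the default TKGm admin context naming):
tanzu cluster kubeconfig get <CLUSTER_NAME> --admin
kubectl get nodes --context <CLUSTER_NAME>-admin@<CLUSTER_NAME>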
The CAPI controller manager logs report: "Waiting for control plane to pass preflight checks"
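These logs can be pulled from the management cluster (assuming the default TKGm namespace and container names for the CAPI controller):
kubectl logs -n capi-system deployment/capi-controller-manager -c manager | grep -i preflight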
Describing the KCP (KubeadmControlPlane) object that manages the new Machine shows failure conditions such as: "WaitingForBootstrapData @ Machine/<MACHINE_NAME>"
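For example, the KCP object and its conditions can be inspected as follows (the object name and namespace are placeholders):
kubectl get kubeadmcontrolplane -A
kubectl describe kubeadmcontrolplane <KCP_NAME> -n <NAMESPACE>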
When checking the vSphere web client, the VM for the new VSphereVM object does not appear in the inventory tree, and no new VMs are being created in vSphere.
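The same check can be made from the CLI with govc, if it is installed and configured against the vCenter (an illustrative example; <MACHINE_NAME> is a placeholder):
govc find / -type m -name '<MACHINE_NAME>*'
No matching VM is returned for the stuck Machine.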
The CAPV controller manager logs might or might not report the following error: "error fetching compute cluster resource" err="resource pool"
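The CAPV logs can be checked with (assuming the default TKGm namespace and container names for the CAPV controller):
kubectl logs -n capv-system deployment/capv-controller-manager -c manager | grep -i "resource pool"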
NOTE: The CAPV controller might not report this error until its pod is restarted. The missing resource pool is the actual cause of the failure, and the absence of this log message can be misleading.
Environment
TKGm, all versions
Cause
This failure occurs because the resource pool configured during cluster creation no longer exists in vSphere: it has either been manually deleted, or DRS has been disabled and re-enabled on the vSphere cluster, which removes the original resource pool hierarchy without restoring it. CAPV therefore cannot resolve the configured resource pool when cloning the new control plane VMs.
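To confirm this cause, compare the resource pool path referenced by the cluster's vSphere machine template against what currently exists in vSphere (a hedged sketch; the template name, namespace, and resource pool path are placeholders, and govc is assumed to be configured against the vCenter):
kubectl get vspheremachinetemplate <TEMPLATE_NAME> -n <NAMESPACE> -o jsonpath='{.spec.template.spec.resourcePool}'
govc pool.info '<RESOURCE_POOL_PATH>'
If govc reports that the resource pool cannot be found, the configured path no longer resolves and CAPV cannot clone the new control plane VMs.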