Cluster Rollout Stuck After VCF 9.1 Upgrade with VKS 3.6.1/3.6.2 Due to Missing VirtualMachineGroup Creation and Worker Count Mismatch

Article ID: 439373

Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

After upgrading to VCF 9.1 with VKS 3.6.1 or 3.6.2, clusters may become stuck in a rollout state if a rollout is triggered during the upgrade process. This occurs when the initial VirtualMachineGroup (VMG) is not created while the number of worker VSphereMachine objects exceeds the total MachineDeployment replica count. In this condition, CAPV does not provision VirtualMachine objects for new worker nodes until a VMG is present, preventing the rollout from progressing and potentially requiring manual intervention to restore cluster health.
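
To check whether a cluster matches this condition, the two counts can be compared directly. The following is a minimal sketch, not an official VKS tool. The function names are hypothetical, the worker pattern is a name fragment that identifies the worker VSphereMachine objects, and the sketch assumes the MachineDeployments carry the standard Cluster API label cluster.x-k8s.io/cluster-name.

```shell
# Hypothetical helpers (not part of VKS or kubectl) to compare the counts.

# Count VSphereMachine objects whose name matches the worker pattern.
count_worker_vspheremachines() {
  ns="$1"; pattern="$2"
  kubectl get vspheremachine -n "$ns" --no-headers | grep -c -- "$pattern"
}

# Sum spec.replicas across the MachineDeployments of one cluster.
# Assumption: the MachineDeployments are labeled with
# cluster.x-k8s.io/cluster-name=<CLUSTER_NAME>.
sum_md_replicas() {
  ns="$1"; cluster="$2"
  kubectl get machinedeployment -n "$ns" \
    -l "cluster.x-k8s.io/cluster-name=$cluster" \
    -o jsonpath='{.items[*].spec.replicas}' \
    | tr ' ' '\n' | awk 'NF { total += $1 } END { print total + 0 }'
}

# Succeeds (exit 0) when there are more worker VSphereMachines than
# MachineDeployment replicas, which is the mismatch described above.
check_rollout_mismatch() {
  ns="$1"; cluster="$2"; pattern="$3"
  machines=$(count_worker_vspheremachines "$ns" "$pattern")
  replicas=$(sum_md_replicas "$ns" "$cluster")
  echo "worker VSphereMachines: $machines, MachineDeployment replicas: $replicas"
  [ "$machines" -gt "$replicas" ]
}
```

After sourcing these functions, run check_rollout_mismatch <NAMESPACE> <CLUSTER_NAME> "<WORKER_PATTERN>". An exit status of 0 indicates the count mismatch. It does not confirm that the VirtualMachineGroup is missing, so that should be verified separately.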

Environment

VCF 9.1 with VKS 3.6.1 or VKS 3.6.2

Resolution

1. Pause the Cluster to suspend the CAPI controllers and prevent them from reconciling or reacting to changes during this maintenance.

kubectl patch cluster <CLUSTER_NAME> -n <NAMESPACE> --type merge -p '{"spec":{"paused":true}}'

2. List the existing VSphereMachine objects for the worker nodes and note how many exist for the impacted worker pool.

kubectl get vspheremachine -n <NAMESPACE> --no-headers | grep "<WORKER_PATTERN>"

3. Temporarily set the MachineDeployment replica count to match the total number of VSphereMachine objects found in Step 2.

kubectl patch machinedeployment <MD_NAME> -n <NAMESPACE> --type merge -p '{"spec":{"replicas":<REPLICA_COUNT>}}'

4. Wait for the new nodes to be created. Reconciliation can take up to 10 minutes. List the VirtualMachine objects to check progress.

kubectl get vm -n <NAMESPACE>

5. After all the new nodes are created, unpause the cluster.

kubectl patch cluster <CLUSTER_NAME> -n <NAMESPACE> --type merge -p '{"spec":{"paused":false}}'

6. Monitor the machine and cluster status to confirm that all machines transition to a healthy state and the cluster returns to Ready.
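
The monitoring in Step 6 can be scripted as a simple polling loop. This is a sketch under stated assumptions rather than a supported procedure: the function names are hypothetical, a Machine is treated as healthy once its status.phase is Running, and the defaults of 40 attempts at 15-second intervals are chosen only to cover roughly 10 minutes.

```shell
# Hypothetical helpers (not part of VKS or kubectl) for Step 6.

# Print the names of Machines whose phase is not "Running".
# Returns non-zero if the kubectl query itself fails.
non_running_machines() {
  out=$(kubectl get machine -n "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{"\n"}{end}') || return 1
  printf '%s\n' "$out" | awk 'NF && $2 != "Running" { print $1 }'
}

# Poll until every Machine in the namespace reports Running.
# Arguments: namespace, attempts (default 40), interval in seconds (default 15).
# The defaults are assumptions, not values from the product documentation.
wait_for_machines() {
  ns="$1"; attempts="${2:-40}"; interval="${3:-15}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if pending=$(non_running_machines "$ns") && [ -z "$pending" ]; then
      echo "all machines Running"
      return 0
    fi
    echo "still waiting on: $(echo "$pending" | tr '\n' ' ')"
    sleep "$interval"
    i=$((i + 1))
  done
  return 1
}
```

After sourcing these functions, run wait_for_machines <NAMESPACE>. Once it succeeds, kubectl get cluster <CLUSTER_NAME> -n <NAMESPACE> shows the overall cluster status.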