A TKC migrated from vCenter Server 7.x to 8.x (8.0 U3 with VKS <= 3.1.0 embedded) becomes stuck while upgrading the worker nodes on the Cluster.
This can happen to a TKC during a TKr version upgrade even after the migration has completed: no new worker nodes (nodes belonging to node pools) with the updated TKr version are rolled out.
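To observe the symptom, the Cluster API Machines for the affected cluster can be listed in its Supervisor namespace; no node-pool Machines with the updated TKr version appear. This is a sketch using placeholder values for the namespace and cluster name:
kubectl get machines -n <namespace> -l cluster.x-k8s.io/cluster-name=<cluster-name>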
The CAPI controller logs show the following errors for the stale MachineSet associated with the Cluster:
Log file: /var/log/pods/svc-tkg-domain-c####_capi-controller-manager-##########-#####_########-####-####-####-##########/manager/0.log.YYYYMMDD-HHMMSS
YYYY-MM-DDTHH:MM:58.144557507Z stderr F E0921 09:20:58.144509 1 controller.go:329] "Reconciler error" err="failed to retrieve KubeadmConfigTemplate external object \"umbc-development\"/\"umbc-shared-tools-workers-####\": KubeadmConfigTemplate.bootstrap.cluster.x-k8s.io \"umbc-shared-tools-workers-rdqw6\" not found"
controller="machineset"
controllerGroup="cluster.x-k8s.io"
controllerKind="MachineSet"
MachineSet="umbc-development/umbc-shared-tools-default-nodepool-##-##"
namespace="umbc-development"
name="umbc-shared-tools-default-nodepool-##-##" reconcileID="########-####-####-####-############"
The stale MachineSet's infrastructureRef points to a WCPMachineTemplate:
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineSet
metadata:
  annotations:
    machinedeployment.clusters.x-k8s.io/desired-replicas: '4'
    machinedeployment.clusters.x-k8s.io/max-replicas: '5'
    machinedeployment.clusters.x-k8s.io/revision: '22'
  creationTimestamp: 'YYYY-MM-DDT09:52:34Z'
  generation: 4
  labels:
    cluster.x-k8s.io/cluster-name: tkc
    cluster.x-k8s.io/deployment-name: tkc-workers-6v7pn
    machine-template-hash: ####683303-9j7r9
    run.tanzu.vmware.com/node-pool: workers
    run.tanzu.vmware.com/worker-deployment-id: ''
  name: tkc-workers-6v7pn-######
  namespace: some-ns
  ownerReferences:
    - apiVersion: cluster.x-k8s.io/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: MachineDeployment
      name: tkc-workers-6v7pn
      uid: d2ddade8-e6fe-42c9-ac62-0720156cab39
  uid: 440a29ca-c98e-466f-bcd1-0fd926c51f1a
spec:
  clusterName: tkc
  deletePolicy: Random
  replicas: 4
  selector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: tkc
      machine-template-hash: 2270683303-9j7r9
      run.tanzu.vmware.com/node-pool: workers
      run.tanzu.vmware.com/worker-deployment-id: ''
  template:
    metadata:
      labels:
        cluster.x-k8s.io/cluster-name: tkc
        machine-template-hash: 2270683303-9j7r9
        run.tanzu.vmware.com/node-pool: workers
        run.tanzu.vmware.com/worker-deployment-id: ''
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: tkc-workers-4zsxf
          namespace: some-ns
      clusterName: tkc
      infrastructureRef:
        apiVersion: infrastructure.cluster.vmware.com/v1beta1
        kind: WCPMachineTemplate    # <=============== leftover resource from migration
        name: tkc-workers-lx86f
        namespace: some-ns
      version: v1.27.10+vmware.1-fips.1
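A quick way to spot such a leftover is to print the infrastructureRef kind of every MachineSet in the cluster's namespace; any entry whose kind is still WCPMachineTemplate is a migration leftover. This is a sketch using a placeholder namespace:
kubectl get machinesets -n <namespace> -o custom-columns='NAME:.metadata.name,INFRA-KIND:.spec.template.spec.infrastructureRef.kind,INFRA-NAME:.spec.template.spec.infrastructureRef.name'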
vCenter Server 7.x
vCenter Server 8.x
VKS version 3.1.1 or lower.
This issue occurs when a stale MachineSet from the 7.x to 8.x migration is left over after the migration has completed.
The presence of this stale MachineSet prevents the rollout of the new worker nodes (those belonging to the new MachineSet with the upgraded TKr version) from progressing.
Delete the stale MachineSet (the one whose infrastructureRef still points to a WCPMachineTemplate) on the Supervisor cluster:
kubectl delete machineset <name> -n <namespace>
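Once the stale MachineSet is removed, the owning MachineDeployment should reconcile again and the rollout of worker nodes with the upgraded TKr version should resume; progress can be watched with a command such as:
kubectl get machinesets,machines -n <namespace> -w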