A TKC migrated from vCenter Server 7.x to 8.x (8.0 U3 with VKS <= 3.1.0 embedded) becomes stuck while upgrading the worker nodes on the Cluster.
This can happen to a TKC during a TKr version upgrade even after the migration has completed: no new worker nodes (nodes belonging to node pools) with the updated TKr version are rolled out.
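To observe the symptom, the Cluster API Machines for the affected cluster can be listed in its Supervisor namespace; no node-pool Machines with the updated TKr version appear. This is a sketch using placeholder values for the namespace and cluster name:
kubectl get machines -n <namespace> -l cluster.x-k8s.io/cluster-name=<cluster-name>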
The CAPI controller logs show the following errors for the stale MachineSet associated with the Cluster:
Log file: /var/log/pods/svc-tkg-domain-c####_capi-controller-manager-##########-#####_########-####-####-####-##########/manager/0.log.YYYYMMDD-HHMMSS
YYYY-MM-DDTHH:MM:58.144557507Z stderr F E0921 09:20:58.144509 1 controller.go:329] "Reconciler error" err="failed to retrieve KubeadmConfigTemplate external object \"umbc-development\"/\"umbc-shared-tools-workers-####\": KubeadmConfigTemplate.bootstrap.cluster.x-k8s.io \"umbc-shared-tools-workers-rdqw6\" not found"
controller="machineset"
controllerGroup="cluster.x-k8s.io"
controllerKind="MachineSet"
MachineSet="umbc-development/umbc-shared-tools-default-nodepool-##-##"
namespace="umbc-development"
name="umbc-shared-tools-default-nodepool-##-##" reconcileID="########-####-####-####-############"
The stale MachineSet's infrastructureRef points to a WCPMachineTemplate:
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineSet
metadata:
  annotations:
    machinedeployment.clusters.x-k8s.io/desired-replicas: '4'
    machinedeployment.clusters.x-k8s.io/max-replicas: '5'
    machinedeployment.clusters.x-k8s.io/revision: '22'
  creationTimestamp: 'YYYY-MM-DDT09:52:34Z'
  generation: 4
  labels:
    cluster.x-k8s.io/cluster-name: tkc
    cluster.x-k8s.io/deployment-name: tkc-workers-6v7pn
    machine-template-hash: ####683303-9j7r9
    run.tanzu.vmware.com/node-pool: workers
    run.tanzu.vmware.com/worker-deployment-id: ''
  name: tkc-workers-6v7pn-######
  namespace: some-ns
  ownerReferences:
    - apiVersion: cluster.x-k8s.io/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: MachineDeployment
      name: tkc-workers-6v7pn
      uid: d2ddade8-e6fe-42c9-ac62-0720156cab39
  uid: 440a29ca-c98e-466f-bcd1-0fd926c51f1a
spec:
  clusterName: tkc
  deletePolicy: Random
  replicas: 4
  selector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: tkc
      machine-template-hash: 2270683303-9j7r9
      run.tanzu.vmware.com/node-pool: workers
      run.tanzu.vmware.com/worker-deployment-id: ''
  template:
    metadata:
      labels:
        cluster.x-k8s.io/cluster-name: tkc
        machine-template-hash: 2270683303-9j7r9
        run.tanzu.vmware.com/node-pool: workers
        run.tanzu.vmware.com/worker-deployment-id: ''
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: tkc-workers-4zsxf
          namespace: some-ns
      clusterName: tkc
      infrastructureRef:
        apiVersion: infrastructure.cluster.vmware.com/v1beta1
        kind: WCPMachineTemplate    # <=============== leftover resource from migration
        name: tkc-workers-lx86f
        namespace: some-ns
      version: v1.27.10+vmware.1-fips.1
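A quick way to spot such a leftover is to print the infrastructureRef kind of every MachineSet in the cluster's namespace; any entry whose kind is still WCPMachineTemplate is a migration leftover. This is a sketch using a placeholder namespace:
kubectl get machinesets -n <namespace> -o custom-columns='NAME:.metadata.name,INFRA-KIND:.spec.template.spec.infrastructureRef.kind,INFRA-NAME:.spec.template.spec.infrastructureRef.name'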
vCenter Server 7.x
vCenter Server 8.x
VKS version 3.1.1 or lower.
This issue occurs when a stale MachineSet from the 7.x to 8.x migration is left over after the migration has completed.
The presence of this stale MachineSet prevents the rollout of the new worker nodes (those belonging to the new MachineSet with the upgraded TKr version) from progressing.
Delete the stale MachineSet (the one whose infrastructureRef still points to a WCPMachineTemplate) on the Supervisor cluster:
kubectl delete machineset <name> -n <namespace>
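Once the stale MachineSet is removed, the owning MachineDeployment should reconcile again and the rollout of worker nodes with the upgraded TKr version should resume; progress can be watched with a command such as:
kubectl get machinesets,machines -n <namespace> -w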