Symptoms:
- On a vSphere with Tanzu Guest Cluster, worker nodes are continuously redeployed
- A new MachineSet is repeatedly created under the MachineDeployment associated with one or more NodePools in the Guest Cluster, which triggers new node deployments
- This rollout behavior may cause workloads to fail due to the repeated failover
- From an SSH session on the Supervisor Cluster, the following symptoms are present:
- The Guest Cluster TKC object continues to report READY as True throughout:
kubectl get tkc -A
NAMESPACE        NAME           CONTROL PLANE   WORKER   TKR NAME                              AGE   READY   TKR COMPATIBLE   UPDATES AVAILABLE
test-namespace   test-cluster   1               5        v1.23.8---vmware.2-tkg.2-zshippable   83m   True    True
- When describing the TKC object, the events show repeated PhaseChanged transitions from running to updating and back:
kubectl describe tkc -n test-namespace test-cluster | tail -10
Phase: updating
Total Worker Replicas: 5
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal PhaseChanged 8m (x2 over 10m) vmware-system-tkg/vmware-system-tkg-controller-manager/tanzukubernetescluster-status-controller cluster changes from running phase to updating phase
Normal PhaseChanged 5m (x2 over 10m) vmware-system-tkg/vmware-system-tkg-controller-manager/tanzukubernetescluster-status-controller cluster changes from updating phase to running phase
Normal PhaseChanged 2m vmware-system-tkg/vmware-system-tkg-controller-manager/tanzukubernetescluster-status-controller cluster changes from running phase to updating phase
- Listing the MachineSets shows newly created MachineSets appearing repeatedly for the affected cluster and node pool:
kubectl get machineset -A
NAMESPACE        NAME                                        CLUSTER        REPLICAS   READY   AVAILABLE   AGE     VERSION
test-namespace   test-cluster-node-pool-1-dxx9p-5c7bc9f65    test-cluster   0                              39m     v1.23.8+vmware.2
test-namespace   test-cluster-node-pool-1-dxx9p-6666b54746   test-cluster   1                              98s     v1.23.8+vmware.2
test-namespace   test-cluster-node-pool-1-dxx9p-66699868f9   test-cluster   1          1       1           14m     v1.23.8+vmware.2
test-namespace   test-cluster-node-pool-1-dxx9p-6874966dbd   test-cluster   0          0       0           38m     v1.23.8+vmware.2
test-namespace   test-cluster-node-pool-1-dxx9p-6c87d8d4b    test-cluster   0                              15m     v1.23.8+vmware.2
test-namespace   test-cluster-node-pool-1-dxx9p-746bd5bfb4   test-cluster   0                              8m35s   v1.23.8+vmware.2
test-namespace   test-cluster-node-pool-1-dxx9p-77587446bf   test-cluster   1          1       1           7m16s   v1.23.8+vmware.2
test-namespace   test-cluster-node-pool-1-dxx9p-7b5f868d8f   test-cluster   0                              2m48s   v1.23.8+vmware.2
test-namespace   test-cluster-node-pool-2-lqtpm-57d98cc8d6   test-cluster   3          3       3           79m     v1.23.8+vmware.2
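To make the repeated rollouts easier to spot, the MachineSet listing can be grouped by its parent MachineDeployment. The following is a minimal sketch and not part of the original procedure. The function name count_machinesets is hypothetical, and the sketch assumes that each MachineSet name is the MachineDeployment name followed by a hyphen and a hash suffix, as in the output above.

```shell
# count_machinesets: hypothetical helper (not a kubectl feature).
# Reads `kubectl get machineset -A --no-headers` output on stdin and prints
# "<count> <namespace>/<machinedeployment>", highest count first.
# Assumption: a MachineSet name has the form "<machinedeployment>-<hash>".
count_machinesets() {
  awk 'NF >= 2 {
         md = $2
         sub(/-[^-]+$/, "", md)        # drop the trailing hash suffix
         count[$1 "/" md]++
       }
       END {
         for (key in count) print count[key], key
       }' | sort -k1,1nr -k2,2
}

# Usage on the Supervisor Cluster:
#   kubectl get machineset -A --no-headers | count_machinesets
```

A node pool whose count keeps growing between runs matches the symptom described here. The sketch only summarizes the listing and does not identify the cause.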
- The transitions will be noted in the capi-controller-manager logs. The update event will report repeated creation and rotation of the KubeadmConfigTemplate for the MachineDeployment:
I0427 21:20:29.753835 1 reconcile_state.go:455] controller/topology/cluster "msg"="Rotating KubeadmConfigTemplate/ test-cluster-node-pool-1-bootstrap-hx2nz, new name test-cluster-node-pool-1-bootstrap-q76sh" "machineDeployment name"=" test-cluster-node-pool-1-dxx9p" "machineDeployment topologyName"="node-pool-1" "name"=" test-cluster" "namespace"="test-namespace" "object"=" test-cluster-node-pool-1-bootstrap-hx2nz" "object groupVersion"="bootstrap.cluster.x-k8s.io/v1beta1" "object kind"="KubeadmConfigTemplate" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Cluster"
I0427 21:20:29.753962 1 reconcile_state.go:455] controller/topology/cluster "msg"="Creating KubeadmConfigTemplate/test-cluster-node-pool-1-bootstrap-q76sh" "machineDeployment name"=" test-cluster-node-pool-1-dxx9p" "machineDeployment topologyName"="node-pool-1" "name"=" test-cluster" "namespace"="test-namespace" "object"=" test-cluster-node-pool-1-bootstrap-hx2nz" "object groupVersion"="bootstrap.cluster.x-k8s.io/v1beta1" "object kind"="KubeadmConfigTemplate" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Cluster"
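The rotation messages can be tallied per node pool to see which pool is being rolled. This is a sketch under stated assumptions: the function name count_template_rotations is hypothetical, the log lines are assumed to have the same format as the excerpt above, and the namespace of the capi-controller-manager deployment is left as a placeholder because it is not given in this article.

```shell
# count_template_rotations: hypothetical helper.
# Reads capi-controller-manager log lines on stdin and prints
# "<count> <topologyName>" for each node pool whose KubeadmConfigTemplate
# was rotated. Assumes the log format shown in the excerpt above.
count_template_rotations() {
  grep 'Rotating KubeadmConfigTemplate' \
    | sed -n 's/.*"machineDeployment topologyName"="\([^"]*\)".*/\1/p' \
    | sort | uniq -c | awk '{ print $1, $2 }'
}

# Usage (replace <capi-namespace> with the namespace in your environment):
#   kubectl logs -n <capi-namespace> deployment/capi-controller-manager \
#     | count_template_rotations
```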
- If the environment is using vSphere 8.x and an associated Supervisor Cluster, describing the Cluster object shows a large Generation number. The longer the condition has been present, the larger this number becomes:
kubectl describe cluster -n test-namespace test-cluster | grep Generation
Generation: 68
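Because a single Generation value is hard to judge on its own, it can help to sample it twice and compare. The following is a minimal sketch, assuming the Cluster object exposes the standard .metadata.generation field. The function name generation_delta and the default interval of 300 seconds are hypothetical choices made for illustration.

```shell
# generation_delta: hypothetical helper.
# Samples .metadata.generation of a Cluster object twice and prints how much
# it grew during the interval.
# Arguments: <namespace> <cluster-name> [interval-seconds, default 300]
generation_delta() {
  ns=$1 name=$2 interval=${3:-300}
  first=$(kubectl get cluster -n "$ns" "$name" \
            -o jsonpath='{.metadata.generation}') || return 1
  sleep "$interval"
  second=$(kubectl get cluster -n "$ns" "$name" \
             -o jsonpath='{.metadata.generation}') || return 1
  echo $((second - first))
}

# Usage on the Supervisor Cluster:
#   generation_delta test-namespace test-cluster 300
```

A value that keeps increasing while nobody is editing the cluster specification is consistent with the repeated updates described above.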