1. The Guest Cluster upgrade from version 1.23.8 to 1.24.11 is stuck because one of the worker nodes keeps rebuilding itself every 12 minutes on the existing TKR version.
2. In a cluster with a single Control Plane node, the Control Plane upgrade is stuck because it waits for the worker nodes to be upgraded first. In a cluster with 3 Control Plane nodes, the Control Plane nodes are upgraded and running v1.24.11+vmware.1-fips.1 while the MachineDeployment (MD) is still trying to roll out workers with v1.23.8+vmware.3.
3. In the CAPI-controller-manager logs, you see the MachineHealthCheck controller deleting the affected Machine because it fails the health check:
machinehealthcheck_controller.go:434] "Target has failed health check, marking for remediation" controller="machinehealthcheck" controllerGroup="cluster.x-k8s.io" controllerKind="MachineHealthCheck" reason="UnhealthyNode" message="Condition Ready on node is reporting status False for more than 12m0s"
4. When logged in to the affected worker node, you see that the "kube-proxy" and "antrea" pods are stuck in the "ImagePullBackOff" state.
5. Both the Antrea and kube-proxy pods are trying to pull the image for the destination TKR version even though the kubelet is still on the old TKR version. In the example below, the node has the Antrea image v1.5.3 available locally, but the CNI plugin is still trying to pull v1.7.2 (see the verification commands after the log excerpts below):
E1016 remote_image.go:238] "PullImage from image service failed" err="rpc error: code = NotFound desc = failed to pull and unpack image \"localhost:5000/vmware.io/antrea/antrea:v1.7.2_vmware.3\": failed to resolve reference \"localhost:5000/vmware.io/antrea/antrea:v1.7.2_vmware.3\": localhost:5000/vmware.io/antrea/antrea:v1.7.2_vmware.3: not found" image="localhost:5000/vmware.io/antrea/antrea:v1.7.2_vmware.3"
FATA[0000] pulling image: rpc error: code = NotFound desc = failed to pull and unpack image "localhost:5000/vmware.io/antrea/antrea:v1.7.2_vmware.3": failed to resolve reference "localhost:5000/vmware.io/antrea/antrea:v1.7.2_vmware.3": localhost:5000/vmware.io/antrea/antrea:v1.7.2_vmware.3: not found
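To confirm the version mismatch and the images actually present on the node, commands along the following lines can be used. These are illustrative only: the Supervisor namespace, object names, and the local registry prefix (localhost:5000) are placeholders, and the VERSION columns assume the Cluster API printer columns available in recent releases.

# From the Supervisor cluster, compare the Control Plane and MachineDeployment versions:
kubectl -n <namespace> get kubeadmcontrolplane,machinedeployment

# On the affected worker node, list the Antrea images that are available locally:
crictl images | grep antrea

# Against the guest cluster, check which pods are failing to pull images:
kubectl -n kube-system get pods -o wide | grep -E 'antrea|kube-proxy'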
vSphere with Tanzu
Tanzu Kubernetes Cluster
This happens when a guest cluster upgrade is triggered on top of an unfinished cluster upgrade or migration, which causes an unexpected CNI upgrade; the affected nodes then go into a "NotReady" state.
The fix is to manually update the MachineDeployment so that the upgrade can continue. Ensure that the Control Plane nodes are healthy before proceeding; an example check is shown below.
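A quick way to confirm Control Plane health before editing anything (illustrative commands; the namespace is a placeholder, and on older TKR versions the node label may still be node-role.kubernetes.io/master):

# From the Supervisor cluster, the KubeadmControlPlane should report all replicas ready on the target version:
kubectl -n <namespace> get kubeadmcontrolplane

# From the guest cluster, all Control Plane nodes should be Ready:
kubectl get nodes -l node-role.kubernetes.io/control-plane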
1. Manually update the MachineDeployment TKR version from v1.23.8+vmware.3 to v1.24.11+vmware.1-fips.1 by following the steps below.
A. Run kubectl -n <namespace> edit MachineDeployment <name-of-the-machine-deployment>
B. Change the version field under spec.template.spec to "v1.24.11+vmware.1-fips.1". It appears just below the "Infrastructure Ref:" section.
For example,
Infrastructure Ref:
API Version: vmware.infrastructure.cluster.x-k8s.io/v1beta1
Kind: VSphereMachineTemplate
Name: <machine-deployment-name>
Namespace: <namespace-name>
Version: v1.23.8+vmware.3 <--------- (change to v1.24.11+vmware.1-fips.1)
C. Save the Machine Deployment.
This should cause the worker nodes to re-create themselves on the new 1.24.11 version. Example commands for making the same change non-interactively and for watching the rollout follow.
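If you prefer a non-interactive change, a merge patch along these lines is equivalent (a sketch, assuming your MachineDeployment accepts a merge patch on spec.template.spec.version; the namespace and name are placeholders):

kubectl -n <namespace> patch machinedeployment <name-of-the-machine-deployment> \
  --type merge -p '{"spec":{"template":{"spec":{"version":"v1.24.11+vmware.1-fips.1"}}}}'

# Watch the new worker Machines roll out and replace the old ones:
kubectl -n <namespace> get machines -w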
2. If any worker nodes fail to roll over to the new version, check the CAPI logs to see whether there is a node drain issue. If so, either drain the nodes manually so the rollout can proceed, or apply the annotation "machine.cluster.x-k8s.io/exclude-node-draining" (https://cluster-api.sigs.k8s.io/reference/labels_and_annotations) to the affected Machine to continue the worker node rollout. Example commands are shown below.
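Illustrative commands for either approach (node, Machine, and namespace names are placeholders; older kubectl versions use --delete-local-data instead of --delete-emptydir-data):

# Drain the stuck node manually (run against the guest cluster):
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Or skip draining for the stuck Machine (run against the Supervisor cluster namespace):
kubectl -n <namespace> annotate machine <machine-name> machine.cluster.x-k8s.io/exclude-node-draining=""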