vSphere Kubernetes Service (VKS) Cluster nodes are stuck in NotReady State

Products

VMware vCenter Server

VMware vSphere Kubernetes Service

Issue/Introduction

The vSphere Kubernetes Service (VKS) 3.6 release introduced the capability to propagate certain node configurations in-place without triggering a node rollout for the Cluster. The operations that trigger an in-place update include associating a new container registry with the Supervisor, creating a ClusterDomainResolutionEntry custom resource to ensure that a DNS name is resolvable from the Cluster nodes, and updating the trusted certificate configuration for the Cluster via Cluster variables.

Depending on the trigger that causes the in-place update, this issue can be seen on a single cluster or on all VKS clusters managed by the Supervisor. For example, when a new container registry is associated with the Supervisor, the trust configuration is pushed to all VKS Clusters managed by the Supervisor, whereas updating the Cluster's trust configuration via the osConfiguration variable limits the update to a single cluster.

Any of these operations may cause some of the nodes in the Cluster to move to a NotReady state.

Confirm whether a Cluster is showing the symptom by checking the following:

Confirm the presence of clusterregistryconfig, clusterdomainresolutionentry, or registryconfig objects in the Supervisor API server.
If none are present, confirm whether the osConfiguration variable is set on the Cluster and whether its trust field was updated before the Cluster entered a NotReady state.
Check the state of all the Clusters belonging to a namespace, or across all namespaces, using the kubectl get cluster command. The AVAILABLE column of the output shows False for any affected cluster.
For any such cluster, compare the CP DESIRED/CP AVAILABLE and W DESIRED/W AVAILABLE columns. The number of desired replicas differs from the number of available replicas.
Check the state of the nodes for the Cluster using the kubeconfig for the Cluster. Any affected node has NotReady in its STATUS column.
Describe each affected node and check its Ready condition. The REASON column shows KubeletNotReady, with a message stating that the container runtime is down and PLEG (Pod Lifecycle Event Generator) is not healthy. PLEG is a module within the kubelet (the agent that runs on every node) responsible for bridging the gap between the container runtime and the Kubernetes API. This applies to both control plane and worker nodes.
Specifically for NotReady control plane nodes, if the Ready condition message does not mention the PLEG text described in the previous step, check the pods running on the node using the following command.
Get pods running on a node:
kubectl get pods -A --field-selector spec.nodeName=<name-of-not-ready-node>
Identify any pods in an error state. Check the logs of those pods for the message "error bind: address already in use".
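
The node checks above can be narrowed down with a small filter. The following is a minimal sketch, not part of the product: not_ready_nodes is a hypothetical helper name, and it assumes the default kubectl get nodes column layout in which STATUS is the second column.

```shell
# Sketch: print the names of nodes whose STATUS column contains NotReady.
# Input: the output of `kubectl get nodes --no-headers` on stdin.
# Matching on "NotReady" also catches "NotReady,SchedulingDisabled".
not_ready_nodes() {
  awk '$2 ~ /NotReady/ { print $1 }'
}

# Usage (assumes KUBECONFIG points at the affected VKS cluster):
#   kubectl get nodes --no-headers | not_ready_nodes
#
# Each name printed can then be passed to:
#   kubectl describe node <name-of-not-ready-node>
#   kubectl get pods -A --field-selector spec.nodeName=<name-of-not-ready-node>
```

The filter only reads text from stdin, so it can be tried against saved command output before running it against a live cluster.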

Environment

vCenter version: 9.1.0

VKS versions: 3.6.0 and 3.6.1

Cause

When the node configuration is propagated in-place (without triggering a rollout), the configuration of the node is hot-replaced by a pod running on the node. This causes the containerd systemd service on the node to restart. If containerd restarts without registering the completion of this pod, the pod ends up being rescheduled in a loop, which can cause the service to restart frequently.
As a side effect, the containerd process might lose track of the running containers which could cause those containers to be orphaned.

This can prevent containerd from starting a replacement container, because the port is already in use by the orphaned container.
Another possible side effect is that the loop causes containerd to restart multiple times within a short period, which hits the systemd start limits on the node and leaves the containerd service unmanaged.
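
To check whether a node has reached the start-limit condition, the systemd result recorded for the unit can be inspected. The sketch below is illustrative only: start_limit_hit is a hypothetical helper name, and it assumes that the unit is named containerd and that the condition surfaces as Result=start-limit-hit in the systemctl show output.

```shell
# Sketch: read `systemctl show` key=value output on stdin and succeed only
# if systemd recorded that the unit's start limit was hit.
start_limit_hit() {
  grep -qx 'Result=start-limit-hit'
}

# Usage (on the node; the unit name containerd is taken from this article):
#   systemctl show containerd -p ActiveState -p Result -p NRestarts
#   systemctl show containerd -p Result | start_limit_hit \
#     && echo "containerd hit the systemd start limit"
```

A Result value other than start-limit-hit does not rule out the issue; it only means that this particular side effect was not recorded for the unit.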

Resolution

VKS Clusters have automatic remediation set up via machine health checks. Some instances of the issue might be resolved automatically by these remediations. For Clusters whose etcd quorum is broken (more than one failed control plane node), automatic remediation is not attempted, in order to maintain the integrity of the cluster's etcd. Similarly, for Cluster nodes whose containerd service is not responding, an attempted remediation is blocked because the node drain fails. A Machine object that corresponds to a node displaying any of these symptoms and that has been stuck in the Deleting state for more than an hour is an example of a remediation blocked by an unresponsive container runtime.

Node with KubeletNotReady condition due to unhealthy PLEG

Since the container runtime of the node is unresponsive, manual intervention is needed either to move the node back to the Ready state or to unblock an in-progress automatic remediation.

SSH onto the node and confirm the state of the containerd service using systemctl status containerd.
Reset the failed state using the command sudo systemctl reset-failed containerd.
Start the service again using the command sudo systemctl start containerd.
Confirm that the service started successfully using systemctl status containerd.
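
The steps above can be collected into a short script, sketched below. The run wrapper and the DRY_RUN switch are hypothetical additions for this example so that the sequence can be previewed without changing the node; the systemctl commands are the ones listed in the steps, with --no-pager added so that the status output does not wait for input.

```shell
# Sketch: print the command when DRY_RUN=1 (hypothetical switch), otherwise run it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Sketch of the recovery sequence for an unresponsive containerd service.
recover_containerd() {
  run systemctl status containerd --no-pager   # state before recovery
  run sudo systemctl reset-failed containerd   # clear the failed state
  run sudo systemctl start containerd          # start the service again
  run systemctl status containerd --no-pager   # confirm the service is running
}

# Usage (on the node):
#   DRY_RUN=1 recover_containerd   # preview the commands
#   recover_containerd             # run them
```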

Node with a failing Pod due to bind address already in use

Since the pod cannot be started on the node due to an orphaned process already running, the process needs to be identified and killed manually to ensure containerd can successfully restart the pod.

SSH onto the node and identify the failing pod using the crictl ps -a command.
Identify the port in use by checking the pod logs using crictl logs <container-id>.
Identify the process using the port by running sudo ss -tulpn | grep :<port-number>.
Confirm the details of the process from its PID using ps -fp <PID>, and match it against the failed pod.
Clean up the process using sudo kill -9 <PID>. (containerd should eventually be able to replace the pod with a healthy instance.)
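
The PID lookup in the steps above can be sketched as a filter over the ss output. This is a minimal illustration under stated assumptions: pids_on_port is a hypothetical helper name, and it assumes the usual ss -tulpn layout in which the fifth column is the local address and the process column contains pid=<PID>.

```shell
# Sketch: print the PIDs of processes bound to the given local port.
# Input: the output of `sudo ss -tulpn` on stdin.
pids_on_port() {
  port="$1"
  # Keep only rows whose local address (column 5) ends with :<port>,
  # then pull the numeric PID out of the users:(("name",pid=N,fd=M)) field.
  awk -v port="$port" '$5 ~ (":" port "$") { print }' |
    grep -o 'pid=[0-9][0-9]*' | cut -d= -f2 | sort -u
}

# Usage (on the node; 6443 is the kube-apiserver port):
#   sudo ss -tulpn | pids_on_port 6443
#   ps -fp <PID>    # confirm the process matches the failed pod before killing it
```

The sketch deliberately stops at identifying the PID. Matching the process against the failed pod remains a manual check before any kill command is issued.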

The table below shows the ports used by the system pods on a VKS cluster.

Pod Name	Port(s) at Risk
kube-apiserver	6443
etcd	2379, 2380, 2381
kube-scheduler	10259
kube-controller-manager	10257
antrea-agent	10350
antrea-controller	10349
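
When a port number is found in the pod logs, the table above can be used to identify which system pod normally owns it. The lookup below is a small sketch that encodes only the rows of that table; system_pod_for_port is a hypothetical helper name.

```shell
# Sketch: map a port from the table above to the system pod that uses it.
# Returns non-zero for ports that are not listed in the table.
system_pod_for_port() {
  case "$1" in
    6443)           echo "kube-apiserver" ;;
    2379|2380|2381) echo "etcd" ;;
    10259)          echo "kube-scheduler" ;;
    10257)          echo "kube-controller-manager" ;;
    10350)          echo "antrea-agent" ;;
    10349)          echo "antrea-controller" ;;
    *)              return 1 ;;
  esac
}

# Usage:
#   system_pod_for_port 2380 || echo "port not used by a listed system pod"
```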