Control Plane Nodes Report "SchedulingDisabled" and "EvictionThresholdMet" Due to Memory Exhaustion

Products

VMware vSphere Kubernetes Service

Issue/Introduction

When cluster nodes are queried with the command "kubectl get nodes", the affected control plane node shows a status of Ready,SchedulingDisabled and a role of <none>:

NAME STATUS ROLES AGE VERSION
[Control_Plane_Node_1] Ready control-plane 2d15h v1.32.7+vmware.3-fips
[Control_Plane_Node_2] Ready,SchedulingDisabled <none> 26h v1.32.7+vmware.3-fips

Describing the affected node with the command "kubectl describe node [Control_Plane_Node_2] -n [Namespace]" shows an EvictionThresholdMet warning, which indicates memory exhaustion:

Events:    
Type       Reason                 Age                        From      Message
-----      --------               -----                      -----     --------
Warning    EvictionThresholdMet   4m18s (x1133 over 4h27m)   kubelet   Attempting to reclaim memory
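
The two checks above can be scripted as a rough sketch. The helper name list_cordoned_nodes is hypothetical and not part of any VMware tooling. It assumes the default column layout of "kubectl get nodes", where the second column is STATUS. The commented MemoryPressure query uses standard kubectl jsonpath filtering; a node under memory eviction typically reports this condition as True, but treat that as an expectation to verify in your environment.

```shell
# Hypothetical helper: print the names of nodes whose STATUS column
# contains "SchedulingDisabled". Reads `kubectl get nodes` output on stdin
# and skips the header row.
list_cordoned_nodes() {
  awk 'NR > 1 && $2 ~ /SchedulingDisabled/ { print $1 }'
}

# Usage against a live cluster (requires kubectl and a valid kubeconfig):
#   kubectl get nodes | list_cordoned_nodes
#
# Query the MemoryPressure condition of a single node:
#   kubectl get node <node-name> \
#     -o jsonpath='{.status.conditions[?(@.type=="MemoryPressure")].status}'
```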

Querying the machines in the Supervisor context with the command "kubectl get machines -A" shows that the affected control plane nodes have a READY state of False and a PHASE of Pending:

NAMESPACE     NAME                     CLUSTER     NODE NAME                READY   AVAILABLE   UP-TO-DATE   PHASE     AGE    VERSION
[Namespace]   [Control_Plane_Node_1]   [Cluster]   [Control_Plane_Node_1]   True    True        True         Running   207d   v1.32.7+vmware.3-fips
[Namespace]   [Control_Plane_Node_2]   [Cluster]   [Control_Plane_Node_2]   False   False       True         Pending   13h    v1.32.7+vmware.3-fips
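
On a Supervisor with many clusters, the machine listing can be narrowed to the entries that are not in the Running phase. The following is a minimal sketch: the helper name list_unhealthy_machines is hypothetical, and the jsonpath expression assumes the Cluster API Machine object exposes its phase at .status.phase.

```shell
# Hypothetical helper: reads "namespace<TAB>name<TAB>phase" lines on stdin
# and prints every machine whose phase is anything other than Running.
list_unhealthy_machines() {
  awk -F '\t' 'NF >= 3 && $3 != "Running" { print $1 "/" $2 " (" $3 ")" }'
}

# Usage in the Supervisor context (requires kubectl and a valid kubeconfig):
#   kubectl get machines -A \
#     -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}' \
#     | list_unhealthy_machines
```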

The cluster is deployed with insufficient resources (for example, the best-effort-xsmall virtual machine class), which blocks safe node rotations and etcd quorum maintenance.
Reviewing the logs for the capi-controller-manager pod in the Supervisor with the command "kubectl logs <capi-controller-manager-pod-name> -n svc-tkg-domain-c8" reveals KubeadmControlPlane reconciler errors that indicate a failure to move etcd leadership:

E0521 HH:MM:SS.##### 1 controller.go:474] "Reconciler error" err="failed to move leadership to candidate Machine [Machine-ID]: failed to create etcd client: etcd leader is reported as [Node-ID] with name \"[Machine-ID]m\", but we couldn't find a corresponding Node in the cluster" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="[Machine-ID]" reconcileID="[Reconcile-ID]"
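
To confirm that the leadership error is recurring rather than a one-off, the log can be searched for the message text. This is a sketch only: the helper name count_leadership_errors is hypothetical, and the search string is taken from the sample log line above.

```shell
# Hypothetical helper: count log lines on stdin that report a failed
# etcd leadership move. Prints 0 when no line matches.
count_leadership_errors() {
  grep -c 'failed to move leadership to candidate Machine' || true
}

# Usage in the Supervisor context (requires kubectl and a valid kubeconfig):
#   kubectl logs <capi-controller-manager-pod-name> -n svc-tkg-domain-c8 \
#     | count_leadership_errors
```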

Environment

VMware vSphere Kubernetes Service

Cause

This issue occurs when the cluster is deployed with insufficient resources, such as the best-effort-xsmall virtual machine class.
These resource constraints prevent safe node rotations and etcd quorum maintenance. When memory is exhausted on the control plane node, the kubelet reports an EvictionThresholdMet warning, and the node is marked as SchedulingDisabled to protect cluster stability.
The memory exhaustion also prevents etcd from functioning correctly, so the KubeadmControlPlane controller fails when it attempts to move leadership to the degraded node.

Resolution

Update the VKS Cluster VM Class to a guaranteed class type to increase the CPU and memory resources of the control plane node. For detailed instructions, see Update a VKS Cluster by Editing the VM Class. If the cluster reconciliation fails to proceed, contact Broadcom Technical Support.

Additional Information

To avoid overcommitting resources, production workloads should use the guaranteed class type. To avoid running out of memory, do not use the small or extra small class size for any worker node where workloads are deployed, in any environment (development, test, or production), as described in Using VM Classes with VKS Clusters.