Control Plane Nodes Report "SchedulingDisabled" and "EvictionThresholdMet" Due to Memory Exhaustion
search cancel

Control Plane Nodes Report "SchedulingDisabled" and "EvictionThresholdMet" Due to Memory Exhaustion

book

Article ID: 441447

calendar_today

Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • When cluster nodes are queried with command "kubectl get nodes", the control plane node status is displayed as SchedulingDisabled with the role listed as <none>

NAME                     STATUS                     ROLES           AGE     VERSION
[Control_Plane_Node_1]   Ready                      control-plane   2d15h   v1.32.7+vmware.3-fips
[Control_Plane_Node_2]   Ready,SchedulingDisabled   <none>          26h     v1.32.7+vmware.3-fips

  • Describing the affected node using kubectl describe node [Control_Plane_Node_2] -n [Namespace], memory exhaustion is indicated by an EvictionThresholdMet warning:
Events:    
Type       Reason                 Age                        From      Message
-----      --------               -----                      -----     --------
Warning    EvictionThresholdMet   4m18s (x1133 over 4h27m)   kubelet   Attempting to reclaim memory
  • Querying the machines in the Supervisor context using the command "kubectl get machines -A" indicates the affected control plane nodes have a READY state of False and a PHASE of Pending:
NAMESPACE     NAME                     CLUSTER    NODE NAME                READY   AVAILABLE   UP-TO-DATE   PHASE     AGE    VERSION
[Namespace]   [Control_Plane_Node_1]   [Cluster]  [Control_Plane_Node_1]   True    True        True         Running   207d   v1.32.7+vmware.3-fips
[Namespace]   [Control_Plane_Node_2]   [Cluster]  [Control_Plane_Node_2]   False   False       True         Pending   13h    v1.32.7+vmware.3-fips
  • The cluster is deployed with insufficient resources (such as the best-effort-xsmall virtual machine class), which blocks safe node rotations and etcd quorum maintenance.

  • Reviewing the logs for the capi-controller-manager pod with command "kubectl logs <capi-controller-manager-pod-name> -n svc-tkg-domain-c8" within the supervisor namespace reveals KubeadmControlPlane reconciler errors indicating a failure to move etcd leadership:
E0521 HH:MM:SS.##### 1 controller.go:474] "Reconciler error" err="failed to move leadership to candidate Machine [Machine-ID]: failed to create etcd client: etcd leader is reported as [Node-ID] with name \"[Machine-ID]m\", but we couldn't find a corresponding Node in the cluster" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="[Machine-ID]" reconcileID="[Reconcile-ID]"

Environment

VMware vSphere Kubernetes Service

Cause

  • This issue occurs when the cluster is deployed with insufficient resources, such as best-effort-xsmall virtual machine class.

  • These resource constraints prevent safe node rotations and etcd quorum maintenance. As a result, memory exhaustion on the control plane node triggers an EvictionThresholdMet condition by the kubelet, and the node is marked as SchedulingDisabled to protect cluster stability.

  • This also prevents etcd from functioning correctly, causing the KubeadmControlPlane controller to fail when attempting to move leadership to the degraded node.

Resolution

Update the VKS Cluster VM Class to a guaranteed class type to increase the CPU and memory resources of the control plane node. For detailed instructions, see Update a VKS Cluster by Editing the VM Class. If the cluster reconciliation fails to proceed, contact Broadcom Technical Support.

Additional Information

To avoid overcommitting resources, production workloads should use the guaranteed class type. To avoid running out of memory, do not use the small or extra small class size for any worker node where workloads are deployed in any environment (development, test, or production) as mentioned in Using VM Classes with VKS Clusters.