Kubernetes nodes keep flapping between Ready and NotReady status in a vSphere Kubernetes Service (VKS) cluster

Article ID: 431020

Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • The cluster node keeps getting into a NotReady state.

    # kubectl get node -o wide 

    NAME                           STATUS     ROLES           AGE   VERSION                 INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                 KERNEL-VERSION   CONTAINER-RUNTIME
    cp-name-hndtr-mqn4m           Ready      control-plane   27d   v1.33.3+vmware.1-fips   10.244.0.54   <none>        VMware Photon OS/Linux   6.1.148-1.ph5    containerd://2.0.6+vmware.1-fips
    worker-name-np-6ora-sjp8qbk   Ready      <none>          27d   v1.33.3+vmware.1-fips   10.#.#.55     <none>        VMware Photon OS/Linux   6.1.148-1.ph5    containerd://2.0.6+vmware.1-fips
    worker-name-np-6ora-sjwrz65   NotReady   <none>          27d   v1.33.3+vmware.1-fips   10.#.#.50     <none>        VMware Photon OS/Linux   6.1.148-1.ph5    containerd://2.0.6+vmware.1-fips
    worker-name-np-6ora-sjx4txq   Ready      <none>          21d   v1.33.3+vmware.1-fips   10.#.#51      <none>        VMware Photon OS/Linux   6.1.148-1.ph5    containerd://2.0.6+vmware.1-fips
  • Describing the NotReady node (kubectl describe node) shows the message "Kubelet stopped posting node status" in the node conditions.

    Conditions:
      Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
      ----             ------    -----------------                 ------------------                ------              -------
      MemoryPressure   Unknown   Thu, 26 Feb 2026 21:33:08 -0700   Thu, 26 Feb 2026 21:34:43 -0700   NodeStatusUnknown   Kubelet stopped posting node status.
      DiskPressure     Unknown   Thu, 26 Feb 2026 21:33:08 -0700   Thu, 26 Feb 2026 21:34:43 -0700   NodeStatusUnknown   Kubelet stopped posting node status.
      PIDPressure      Unknown   Thu, 26 Feb 2026 21:33:08 -0700   Thu, 26 Feb 2026 21:34:43 -0700   NodeStatusUnknown   Kubelet stopped posting node status.
      Ready            Unknown   Thu, 26 Feb 2026 21:33:08 -0700   Thu, 26 Feb 2026 21:34:43 -0700   NodeStatusUnknown   Kubelet stopped posting node status.
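    The conditions above can be viewed directly by describing the NotReady node; the node name below is the example name and must be replaced with the actual node name:

    # kubectl describe node worker-name-np-6ora-sjwrz65

    # kubectl get node worker-name-np-6ora-sjwrz65 -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'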
  • kubectl top node shows no CPU and memory usage for the NotReady node.

    # kubectl top node

    NAME                           CPU(cores)   CPU%        MEMORY(bytes)   MEMORY%
    cp-name-hndtr-mqn4m           528m         27%         1488Mi          108%
    worker-name-np-6ora-sjp8qbk   81m          4%          1199Mi          87%
    worker-name-np-6ora-sjx4txq   77m          3%          1131Mi          82%
    worker-name-np-6ora-sjwrz65   <unknown>    <unknown>   <unknown>       <unknown>
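    To watch the node flap between Ready and NotReady over time, the node list and the node's events can be monitored; the node name below is the example name and must be replaced:

    # kubectl get nodes -w

    # kubectl get events -A --field-selector involvedObject.kind=Node,involvedObject.name=worker-name-np-6ora-sjwrz65 --sort-by=.lastTimestamp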

  • SSH to the problematic node fails or fails intermittently.

  • Cluster API (CAPI) keeps marking the node as unhealthy, and the node gets recreated if it does not return to the Ready state before reaching the Ready=False timeout threshold set on the MachineHealthCheck.

    +++var/log/pods/svc-tkg-domain-c142716_capi-controller-manager-685d788c7d-qrqdc_######-#######-05062c3afa90/manager/0.log+++

    2026-02-17T20:09:18.849295755Z stderr F I0217 20:09:18.849132       1 recorder.go:104] "Machine cluster-name-lnw/worker-name-np-6ora-sjwrz65 has unhealthy Node " logger="events" type="Normal" object={"kind":"Machine","namespace":"cluster-ns","name":"worker-name-np-6ora-sjwrz65","uid":"a8015209-####_#####-556447f8583a","apiVersion":"cluster.x-k8s.io/v1beta1","resourceVersion":"40870933"} reason="DetectedUnhealthy"
    2026-02-17T20:09:18.872372531Z stderr F I0217 20:09:18.872282       1 recorder.go:104] "Machine cluster-name-lnw/worker-name-np-6ora-sjwrz65 has unhealthy Node " logger="events" type="Normal" object={"kind":"Machine","namespace":"cluster-ns","name":"worker-name-np-6ora-sjwrz65","uid":"a8015209-####_#####-556447f8583a","apiVersion":"cluster.x-k8s.io/v1beta1","resourceVersion":"40870938"} reason="DetectedUnhealthy"
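    The Ready=False timeout that triggers remediation can be confirmed by inspecting the MachineHealthCheck object from the Supervisor cluster context; the namespace below is taken from the example log line, and <mhc-name> is a placeholder for the actual object name:

    # kubectl get machinehealthcheck -n cluster-ns

    # kubectl get machinehealthcheck <mhc-name> -n cluster-ns -o yaml

    The timeout is listed under spec.unhealthyConditions for the entry with type Ready and status "False".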

  • The node does not show high memory or CPU usage from the vCenter side.

  • Searching virtual machines by IP address in vCenter shows another virtual machine with the same IP address:
    • Log in to your vCenter Server using the vSphere Client web interface.
    • Use the global search bar at the top of the interface and simply type the IP address.
    • The search results should filter and display the matching VM/VMs.
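    If the govc CLI is available and already configured against the vCenter Server, the same search can be scripted; the IP below is masked in the same way as the examples above and must be replaced with the node's real IP address:

      # govc find / -type m -guest.ipAddress 10.#.#.50

    More than one result for the same IP address indicates a conflict.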

 

Environment

VMware vSphere Kubernetes Service

Cause

  • The VKS cluster node enters a NotReady state during an IP conflict because the kubelet (the agent on the node) can no longer maintain a reliable heartbeat with the kube-apiserver.

Resolution

  • Correct the IP address duplication.

Additional Information

If searching virtual machines by IP address in vCenter does not show any duplicate IP address, the following steps can help confirm whether the node getting into the NotReady state is due to a duplicate IP address.

  • Disconnect the NIC of the virtual machine (the problematic node that is in the NotReady state) using the VMware Host Client:

    1. Log in: Open the VMware Host Client by entering the ESXi host IP address in a web browser.
    2. Locate VM: Click on Virtual Machines in the navigator pane and select the target VM.
    3. Edit Settings: Click Edit in the top menu bar.
    4. Toggle NIC State:
      • Expand the Network Adapter section.

      • Disconnect: uncheck the Connected box.

    5. Save: Click Save.
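    Alternatively, if the govc CLI is configured against the vCenter Server, the NIC can be disconnected and later reconnected from the command line; the VM name is a placeholder, and the network device name (for example ethernet-0) should be confirmed with device.ls first:

    # govc device.ls -vm <vm-name>

    # govc device.disconnect -vm <vm-name> ethernet-0

    # govc device.connect -vm <vm-name> ethernet-0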
  • Check whether you can still ssh to or ping the problematic node's IP address while its NIC is disconnected (see the example commands after this list).
    • If you can, another machine is using the same IP address; correct the IP address duplication.
    • If you cannot, this is not a duplicate IP address issue, and you will need to log in from the virtual machine console to investigate what is causing the node to get into the NotReady state.
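    For example, from a jump host on the same network, while the node's NIC is still disconnected (the IP is masked as in the examples above and must be replaced; VKS guest cluster nodes typically accept SSH as the vmware-system-user account):

    # ping -c 3 10.#.#.50

    # ssh vmware-system-user@10.#.#.50

    If the arping utility is available on the jump host, it can also show which MAC address is answering for the IP, which helps identify the conflicting machine:

    # arping -c 3 -I eth0 10.#.#.50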