Kubernetes node enters NotReady state after an ESXi host shuts down unexpectedly

Article ID: 407078

Updated On:

Products

  • VMware Telco Cloud Automation
  • VMware Tanzu Kubernetes Grid

Issue/Introduction

  • After an unexpected shutdown of an ESXi host, some of the nodes residing on it enter the "NotReady" state. The logs show the following errors:

2025-08-06T06:30:28.533136548Z stderr F 2025-08-06T06:30:28Z [error] failed to access the primary CNI configuration from /host/etc/cni/net.d/10-calico.conflist: failed to read the cluster primary CNI config /host/etc/cni/net.d/10-calico.conflist: open /host/etc/cni/net.d/10-calico.conflist: no such file or directory    

2025-08-06T06:30:28.537181567Z stderr F 2025-08-06T06:30:28Z [error] failed to read the primary CNI plugin config from /host/etc/cni/net.d/10-calico.conflist

  • After the ESXi host rebooted, the node remained stuck in the "NotReady" state:

[xxxxx@33eb-viotolboxpcf ~]$ kubectl get nodes
NAME                                                     STATUS                        ROLES    AGE   VERSION
22rr-xxx-xxx-01-xxx-xxx-xxxx-application-xxxx-xxxx-xxxx  Ready                         <none>   49d   v1.28.7+vmware.1
22rr-xxx-xxx-01-xxx-xxx-xxxx-xx-xxxx-xxxx-xxxx           NotReady,SchedulingDisabled   <none>   49d   v1.28.7+vmware.1
22rr-xxx-xxx-01-xxx-xxx-xxxx-xx-xxxx-xxxx-xxxx           Ready                         <none>   49d   v1.28.7+vmware.1
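On a cluster with many nodes, the affected ones can be picked out of this listing with a small filter. The following is a minimal sketch; the function name not_ready_nodes is hypothetical, and it assumes the default column layout of "kubectl get nodes --no-headers" (node name first, status second).

```shell
# Hypothetical helper: read "kubectl get nodes --no-headers" output on stdin
# and print the name of every node whose STATUS column starts with NotReady.
# This also matches combined states such as "NotReady,SchedulingDisabled".
not_ready_nodes() {
    awk '$2 ~ /^NotReady/ { print $1 }'
}

# Example invocation (requires access to the cluster):
#   kubectl get nodes --no-headers | not_ready_nodes
```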

Environment

  • VMware Telco Cloud Automation: 3.1.1
  • VMware Tanzu Kubernetes Grid: 2.5.1

Cause

  • A race condition between the Multus and Calico pods causes the nodes to enter the "NotReady" state.
  • When the host reboots, Multus attempts to generate its CNI configuration before Calico’s primary config (10-calico.conflist) is available. As a result, 00-multus.conf is created with zero size, leaving the CNI plugin uninitialized.
  • Because "/etc/cni/net.d/00-multus.conf" is zero bytes in size, the CNI is not ready and the Kubernetes node becomes "NotReady".
  • The error messages in the logs show Multus failing to read Calico’s primary CNI config.
  • The zero-byte 00-multus.conf confirms that the config generation process failed midway.
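To confirm the zero-byte condition described above, the file can be inspected directly on the affected node. The following is a minimal sketch meant to be run in a shell on that node; the function name check_cni_conf is hypothetical, and the default path is the one named in this article.

```shell
# Hypothetical helper: report whether a CNI config file is missing, empty, or present.
# Defaults to the Multus config path named in this article.
check_cni_conf() {
    conf="${1:-/etc/cni/net.d/00-multus.conf}"
    if [ ! -e "$conf" ]; then
        echo "MISSING: $conf"
        return 2
    elif [ ! -s "$conf" ]; then
        # -s is true only for files larger than zero bytes
        echo "EMPTY: $conf"
        return 1
    fi
    echo "OK: $conf"
}

# Example invocation on the affected node:
#   check_cni_conf
```

An "EMPTY" result matches the failure described in this article; a "MISSING" or "OK" result suggests a different cause for the NotReady state.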

Resolution

  • This behavior is documented as a known issue in the Telco Cloud Automation 3.0 Release Notes.

Workaround:

  • Restart the Multus pod "kube-multus".
  • The restart regenerates "/etc/cni/net.d/00-multus.conf" correctly.
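One way to restart the Multus pod is to delete it and let its controller recreate it. The following is a sketch under assumptions that do not come from this article: it assumes the pod runs in the kube-system namespace and carries the label app=multus (as in the upstream Multus manifest), and that a controller such as a DaemonSet recreates the pod after deletion. Verify both values in your environment, for example with "kubectl get pods -A -o wide | grep multus". The function name restart_multus_on_node and the MULTUS_NAMESPACE, MULTUS_SELECTOR, and KUBECTL variables are hypothetical.

```shell
# Hypothetical helper: delete the Multus pod scheduled on one node.
# ASSUMPTIONS (not from this article): namespace "kube-system", label "app=multus".
# Override them with MULTUS_NAMESPACE and MULTUS_SELECTOR if your cluster differs.
restart_multus_on_node() {
    node="$1"
    if [ -z "$node" ]; then
        echo "usage: restart_multus_on_node <node-name>" >&2
        return 64
    fi
    ns="${MULTUS_NAMESPACE:-kube-system}"
    selector="${MULTUS_SELECTOR:-app=multus}"
    # Set KUBECTL="echo kubectl" to print the command instead of running it.
    ${KUBECTL:-kubectl} delete pod -n "$ns" -l "$selector" \
        --field-selector "spec.nodeName=$node"
}

# Example invocation (replace <node-name> with the NotReady node):
#   KUBECTL="echo kubectl" restart_multus_on_node <node-name>   # preview only
#   restart_multus_on_node <node-name>                          # perform the restart
```

After the new pod starts, check that "/etc/cni/net.d/00-multus.conf" on the node is no longer zero bytes and that "kubectl get nodes" reports the node as "Ready".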