Error "Node is not healthy and is not accepting pods. Details Kubelet stopped posting node status" due to spherelet service failure on ESXi hosts in vSphere 7.0

Article ID: 332570

Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • The Web GUI displays the following error indicating that the Spherelet and Kubelet services are not functioning correctly on the ESXi hosts:

"Node is not healthy and is not accepting pods. Details Kubelet stopped posting node status"

  • When the node status is queried with the command "kubectl get nodes -A" from the supervisor, the agent is reported as NotReady:

    NAME                     STATUS     ROLES    AGE      VERSION
    [Node-UUID]              Ready      master   3d10h    v1.19.1+wcp.2
    [Node-UUID]              Ready      master   3d10h    v1.19.1+wcp.2
    [Node-UUID]              Ready      master   3d10h    v1.19.1+wcp.2
    [ESXi-Hostname]          NotReady   agent    3d10h    v1.19.1-sph-496a80d
    [ESXi-Hostname]          Ready      agent    3d10h    v1.19.1-sph-496a80d
    [ESXi-Hostname]          NotReady   agent    3d10h    v1.19.1-sph-496a80d

  • Log evidence from the API server (/var/log/vmware/fluentbit/consolidated.log) displays the following errors:

systemd.kubelet.service: {"hostname":"####-##-## ##:##:##(Node-UUID)####-##-## ##:##:##","unit":"kubelet","pid":"425","exe":"/opt/kubernetes/k8s-1.19/bin/kubelet","cmdline":"/usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=systemd --container-runtime=remote --container-runtime-endpoint=/run/containerd/containerd.sock --pod-infra-container-image=vmware/pause:1.19.0 --node-ip=<node_ip>","log":"E0108 remote_runtime.go:389] ExecSync'/bin/sh -c extender_reply=$(curl -k -s -o /dev/null -w %{http_code} https://##.##.##.##:12345/healthz); if [[ \"$extender_reply\" -lt 200 || \"$extender_reply\" -ge 400 ]]; then exit 1; fi; scheduler_healthy=false; for (( i=0; i<8; i++ )); do scheduler_reply=$(curl -k -s -o /dev/null -w %{http_code} http://##.##.##.##:10251/healthz); if [[ \"$scheduler_reply\" -ge 200 && \"$scheduler_reply\" -lt 400 ]]; then scheduler_healthy=true; break; fi; sleep 10; done; if [[ \"$scheduler_healthy\" = false ]]; then exit 1; fi;' from runtime service failed: rpc error: code = Unknown desc = failed to exec in container: failed to start exec: OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused \"exec: \\\"/bin/sh\\\": stat /bin/sh: no such file or directory\": unknown"}]

  • Log evidence in /var/log/vmware/wcp/wcpsvc.log shows the kubenotready error:
vcenter.wcp.node.kubenotready","localized":{"OPTIONAL":"Node is not healthy and is not accepting pods. Details Kubelet stopped posting node status.."},"params":{"OPTIONAL":null}}}}},"severity":"ERROR"}}},
{"STRUCTURE":{"com.vmware.vcenter.namespace_management.clusters.message":{"details":{"OPTIONAL":{"STRUCTURE":{"com.vmware.vapi.std.localizable_message":{"args":["Kubelet stopped posting node status."],"default_message":"Node is not healthy and is not accepting pods. Details Kubelet stopped posting node status..","id":"vcenter.wcp.node.kubenotready","localized":{"OPTIONAL":"Node is not healthy and is not accepting pods. Details Kubelet stopped posting node status.."}

  • The spherelet service status on the affected ESXi host shows that the service is not running:

[root@ESXi-Host:/var/log] /etc/init.d/spherelet status

####-##-## ##:##:##,303 init.d/spherelet Log fetcher support: True
####-##-## ##:##:##,330 init.d/spherelet spherelet is not running
####-##-## ##:##:##,330 init.d/spherelet spherelet is not running
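
The first two checks above can be condensed into small shell helpers. This is a minimal sketch, not part of the product: the function names not_ready_nodes and count_kubenotready are hypothetical, and both helpers only read text on standard input, so they assume the column layout and message id shown in the excerpts above.

```shell
# Hypothetical helper: print the NAME of every node whose STATUS column
# is NotReady. Expects `kubectl get nodes` output on stdin; NR > 1 skips
# the header row.
not_ready_nodes() {
  awk 'NR > 1 && $2 == "NotReady" { print $1 }'
}

# Hypothetical helper: count the log lines on stdin that carry the
# kubenotready message id. grep -c prints 0 but exits non-zero when
# nothing matches, so `|| true` keeps the return status at 0.
count_kubenotready() {
  grep -c 'vcenter\.wcp\.node\.kubenotready' || true
}
```

For example, `kubectl get nodes | not_ready_nodes` lists the affected ESXi agents, and `count_kubenotready < /var/log/vmware/wcp/wcpsvc.log` reports how many log lines contain the error id on the system that holds that log.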

Environment

VMware vSphere Kubernetes Service

Cause

This issue occurs when OCI runtime execution fails because of missing paths (for example, the "stat /bin/sh: no such file or directory" error in the kubelet log above) or service failures during process startup. As a result, the spherelet service cannot run correctly on the affected ESXi hosts, and those hosts are reported as NotReady.

Resolution

This issue is resolved in VMware vSphere 7.0 Update 2 and later.

Workaround:

  1. Access the affected ESXi host via a secure shell (SSH) or the console.

  2. Verify the status of the spherelet service by running the command:

    /etc/init.d/spherelet status

  3. If the service is reported as not running, start the spherelet service with the command:

    /etc/init.d/spherelet start
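
Steps 2 and 3 can also be combined into one conditional check. The following is a minimal sketch under stated assumptions: the function names spherelet_needs_start and ensure_spherelet are hypothetical, and the check relies only on the "is not running" text shown in the status output earlier in this article. It uses plain POSIX shell syntax.

```shell
# Hypothetical helper: succeeds (returns 0) when the given status text
# reports that spherelet is not running.
spherelet_needs_start() {
  case "$1" in
    *"is not running"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical wrapper: query the service status and start spherelet
# only when the status text says it is not running.
ensure_spherelet() {
  status_output="$(/etc/init.d/spherelet status 2>&1)"
  if spherelet_needs_start "$status_output"; then
    /etc/init.d/spherelet start
  else
    echo "Status does not report 'is not running'; no action taken."
  fi
}
```

After defining both functions in an SSH session on the affected host, run ensure_spherelet, then repeat the status command from step 2 to confirm the result.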