Supervisor Cluster Control Plane Nodes Randomly Restart Due to Incompatible Spherelet Version

Article ID: 419024

Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • Supervisor Cluster Control Plane nodes experience random downtime, leading to instability in production applications. The Supervisor Cluster typically displays an unhealthy status caused by frequent restarts of the etcd and kube-apiserver pods.

  • To verify the pod status, execute the following command: kubectl get pods -A | egrep "NAMESPACE|etcd|apiserver". An unhealthy environment will show a high number of restarts for these components, as seen in the example output below:

# kubectl get pods -A | egrep "NAMESPACE|etcd|apiserver"
NAMESPACE                         NAME                                                              READY   STATUS      RESTARTS        AGE
kube-system                       etcd-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx                             1/1     Running     13 (36m ago)    204d
kube-system                       etcd-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx                             1/1     Running     2 (32m ago)     204d
kube-system                       etcd-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx                             1/1     Running     29 (27m ago)    204d
kube-system                       kube-apiserver-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx                   1/1     Running     178 (36m ago)   204d
kube-system                       kube-apiserver-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx                   1/1     Running     232 (32m ago)   204d
kube-system                       kube-apiserver-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx                   1/1     Running     327 (27m ago)   204d
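As a quick triage aid, the restart counts from the command above can be filtered with a short pipeline. The helper name and the threshold of 10 restarts below are illustrative choices, not an official health criterion:

```shell
# Flag pods whose RESTARTS column exceeds a threshold (default: 10).
# Feed it the output of: kubectl get pods -A | egrep "etcd|apiserver"
# Columns with -A are: NAMESPACE NAME READY STATUS RESTARTS AGE,
# so $2 is the pod name and $5 is the restart count.
flag_restarts() {
  awk -v t="${1:-10}" '$5+0 > t { print $2, "restarts:", $5 }'
}

# Usage:
#   kubectl get pods -A --no-headers | egrep "etcd|apiserver" | flag_restarts 10
```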

  • Additionally, the wcpsvc logs report a readiness status of unknown for various nodes:

[YYYY-MM-DDTHH:MM:SS] debug wcp [kubelifecycle/node_reflector.go:124] Handling update for node 4200xxxxxxxxxxxxxxxxxxxxxxxxxxd9 with status unknown.
[YYYY-MM-DDTHH:MM:SS] debug wcp [kubelifecycle/node_reflector.go:124] Handling update for node 4200xxxxxxxxxxxxxxxxxxxxxxxxxxd9 with status unknown.
[YYYY-MM-DDTHH:MM:SS] debug wcp [kubelifecycle/node_reflector.go:124] Handling update for node 4200xxxxxxxxxxxxxxxxxxxxxxxxxxd9 with status unknown.
[YYYY-MM-DDTHH:MM:SS] debug wcp [kubelifecycle/node_reflector.go:124] Handling update for node 4200xxxxxxxxxxxxxxxxxxxxxxxxxx20 with status unknown.
[YYYY-MM-DDTHH:MM:SS] debug wcp [kubelifecycle/node_reflector.go:124] Handling update for node 4200xxxxxxxxxxxxxxxxxxxxxxxxxx20 with status unknown.
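To see how often each node is reported in this state, the log entries above can be tallied per node ID. The helper name is illustrative; the log path below is the usual wcpsvc location on the vCenter Server appliance and may differ in your deployment:

```shell
# Count "status unknown" node updates per node ID in the wcpsvc log,
# sorted with the most frequently affected node first.
count_unknown() {
  sed -n 's/.*Handling update for node \([^ ]*\) with status unknown.*/\1/p' \
    "${1:-/var/log/vmware/wcp/wcpsvc.log}" \
    | sort | uniq -c | sort -rn
}
```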

Environment

  • VMware vSphere Kubernetes Service

Cause

  • This issue occurs when a "Non-Critical Baseline" is attached to the cluster, resulting in the installation of an incompatible Spherelet version.
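To confirm which Spherelet version a baseline has installed, the VIB inventory on each ESXi host can be inspected with `esxcli software vib list`. The helper below is a minimal sketch: it assumes the VIB name contains "spherelet", which may not match every release's naming:

```shell
# Extract the Spherelet VIB version from `esxcli software vib list` output.
# Column layout is: Name  Version  Vendor  Acceptance Level  Install Date,
# so $1 is the VIB name and $2 its version.
spherelet_version() {
  awk 'tolower($1) ~ /spherelet/ { print $2 }'
}

# Usage on an ESXi host:
#   esxcli software vib list | spherelet_version
```

Compare the reported version against the compatibility matrix linked under Additional Information below.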

Resolution

To resolve the instability, detach the offending baseline and bring the Supervisor up to a current, compatible version so that a compatible Spherelet version is installed.

Step 1: Detach the Non-Critical Baseline

Detach the "Non-Critical Host Patches (Predefined)" baseline from the cluster to prevent further installation of incompatible components.

Review the following predefined baselines and detach any that are attached:

  • Non-Critical Host Patches (Predefined)
  • Host Security Patches (Predefined)
  • Critical Host Patches (Predefined)

Step 2: Upgrade Supervisor Services

Upgrade the Supervisor to a compatible version to ensure the correct Spherelet versions are deployed across the environment. Detailed instructions can be found in the Upgrade a Supervisor documentation.
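After the upgrade completes, verify that the control plane has stabilized. The helper below is an illustrative sketch that lists any node not in the Ready state, using the standard `kubectl get nodes` column layout (NAME STATUS ROLES AGE VERSION):

```shell
# List nodes whose STATUS column is anything other than "Ready".
# Feed it the output of: kubectl get nodes --no-headers
not_ready_nodes() {
  awk '$2 != "Ready" { print $1, $2 }'
}

# Usage:
#   kubectl get nodes --no-headers | not_ready_nodes
```

An empty result, together with restart counts that no longer climb for the etcd and kube-apiserver pods, indicates the cluster has recovered.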

Additional Information

Spherelet - ESXi - VC Compatibility