Supervisor Cluster Control Plane Nodes Randomly Restart Due to Incompatible Spherelet Version

Article ID: 419024

Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • Supervisor Cluster Control Plane nodes experience random downtime, leading to instability in production applications. The Supervisor Cluster typically displays an unhealthy status caused by frequent restarts of the etcd and kube-apiserver pods.

  • To verify the pod status, execute the following command: kubectl get pods -A | egrep "NAMESPACE|etcd|apiserver". An unhealthy environment will show a high number of restarts for these components, as seen in the example output below:

# kubectl get pods -A | egrep "NAMESPACE|etcd|apiserver"
NAMESPACE                         NAME                                                              READY   STATUS      RESTARTS        AGE
kube-system                       etcd-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx                             1/1     Running     13 (36m ago)    204d
kube-system                       etcd-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx                             1/1     Running     2 (32m ago)     204d
kube-system                       etcd-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx                             1/1     Running     29 (27m ago)    204d
kube-system                       kube-apiserver-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx                   1/1     Running     178 (36m ago)   204d
kube-system                       kube-apiserver-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx                   1/1     Running     232 (32m ago)   204d
kube-system                       kube-apiserver-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx                   1/1     Running     327 (27m ago)   204d
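As a quick triage aid, the restart counts from the command above can be filtered with a short pipeline. The helper name and the threshold of 10 restarts below are illustrative choices, not an official health criterion:

```shell
# Flag pods whose RESTARTS column exceeds a threshold (default: 10).
# Feed it the output of: kubectl get pods -A | egrep "etcd|apiserver"
# Columns with -A are: NAMESPACE NAME READY STATUS RESTARTS AGE,
# so $2 is the pod name and $5 is the restart count.
flag_restarts() {
  awk -v t="${1:-10}" '$5+0 > t { print $2, "restarts:", $5 }'
}

# Usage:
#   kubectl get pods -A --no-headers | egrep "etcd|apiserver" | flag_restarts 10
```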

  • Additionally, the wcpsvc logs report a readiness status of unknown for various nodes:

[YYYY-MM-DDTHH:MM:SS] debug wcp [kubelifecycle/node_reflector.go:124] Handling update for node 4200xxxxxxxxxxxxxxxxxxxxxxxxxxd9 with status unknown.
[YYYY-MM-DDTHH:MM:SS] debug wcp [kubelifecycle/node_reflector.go:124] Handling update for node 4200xxxxxxxxxxxxxxxxxxxxxxxxxxd9 with status unknown.
[YYYY-MM-DDTHH:MM:SS] debug wcp [kubelifecycle/node_reflector.go:124] Handling update for node 4200xxxxxxxxxxxxxxxxxxxxxxxxxxd9 with status unknown.
[YYYY-MM-DDTHH:MM:SS] debug wcp [kubelifecycle/node_reflector.go:124] Handling update for node 4200xxxxxxxxxxxxxxxxxxxxxxxxxx20 with status unknown.
[YYYY-MM-DDTHH:MM:SS] debug wcp [kubelifecycle/node_reflector.go:124] Handling update for node 4200xxxxxxxxxxxxxxxxxxxxxxxxxx20 with status unknown.
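To see how often each node is reported in this state, the log entries above can be tallied per node ID. The helper name is illustrative; the log path below is the usual wcpsvc location on the vCenter Server appliance and may differ in your deployment:

```shell
# Count "status unknown" node updates per node ID in the wcpsvc log,
# sorted with the most frequently affected node first.
count_unknown() {
  sed -n 's/.*Handling update for node \([^ ]*\) with status unknown.*/\1/p' \
    "${1:-/var/log/vmware/wcp/wcpsvc.log}" \
    | sort | uniq -c | sort -rn
}
```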

Environment

  • VMware vSphere Kubernetes Service

Cause

  • This issue occurs when a "Non-Critical Baseline" is attached to the cluster, resulting in the installation of an incompatible Spherelet version.
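To confirm which Spherelet version a baseline has installed, the VIB inventory on each ESXi host can be inspected with `esxcli software vib list`. The helper below is a minimal sketch: it assumes the VIB name contains "spherelet", which may not match every release's naming:

```shell
# Extract the Spherelet VIB version from `esxcli software vib list` output.
# Column layout is: Name  Version  Vendor  Acceptance Level  Install Date,
# so $1 is the VIB name and $2 its version.
spherelet_version() {
  awk 'tolower($1) ~ /spherelet/ { print $2 }'
}

# Usage on an ESXi host:
#   esxcli software vib list | spherelet_version
```

Compare the reported version against the compatibility matrix linked under Additional Information below.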

Resolution

To resolve the instability, detach the offending baseline and bring the Supervisor up to a current, compatible version so that a compatible Spherelet version is installed.

Step 1: Detach the Non-Critical Baseline

Detach the "Non-Critical Host Patches (Predefined)" baseline from the cluster to prevent further installation of incompatible components.

Review the following predefined baselines and detach any that are attached:

  • Non-Critical Host Patches (Predefined)
  • Host Security Patches (Predefined)
  • Critical Host Patches (Predefined)

Step 2: Upgrade Supervisor Services

Upgrade the Supervisor to a compatible version to ensure the correct Spherelet versions are deployed across the environment. Detailed instructions can be found in the Upgrade a Supervisor documentation.
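After the upgrade completes, verify that the control plane has stabilized. The helper below is an illustrative sketch that lists any node not in the Ready state, using the standard `kubectl get nodes` column layout (NAME STATUS ROLES AGE VERSION):

```shell
# List nodes whose STATUS column is anything other than "Ready".
# Feed it the output of: kubectl get nodes --no-headers
not_ready_nodes() {
  awk '$2 != "Ready" { print $1, $2 }'
}

# Usage:
#   kubectl get nodes --no-headers | not_ready_nodes
```

An empty result, together with restart counts that no longer climb for the etcd and kube-apiserver pods, indicates the cluster has recovered.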

Additional Information

Spherelet - ESXi - VC Compatibility