
Control Plane Nodes in a Container Service Extension Guest Cluster restart on a regular basis.


Article ID: 412522


Updated On:

Products

VMware Cloud Director

Issue/Introduction

  • Pods in the Kubernetes cluster restart every day at approximately the same time, with no obvious error reported.
  • In the cloud-controller-manager logs, you observe errors similar to the following:

E0425 07:36:36.085306       1 leaderelection.go:330] error retrieving resource lock kube-system/cloud-controller-manager: etcdserver: leader changed

E0425 07:37:07.522864       1 leaderelection.go:330] error retrieving resource lock kube-system/cloud-controller-manager: Get "https://<IP>:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cloud-controller-manager?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

E0425 07:37:12.522172       1 leaderelection.go:330] error retrieving resource lock kube-system/cloud-controller-manager: Get "https://<IP>:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cloud-controller-manager?timeout=5s": context deadline exceeded

I0425 07:37:12.522513       1 leaderelection.go:283] failed to renew lease kube-system/cloud-controller-manager: timed out waiting for the condition

F0425 07:37:12.522644       1 controllermanager.go:234] leaderelection lost

  • For the Control Plane Nodes, you notice large I/O spikes on the storage hosting the VMs.
  • Checking the etcd pod logs on the Control Plane nodes shows slow fdatasync warnings (a broader check is sketched after this list):

kubectl logs -n kube-system etcd-control-plane-1234 --all-containers -f | grep -i "fdatasync"

{"level":"warn","ts":"2025-09-11T15:43:13.341189Z","caller":"wal/wal.go:805","msg":"slow fdatasync","took":"3.250706104s","expected-duration":"1s"}
{"level":"warn","ts":"2025-09-11T15:43:34.591388Z","caller":"wal/wal.go:805","msg":"slow fdatasync","took":"21.248814627s","expected-duration":"1s"}
{"level":"warn","ts":"2025-09-11T15:43:35.76005Z","caller":"wal/wal.go:805","msg":"slow fdatasync","took":"1.167246894s","expected-duration":"1s"}

Environment

  • VMware Cloud Director 10.6.1
  • Container Service Extension 4.2.1

Cause

This issue can occur when there is slowness on the underlying storage backing the Control Plane VMs.

  • The etcd database in Kubernetes is used for leader elections.
  • Its consensus protocol depends on persistently storing metadata to a log: a majority of etcd cluster members must write every request down to disk. In addition, etcd incrementally checkpoints its state to disk so it can truncate this log. If these writes take too long, heartbeats may time out and trigger an election, undermining the stability of the cluster.

https://etcd.io/docs/v3.3/op-guide/hardware/
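The guide above also recommends benchmarking the disk that backs the etcd data directory. Below is a minimal sketch using fio, assuming fio is available on the Control Plane node and that etcd uses the default /var/lib/etcd data directory (the kubeadm default); it writes to a scratch sub-directory on the same disk rather than into the live WAL directory. As a rough guideline, the etcd documentation suggests the 99th percentile of fdatasync durations should stay below about 10 ms.

# Hypothetical disk check: measure fdatasync latency on the disk backing etcd
mkdir -p /var/lib/etcd/fio-test
fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/lib/etcd/fio-test --size=22m --bs=2300 --name=etcd-disk-check
rm -rf /var/lib/etcd/fio-test

The fsync/fdatasync percentile section of the fio output is the figure to compare against the guideline above.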

Resolution

Review the environment with your storage/infrastructure team and isolate the cause of the storage performance bottleneck. The steps below (and the sketch after them) show how to review disk latency for the Control Plane node VMs.

  1. Open the vSphere Client and locate the Control Plane node VMs for the affected cluster.
  2. Select one of the Control Plane node VMs and navigate to Monitor > Performance > Advanced > Chart Options.
  3. Under Chart Metrics, select Disk, change the Timespan to Last day (or a Custom interval covering the restarts), and select the Highest latency and Usage counters.
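In parallel with the vSphere charts, the etcd members can be queried for their own view of the cluster, for example to see whether leadership is moving between nodes. This is a sketch only, assuming the kubeadm-default certificate paths under /etc/kubernetes/pki/etcd and the placeholder pod name used earlier; adjust both to match the cluster.

# Hypothetical example: show member status (leader, DB size, errors) from inside an etcd pod
kubectl -n kube-system exec etcd-control-plane-1234 -- etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key endpoint status --cluster -w table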

Once no latency spikes are observed on the storage, confirm whether the restarts still occur.
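A simple way to confirm this is to watch the restart counters of the Control Plane pods after the storage issue has been addressed; the counts should stop increasing. A minimal sketch, assuming the usual kubeadm pod naming in kube-system:

# Restart counts for the Control Plane components should remain stable
kubectl get pods -n kube-system -o wide | grep -E "etcd|kube-apiserver|kube-controller-manager|kube-scheduler|cloud-controller-manager"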