While connected to a workload cluster through kubectl vsphere login, or while logged in to a control plane node directly over SSH, kubectl commands frequently fail with the below error message:
etcdserver: leader changed
vSphere Supervisor
This issue can occur regardless of whether the cluster is managed by Tanzu Mission Control (TMC).
ETCD functions in a quorum with one of its instances elected as leader.
Networking or resource issues in the affected cluster are causing ETCD to detect that the current leader is not responding within the expected timeframe, which triggers a leader election in an attempt to recover. ETCD uses port 2379 for client requests and port 2380 for peer communication between members.
Changing the ETCD leader can lead to a brief window in which kubectl commands return the error "etcdserver: leader changed", but when the new ETCD leader is healthy, the issue is not expected to recur.
In this scenario, the affected cluster is experiencing networking or resource issues across all control plane nodes. As a result, whichever member currently holds leadership is considered unhealthy, and the leader role continues to switch between quorum members as described above.
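To help distinguish a networking problem from a resource problem, basic reachability on both ETCD ports can be checked from one control plane node to the others. The below is a minimal sketch assuming a bash shell on the node; 192.168.10.11 is a placeholder for the IP address of another control plane node:
timeout 3 bash -c '</dev/tcp/192.168.10.11/2380' && echo "peer port 2380 open" || echo "peer port 2380 unreachable"
timeout 3 bash -c '</dev/tcp/192.168.10.11/2379' && echo "client port 2379 open" || echo "client port 2379 unreachable"
If the peer port is unreachable or intermittently slow between control plane nodes, the repeated leader elections are more likely network-driven than resource-driven.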
Connect to one of the affected control plane nodes in the cluster through SSH:
Check the logs of the ETCD container on each affected control plane node:
crictl ps --name etcd
crictl logs <etcd container id>
If time-outs and slow responses are logged, this may be an indication that high resource usage is slowing down response times.
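One way to surface such entries is to filter the ETCD logs for slow-request warnings. This is a sketch; the exact wording varies between ETCD versions, but warnings containing "took too long" are a common indicator of slow disk or network I/O:
crictl logs $(crictl ps --name etcd -q) 2>&1 | grep -i "took too long"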
Check for high resource usage on the control plane node.
kubectl top pods --all-namespaces --sort-by=memory
kubectl top nodes --sort-by=memory
If the high resource usage is attributed to kube-apiserver, this may be caused by a large number of resource objects and/or a high number of requests being made to the kube-apiserver.
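To get a rough view of request volume, the kube-apiserver's own request counters can be inspected. The below is a sketch assuming a Kubernetes version that exposes the apiserver_request_total metric; older versions use a different metric name:
kubectl get --raw=/metrics | grep '^apiserver_request_total' | sort -k2 -rn | head -20
The highest counters show which resource types and verbs have received the most requests since the kube-apiserver last restarted.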
Check for a large count of Kubernetes objects stored in the cluster. The below command searches for any object counts higher than 100:
kubectl get --raw=/metrics | grep apiserver_storage_objects | awk '$2>100' | sort -n -k 2
Large numbers of Kubernetes objects can not only cause high resource usage, but can also fill up the ETCD database. By default, ETCD's storage quota is 2GB; once the database reaches that size, ETCD raises an alarm and rejects further writes until space is reclaimed.
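The current database size can be checked from a control plane node with etcdctl. The below is a minimal sketch assuming a kubeadm-style layout where the ETCD certificates are stored under /etc/kubernetes/pki/etcd; the paths may differ in your environment, and if etcdctl is not installed on the node, the same command can be run inside the ETCD container with crictl exec:
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key endpoint status --write-out=table
The DB SIZE column in the resulting table can be compared against the 2GB default quota.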
Use caution when cleaning up Kubernetes objects. Reach out to VMware by Broadcom Technical Support, referencing this KB, for assistance.
Check for and clean up any pods in Error, Evicted or ContainerStatusUnknown state; the below command lists them, and a cleanup sketch follows it.
Kubernetes by default does not clean up pods in these states, which means failed pods can easily accumulate over time.
kubectl get pods -A -o wide | egrep -v "Run|Completed"
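If support confirms the failed pods are safe to remove, they can be deleted in bulk. The below is a sketch; it assumes the pods to be removed all report the Failed phase, which typically covers Error, Evicted and ContainerStatusUnknown pods. Review the output of the previous command before deleting anything:
kubectl delete pods -A --field-selector=status.phase=Failed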