After initiating an upgrade of a vSphere with Tanzu Guest Cluster, the upgrade does not appear to be progressing
Describing the cluster with kubectl get <cluster | tkc> -n <cluster_namespace> <cluster_name> shows False in the cluster's READY column (see the example after this symptom list)
The cluster's control plane VMs have high CPU and/or memory usage
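The symptoms above can be confirmed with the checks below (a hypothetical sketch: the cluster name, TKR name, and column values are illustrative, some output columns are abbreviated, and kubectl top requires a metrics provider in the guest cluster):

kubectl get tkc -n <cluster_namespace> <cluster_name>

NAME             CONTROL PLANE   WORKER   TKR NAME                   AGE   READY
<cluster_name>   3               3        v1.24.9---vmware.1-tkg.4   10d   False

While logged in to the guest cluster context, check node resource usage:

kubectl top nodes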
vSphere with Tanzu 8.x
Verify the health of the cluster's etcd service:
1. Find the control plane VMs' IP addresses:

kubectl get vm -n <cluster_namespace> -o wide

2. SSH to one of the control plane VMs using the IP found above, then switch to root:

sudo -i

3. Create an alias for etcdctl that supplies the etcd certificates:

alias etcdctl='/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/*/fs/usr/local/bin/etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt'

4. Check the health of the etcd cluster:

etcdctl -w table endpoint --cluster health
If an endpoint's HEALTH column is false and its ERROR is context deadline exceeded, a client warning similar to the following is also logged:

{"level":"warn","ts":"<timestamp>","logger":"client","caller":"client/v3@<version>/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002d8e00/<IP>:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
+-----------------------------------+--------+--------------+---------------------------+
| ENDPOINT | HEALTH | TOOK | ERROR |
+-----------------------------------+--------+--------------+---------------------------+
| https://<CONTROL_PLANE_VM_1>:2379 | true | 7.039328ms | |
| https://<CONTROL_PLANE_VM_2>:2379 | true | 8.873411ms | |
| https://<CONTROL_PLANE_VM_3>:2379 | false | 5.004791585s | context deadline exceeded |
+-----------------------------------+--------+--------------+---------------------------+
If the endpoint reports context deadline exceeded, re-run the health check with a longer timeout. etcdctl's default command timeout is 5 seconds, so a member that is slow to respond can fail the default check even though it is healthy:

etcdctl --command-timeout=10s -w table endpoint --cluster health
+-----------------------------------+--------+--------------+---------------------------+
| ENDPOINT | HEALTH | TOOK | ERROR |
+-----------------------------------+--------+--------------+---------------------------+
| https://<CONTROL_PLANE_VM_1>:2379 | true | 7.039328ms | |
| https://<CONTROL_PLANE_VM_2>:2379 | true | 8.873411ms | |
| https://<CONTROL_PLANE_VM_3>:2379 | true | 8.0091585s | |
+-----------------------------------+--------+--------------+---------------------------+
In the example above, the endpoint responds in roughly 8 seconds: healthy, but slow. Next, check the size of the etcd database:

ls -ltrh /var/lib/etcd/member/snap

etcd's maximum database size is 2 GB, and in most environments it is well beneath this. A size approaching 2 GB (such as 1.5 GB) can be explained by the number of worker nodes in the cluster, the age of the cluster, the services installed in the cluster, and so on. A large etcd database can also indicate a large amount of activity in etcd that is causing performance issues.

If there does not appear to be an issue with etcd, please contact VMware Support by Broadcom for assistance.
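The database size can also be read directly from etcd, using the etcdctl alias defined earlier. This standard etcdctl subcommand prints a DB SIZE column for each member, which should roughly agree with the on-disk size checked above:

etcdctl -w table endpoint status --cluster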