vSphere with Tanzu Guest cluster upgrade is hanging because of etcd

Article ID: 392916

Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • After initiating an upgrade of a vSphere with Tanzu Guest Cluster, you notice that the upgrade does not appear to be progressing

  • Checking the cluster with kubectl get <cluster | tkc> -n <cluster_namespace> <cluster_name> shows that the cluster's READY column is False (example commands are shown after this list)

  • The cluster's control plane VMs have high CPU and/or high Memory usage
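For reference, a minimal sketch of the symptom check above, run from the Supervisor context, is shown below. The namespace and cluster names are placeholders to replace with your own values; the describe command is an optional extra for reviewing the cluster's conditions.

    # From the Supervisor context, check the guest cluster's READY status.
    kubectl get cluster -n <cluster_namespace> <cluster_name>
    kubectl get tkc -n <cluster_namespace> <cluster_name>

    # Optional: describe the object to review its conditions in more detail.
    kubectl describe cluster -n <cluster_namespace> <cluster_name>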

Environment

vSphere with Tanzu 8.x

Resolution

Verify the health of the cluster's etcd service:

    1. Obtain the IP address of one of the guest cluster's control plane VMs using one of these methods:
      • Use the vSphere Client to find the guest cluster node VM. The IP address will be in the Virtual Machine Details card 
      • While using the Supervisor context, run kubectl get vm -n <cluster_namespace> -o wide to find the VM's IP address

    2. Connect to one of the guest cluster's control plane VMs using one of these methods: 
      1. Connect to the TKG Service Cluster Control Plane as a Kubernetes Administrator
      2. SSH to TKG Service Cluster Nodes as the System User Using a Private Key

    3. Run the following commands to get etcd's health:
      1. sudo -i
      2. alias etcdctl='/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/*/fs/usr/local/bin/etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt'
      3. etcdctl -w table endpoint --cluster health
    4. Note: If one of the nodes shows false in the HEALTH column and the ERROR column shows context deadline exceeded, you will see output similar to the following:
      • {"level":"warn","ts":"<timestamp>","logger":"client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002d8e00/<IP>:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
        +-----------------------------------+--------+--------------+---------------------------+
        |             ENDPOINT              | HEALTH |     TOOK     |           ERROR           |
        +-----------------------------------+--------+--------------+---------------------------+
        | https://<CONTROL_PLANE_VM_1>:2379 |   true |   7.039328ms |                           |
        | https://<CONTROL_PLANE_VM_2>:2379 |   true |   8.873411ms |                           |
        | https://<CONTROL_PLANE_VM_3>:2379 |  false | 5.004791585s | context deadline exceeded |
        +-----------------------------------+--------+--------------+---------------------------+
    5. If you see that one of the nodes is unhealthy, you can run the command again with a timeout higher than the default 5 seconds (etcdctl --command-timeout=10s -w table endpoint --cluster health):
      • +-----------------------------------+--------+--------------+---------------------------+
        |             ENDPOINT              | HEALTH |     TOOK     |           ERROR           |
        +-----------------------------------+--------+--------------+---------------------------+
        | https://<CONTROL_PLANE_VM_1>:2379 |   true |   7.039328ms |                           |
        | https://<CONTROL_PLANE_VM_2>:2379 |   true |   8.873411ms |                           |
        | https://<CONTROL_PLANE_VM_3>:2379 |   true |   8.0091585s |                           |
        +-----------------------------------+--------+--------------+---------------------------+
    6. If all the nodes now show as healthy, the node that was previously timing out may be busy with the upgrade process and/or experiencing performance issues due to its configured CPU and memory resources. You can re-run the etcd health command and check whether the TOOK values decrease over time. The TOOK value should stabilize and become similar to the other nodes in the cluster.

    7. You can also check the size of the etcd database by running ls -ltrh /var/lib/etcd/member/snap. etcd's maximum size is 2 GB, and in most environments the database is well beneath this. If you see a size approaching 2 GB (such as 1.5 GB), it could be due to the number of worker nodes in the cluster, the age of the cluster, the services installed in the cluster, and so on. A larger etcd database can also indicate a large amount of activity in the etcd database that is causing performance issues. (A consolidated example of the etcd checks from these steps is shown below.)
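For convenience, the etcd checks from the steps above can be run as a single interactive session on a control plane VM, as sketched below. The final endpoint status command is an optional addition, not part of the steps above, that also reports the database size as etcd itself sees it.

    # Run interactively on one of the guest cluster's control plane VMs.
    sudo -i

    # Same alias as step 3; the wildcard resolves to the containerd snapshot
    # directory that contains the etcdctl binary.
    alias etcdctl='/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/*/fs/usr/local/bin/etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt'

    # Check member health with the default 5 second timeout, then with 10 seconds.
    etcdctl -w table endpoint --cluster health
    etcdctl --command-timeout=10s -w table endpoint --cluster health

    # Check the on-disk size of the etcd database (step 7).
    ls -ltrh /var/lib/etcd/member/snap

    # Optional: "endpoint status" reports each member's database size (DB SIZE column).
    etcdctl -w table endpoint --cluster status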

If there doesn't seem to be an issue with etcd, please contact VMware Support by Broadcom for assistance.