Supervisor cluster unstable after upgrade to vCenter 8.0U3 and Supervisor Version v1.28.3 caused by Supervisor Control Plane VMs running out of disk space.


Article ID: 381590


Updated On:

Products

Tanzu Kubernetes Runtime

Issue/Introduction

After upgrading to vCenter 8.0U3 and Supervisor version v1.28.3, the cluster becomes unstable:

Supervisor services are not available

kubectl login to the Supervisor cluster does not work

One or more Supervisor control plane VMs are running out of disk space

root@************** [ ~ ]# df -h|grep /dev/root

/dev/root        32G   30G   45M 100% /
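
To see where the space is going, check the largest top-level directories first (a minimal sketch; du and sort are standard GNU coreutils on the control plane's Photon OS):

du -xh --max-depth=1 / 2>/dev/null | sort -rh | head    # largest directories on the root filesystem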

If all three Supervisor control plane VMs are up and running again, check node and etcd health:

Example:

root@************** [ ~ ]# k get nodes
NAME                                     STATUS   ROLES                  AGE     VERSION
**************        Ready    control-plane,master   21m     v1.28.3+vmware.wcp.1
**************        Ready    control-plane,master   64s     v1.28.3+vmware.wcp.1
**************        Ready    control-plane,master   9m57s   v1.28.3+vmware.wcp.1
**************   Ready    agent                  10d     v1.28.2-sph-e515410
**************   Ready    agent                  10d     v1.28.2-sph-e515410
**************   Ready    agent                  10d     v1.28.2-sph-e515410
**************   Ready    agent                  10d     v1.28.2-sph-e515410

root@************** [ ~ ]# etcdctl --cluster=true endpoint health -w table
+--------------------------+--------+------------+-------+
|         ENDPOINT         | HEALTH |    TOOK    | ERROR |
+--------------------------+--------+------------+-------+
| https://**************:2379 |   true | 4.671596ms |       |
| https://**************:2379 |   true | 7.120376ms |       |
| https://**************:2379 |   true | 7.356998ms |       |
+--------------------------+--------+------------+-------+

root@************** [ ~ ]# etcdctl member list -w table
+------------------+---------+----------------------------------+--------------------------+--------------------------+------------+
|        ID        | STATUS  |               NAME               |        PEER ADDRS        |       CLIENT ADDRS       | IS LEARNER |
+------------------+---------+----------------------------------+--------------------------+--------------------------+------------+
| ************** | started | ************** | https://**************:2380 | https://**************:2379 |      false |
| ************** | started | ************** | https://**************:2380 | https://**************:2379 |      false |
| ************** | started | ************** | https://**************:2380 | https://**************:2379 |      false |
+------------------+---------+----------------------------------+--------------------------+--------------------------+------------+
root@************** [ ~ ]# etcdctl --cluster=true endpoint status -w table
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://**************:2379 | ************** |  3.5.11 |  176 MB |     false |      false |        83 |  584410086 |          584410086 |        |
| https://**************:2379 | ************** |  3.5.11 |  176 MB |      true |      false |        83 |  584410086 |          584410086 |        |
| https://**************:2379 | ************** |  3.5.11 |  176 MB |     false |      false |        83 |  584410086 |          584410086 |        |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
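
As a quick programmatic health check, you can verify that exactly one etcd member considers itself leader (a sketch, assuming jq is available on the node; otherwise read the IS LEADER column in the table above):

etcdctl --cluster=true endpoint status -w json | jq '[.[] | select(.Status.leader == .Status.header.member_id)] | length'
# expected output: 1 (exactly one leader)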

 

Environment

The GUI error message may look similar to one of the following:

Initialized vSphere resources
Deployed Control Plane VMs
Configured Control Plane VMs
Configured Load Balancer fronting the kubernetes API Server
Configured Core Supervisor Services
  Service: velero.vsphere.vmware.com. Status: Configuring
  Service: tkg.vsphere.vmware.com. Reason: Reconciling. Message: Reconciling.

or 

Initialized vSphere resources
Deployed Control Plane VMs
Configured Control Plane VMs
Configured Load Balancer fronting the kubernetes API Server
Configured Core Supervisor Services
  Service: tkg.vsphere.vmware.com. Reason: ReconcileFailed. Message: kapp: Error: waiting on reconcile packageinstall/tkg-controller (packaging.carvel.dev/v1alpha1) namespace: ###-###-######-#####:
  Finished unsuccessfully (Reconcile failed: (message: kapp: Error: Timed out waiting after 15m0s for resources: [deployment/tkgs-plugin-server (apps/v1) namespace: ###-###-######-#####])).
  Service: velero.vsphere.vmware.com. Reason: Reconciling. Message: Reconciling.

Customized guest of Supervisor Control plane VM
Configuration error (since 11/7/2024, 4:57:26 AM)
System error occurred on Master node with identifier **************. Details: Log forwarding sync update failed: Command '['/usr/bin/kubectl', '--kubeconfig', '/etc/kubernetes/admin.conf', 'get', 'configmap', 'fluentbit-config-system', '--namespace', 'vmware-system-logging', '--ignore-not-found=true', '-o', 'json']' returned non-zero exit status 1..

 

 

Cause

The Supervisor control plane VM's disk fills up because container images from older Supervisor versions are not properly cleaned up.
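
To confirm this on a node, list the container images known to the runtime; a long list of images tagged with older Supervisor versions points to this issue (assuming crictl is available on the control plane VM, which uses containerd):

crictl images              # list all images known to containerd
crictl images | wc -l      # rough count of images on the node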

 

Resolution

Run the following commands to remove failed WCP backups, old journal logs, and archived audit logs, freeing up some space.

To SSH into the Supervisor control plane VMs, follow this KB: https://knowledge.broadcom.com/external/article?legacyId=90194
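
Per that KB, the Supervisor control plane IP and root password can be retrieved from the vCenter Server appliance shell before connecting:

/usr/lib/vmware-wcp/decryptK8Pwd.py    # prints the control plane VM IP and root password
ssh root@<control-plane-IP>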

journalctl --vacuum-time=2d

cd /var/log/vmware/audit

rm *log.gz 

cd /var/lib/vmware/wcp/backup

rm *

Check disk space again:

df -h|grep /dev/root
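
For convenience, the same cleanup can be run as one sequence (a sketch of the steps above; review each directory's contents before deleting):

journalctl --vacuum-time=2d             # trim journal logs older than 2 days
rm /var/log/vmware/audit/*log.gz        # remove compressed audit logs
rm /var/lib/vmware/wcp/backup/*         # remove failed wcp backups
df -h | grep /dev/root                  # verify space was reclaimed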

If etcd is healthy, run the cleanup scripts "cleanup_stale_replicasets.py" and "clean_stale_images.py" attached to this KB on the Supervisor control plane nodes.

If etcd is not healthy, open a case with Broadcom Support.

Run both scripts attached to this KB in the following order:

1) On each Supervisor control plane node, run this script to clean up the stale ReplicaSets from older versions:

python cleanup_stale_replicasets.py --run

2) On each Supervisor control plane node, run this script to delete the images that are not part of an active ReplicaSet:

python clean_stale_images.py --run

Note: Without the --run option, the scripts run in dry-run mode and do not delete any ReplicaSets or images.
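
A safe workflow on each control plane node is to dry-run first, review the output, and only then delete:

python cleanup_stale_replicasets.py          # dry run: prints the stale ReplicaSets it would remove
python cleanup_stale_replicasets.py --run    # delete stale ReplicaSets
python clean_stale_images.py                 # dry run: prints the stale images it would remove
python clean_stale_images.py --run           # delete stale images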

 

Attachments

clean_stale_images.py
cleanup_stale_replicasets.py