Supervisor cluster unstable after upgrade to vCenter 8.0U3 and Supervisor Version v1.28.3 caused by Supervisor Control Plane VMs running out of disk space.


Article ID: 381590


Updated On:

Products

Tanzu Kubernetes Runtime

Issue/Introduction

After upgrading to vCenter 8.0U3 and Supervisor version v1.28.3, the cluster becomes unstable:

Supervisor services are not available

kubectl login to the Supervisor cluster does not work

One or more Supervisor control plane VMs are running out of disk space

root@************** [ ~ ]# df -h|grep /dev/root

/dev/root        32G   30G   45M 100% /
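
To see where the space is going, check the largest top-level directories first (a minimal sketch; du and sort are standard GNU coreutils on the control plane's Photon OS):

du -xh --max-depth=1 / 2>/dev/null | sort -rh | head    # largest directories on the root filesystem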

If all three Supervisor control plane VMs are up and running again, check node and etcd health:

Example:

root@************** [ ~ ]# k get nodes
NAME                                     STATUS   ROLES                  AGE     VERSION
**************        Ready    control-plane,master   21m     v1.28.3+vmware.wcp.1
**************        Ready    control-plane,master   64s     v1.28.3+vmware.wcp.1
**************        Ready    control-plane,master   9m57s   v1.28.3+vmware.wcp.1
**************   Ready    agent                  10d     v1.28.2-sph-e515410
**************   Ready    agent                  10d     v1.28.2-sph-e515410
**************   Ready    agent                  10d     v1.28.2-sph-e515410
**************   Ready    agent                  10d     v1.28.2-sph-e515410

root@************** [ ~ ]# etcdctl --cluster=true endpoint health -w table
+--------------------------+--------+------------+-------+
|         ENDPOINT         | HEALTH |    TOOK    | ERROR |
+--------------------------+--------+------------+-------+
| https://**************:2379 |   true | 4.671596ms |       |
| https://**************:2379 |   true | 7.120376ms |       |
| https://**************:2379 |   true | 7.356998ms |       |
+--------------------------+--------+------------+-------+

root@************** [ ~ ]# etcdctl member list -w table
+------------------+---------+----------------------------------+--------------------------+--------------------------+------------+
|        ID        | STATUS  |               NAME               |        PEER ADDRS        |       CLIENT ADDRS       | IS LEARNER |
+------------------+---------+----------------------------------+--------------------------+--------------------------+------------+
| ************** | started | ************** | https://**************:2380 | https://**************:2379 |      false |
| ************** | started | ************** | https://**************:2380 | https://**************:2379 |      false |
| ************** | started | ************** | https://**************:2380 | https://**************:2379 |      false |
+------------------+---------+----------------------------------+--------------------------+--------------------------+------------+
root@************** [ ~ ]# etcdctl --cluster=true endpoint status -w table
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://**************:2379 | ************** |  3.5.11 |  176 MB |     false |      false |        83 |  584410086 |          584410086 |        |
| https://**************:2379 | ************** |  3.5.11 |  176 MB |      true |      false |        83 |  584410086 |          584410086 |        |
| https://**************:2379 | ************** |  3.5.11 |  176 MB |     false |      false |        83 |  584410086 |          584410086 |        |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
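
As a quick programmatic health check, you can verify that exactly one etcd member considers itself leader (a sketch, assuming jq is available on the node; otherwise read the IS LEADER column in the table above):

etcdctl --cluster=true endpoint status -w json | jq '[.[] | select(.Status.leader == .Status.header.member_id)] | length'
# expected output: 1 (exactly one leader)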

 

Environment

The GUI error message may look similar to one of the following:

Initialized vSphere resources
Deployed Control Plane VMs
Configured Control Plane VMs
Configured Load Balancer fronting the kubernetes API Server
Configured Core Supervisor Services
  Service: velero.vsphere.vmware.com. Status: Configuring
  Service: tkg.vsphere.vmware.com. Reason: Reconciling. Message: Reconciling.

or 

Initialized vSphere resources
Deployed Control Plane VMs
Configured Control Plane VMs
Configured Load Balancer fronting the kubernetes API Server
Configured Core Supervisor Services
  Service: tkg.vsphere.vmware.com. Reason: ReconcileFailed. Message: kapp: Error: waiting on reconcile packageinstall/tkg-controller (packaging.carvel.dev/v1alpha1) namespace: ###-###-######-#####:
  Finished unsuccessfully (Reconcile failed: (message: kapp: Error: Timed out waiting after 15m0s for resources: [deployment/tkgs-plugin-server (apps/v1) namespace: ###-###-######-#####])).
  Service: velero.vsphere.vmware.com. Reason: Reconciling. Message: Reconciling.

Customized guest of Supervisor Control plane VM
Configuration error (since 11/7/2024, 4:57:26 AM)
System error occurred on Master node with identifier **************. Details: Log forwarding sync update failed: Command '['/usr/bin/kubectl', '--kubeconfig', '/etc/kubernetes/admin.conf', 'get', 'configmap', 'fluentbit-config-system', '--namespace', 'vmware-system-logging', '--ignore-not-found=true', '-o', 'json']' returned non-zero exit status 1..

 

 

Cause

The Supervisor control plane VM's disk fills up because container images from older Supervisor versions are not properly cleaned up.
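
To confirm this on a node, list the container images known to the runtime; a long list of images tagged with older Supervisor versions points to this issue (assuming crictl is available on the control plane VM, which uses containerd):

crictl images              # list all images known to containerd
crictl images | wc -l      # rough count of images on the node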

 

Resolution

Run the following commands to remove failed WCP backups, old journal logs, and archived audit logs, freeing up some space.

To SSH into the Supervisor control plane VMs, follow this KB: https://knowledge.broadcom.com/external/article?legacyId=90194
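
Per that KB, the Supervisor control plane IP and root password can be retrieved from the vCenter Server appliance shell before connecting:

/usr/lib/vmware-wcp/decryptK8Pwd.py    # prints the control plane VM IP and root password
ssh root@<control-plane-IP>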

journalctl --vacuum-time=2d

cd /var/log/vmware/audit

rm *log.gz 

cd /var/lib/vmware/wcp/backup

rm *

Check disk space again:

df -h|grep /dev/root
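
For convenience, the same cleanup can be run as one sequence (a sketch of the steps above; review each directory's contents before deleting):

journalctl --vacuum-time=2d             # trim journal logs older than 2 days
rm /var/log/vmware/audit/*log.gz        # remove compressed audit logs
rm /var/lib/vmware/wcp/backup/*         # remove failed wcp backups
df -h | grep /dev/root                  # verify space was reclaimed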

If etcd is healthy, run the cleanup scripts "cleanup_stale_replicasets.py" and "clean_stale_images.py" attached to this KB on the Supervisor control plane nodes.

If etcd is not healthy, open a case with Broadcom Support.

Run both scripts attached to this KB in the following order:

1) On each Supervisor control plane node, run this script to clean up the stale ReplicaSets from older versions:

python cleanup_stale_replicasets.py --run

2) On each Supervisor control plane node, run this script to delete the images that are not part of an active ReplicaSet:

python clean_stale_images.py --run

Note: Without the --run option, the scripts run in dry-run mode and do not delete any ReplicaSets or images.
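
A safe workflow on each control plane node is to dry-run first, review the output, and only then delete:

python cleanup_stale_replicasets.py          # dry run: prints the stale ReplicaSets it would remove
python cleanup_stale_replicasets.py --run    # delete stale ReplicaSets
python clean_stale_images.py                 # dry run: prints the stale images it would remove
python clean_stale_images.py --run           # delete stale images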

 

Attachments

clean_stale_images.py
cleanup_stale_replicasets.py