vSphere Supervisor Root Disk Space Full at 100%

Products

VMware vSphere Kubernetes Service
Tanzu Kubernetes Runtime
VMware vSphere 7.0 with Tanzu
vSphere with Tanzu

Issue/Introduction

Root disk usage has reached 100% on one or more Supervisor Cluster Control Plane VMs in a vSphere Supervisor environment. The root filesystem has no free space left, which causes disk pressure issues.

The following error message may be seen listed for one or more Supervisor control plane nodes when viewing the Config Status of the Supervisor under the Workload Management section of the vSphere Client:

System error occurred on Master node with identifier ################################. Details: Log forwarding sync update failed: Command '['/usr/bin/kubectl', '--kubeconfig', '/etc/kubernetes/admin.conf', 'get', 'configmap', 'fluentbit-config-system', '--namespace', 'vmware-system-logging', '--ignore-not-found=true', '-o', 'json']' returned non-zero exit status 1..

The Supervisor may show one node in an error state.

The following error message may be seen when curling the kube-apiserver endpoint due to etcd quorum loss:

curl --insecure https://<fip>:6443/healthz
>>
curl: (28) Failed to connect to <fip> port 6443 after 21051 ms: Could not connect to server

When connected over SSH to a Supervisor Control Plane VM, the root disk usage shows 100%:

See "How to SSH into Supervisor Control Plane VMs" in Troubleshooting vSphere with Tanzu (TKGS) Supervisor Control Plane VMs
- The floating IP (FIP) address output by the decryptK8Pwd python script may not be reachable due to disk space issues bringing down critical system processes.
- Use the management network (eth0) IP address directly assigned to a Supervisor Control Plane VM instead of the floating IP address (FIP).

Confirm the current root disk usage:

df -h /dev/root

Filesystem   Size  Used Avail Use% Mounted on
/dev/root     ##G   ##G   ##G 100% /

Many system processes will fail and continue to crash while any Supervisor Control Plane VM is at full root disk usage.

This includes the system service which assigns the floating IP address (FIP) to one of the Supervisor Control Plane VMs.

Environment

vSphere Supervisor

This issue can occur regardless of whether the environment is managed by Tanzu Mission Control (TMC).

Cause

Several factors can fill the root disk of a Supervisor control plane VM:

Log Accumulation: /var/log
ETCD Snapshots and Data
Container/Pod Logs: /var/log/pods
Leftover unused images and replicasets built up over time from previous Supervisor cluster upgrades
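
To see which of these locations is actually consuming space on a given control plane VM, a per-directory summary can help. The following is a minimal sketch rather than an official tool: the `report_usage` helper is hypothetical, and the directories passed to it are simply the paths named in this article.

```shell
#!/bin/sh
# Hypothetical helper: print the disk usage (in KiB) of each directory
# that exists, largest first. Missing directories are skipped.
report_usage() {
  for dir in "$@"; do
    if [ -d "$dir" ]; then
      # -s: one summary line per directory, -k: sizes in KiB
      du -sk "$dir" 2>/dev/null
    fi
  done | sort -rn
}

# Example call on a control plane VM (paths taken from this article):
# report_usage /var/log /var/log/pods /var/lib/etcd /var/lib/vmware/wcp/backup
```

Note that /var/log/pods is inside /var/log, so its usage is also included in the /var/log total.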

VMware by Broadcom Engineering is aware of the known issues below and is working on fixes and improvements to be included in vSphere 8.0 U3g, vSphere 9.0 and higher versions:

Failed log rotation of /var/log/vmware/upgrade-ctl-cli.log* files, which leaves multiple 1GB files, each with a number appended to its name
Unused images that build up over time and are left over from multiple Supervisor cluster upgrades
Unused replicasets that build up over time and are left over from multiple Supervisor cluster upgrades
Reduced disk space usage by system journal logging and other logging system services
Increased overall disk space on each Supervisor control plane VM

Resolution

If the root disk space in a Supervisor control plane VM reaches 100%, multiple system critical services will fail.

The root disk of every Supervisor control plane VM must be cleaned up to restore healthy operation of the Supervisor cluster.

Ideally, root disk space should be below 80%.
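
The 80% guideline can be checked with a short script. This is a sketch under stated assumptions: `usage_pct` and `disk_status` are hypothetical helper names, and the parsing assumes the POSIX output format of `df -P`, where the fifth column of the second line is the usage percentage.

```shell
#!/bin/sh
# Extract the numeric Use% value from `df -P` output supplied on stdin.
usage_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: compare a usage percentage with a threshold
# (default 80, the guideline given in this article).
disk_status() {
  pct=$1
  threshold=${2:-80}
  if [ "$pct" -ge "$threshold" ]; then
    echo "CLEANUP NEEDED: root at ${pct}% (threshold ${threshold}%)"
  else
    echo "OK: root at ${pct}% (threshold ${threshold}%)"
  fi
}

# Example call on a control plane VM:
# disk_status "$(df -P / | usage_pct)"
```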

WARNING: Deleting files without Support's advice can lead to further issues in or potential irrecoverable destruction of the environment.

__________________________________________________________________________________________________________________________

Note: This KB article focuses primarily on log clean-up. The latest log files should not be deleted; if space must be reclaimed from them, empty them in place instead.
For example, where example.log stands for the log file to be emptied:

echo > example.log
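
Emptying a file in place matters because a running service usually keeps the log file open. If the file is deleted instead, its space is not released until the service closes it. The sketch below wraps the same idea in a hypothetical `empty_log` helper that also reports how many bytes were in the file. It uses `: >`, which leaves a zero-byte file, whereas `echo >` leaves a single newline.

```shell
#!/bin/sh
# Hypothetical helper: empty a log file in place and report its previous size.
# The file itself is kept, so a service writing to it keeps a valid handle.
empty_log() {
  file=$1
  [ -f "$file" ] || { echo "not a regular file: $file" >&2; return 1; }
  before=$(wc -c < "$file")
  : > "$file"    # truncate to zero bytes without removing the file
  echo "emptied $file ($((before)) bytes)"
}

# Example call:
# empty_log example.log
```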


SSH into each of the Supervisor Control Plane VMs for the following steps:

See "How to SSH into Supervisor Control Plane VMs" in Troubleshooting vSphere with Tanzu (TKGS) Supervisor Control Plane VMs
- IMPORTANT: The floating IP (FIP) address output by the decryptK8Pwd python script may not be reachable due to disk space issues bringing down critical system processes.
- Use the management network (eth0) IP address directly assigned to a Supervisor Control Plane VM instead of the floating IP address (FIP).

Perform the following remediation steps on each Supervisor Control Plane VM in the Supervisor cluster:

Audit Logs - Affects all Supervisor Control Plane VMs
If there has been downtime due to disk space issues for some time, the audit logs for kube-apiserver will fill up rapidly with expected errors of the kube-apiserver being unreachable. The kube-apiserver is unreachable in this scenario because disk space issues have brought down critical system processes such as kube-apiserver and etcd.
```
ls -ltrh /var/log/vmware/audit
```
Confirm that the older .log.gz files in this directory are no longer needed before cleaning them up.
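
To review candidates before removing anything, the older archives can be listed by age. This is a minimal sketch: `list_old_archives` is a hypothetical helper, and the 7-day default is an illustrative assumption, not a retention recommendation from this article. It only prints file names and deletes nothing.

```shell
#!/bin/sh
# Hypothetical helper: list compressed log archives older than N days.
# It only prints candidates; nothing is deleted.
list_old_archives() {
  dir=$1
  days=${2:-7}
  # -mtime +N matches files last modified more than N whole days ago
  find "$dir" -maxdepth 1 -type f -name '*.log.gz' -mtime +"$days" | sort
}

# Example call on a control plane VM:
# list_old_archives /var/log/vmware/audit 7
```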

Backup Files - Affects all Supervisor Control Plane VMs
Backup files involved in taking a vCenter backup in VAMI with the Supervisor Cluster included will be stored in this directory.
```
ls -ltrh /var/lib/vmware/wcp/backup
```
Check whether any older backup files in this directory can be cleaned up.

upgrade-ctl-cli logs - Affects all Supervisor Control Plane VMs

There is a known bug where upgrade-ctl-cli.log log rotation is not working as expected, repeatedly filling up multiple upgrade-ctl-cli.log files (appended with a number) up to 1GB each.
```
ls -ltrh /var/log/vmware/upgrade-ctl-cli*
```
Older upgrade-ctl-cli.log.# files (those with a number appended) can be cleaned up to further reduce disk space usage. Do not delete the latest upgrade-ctl-cli.log file.

The below command can be run to limit the size of the log file to 10MB (IMPORTANT: Until a release containing the built-in fix is installed, this sed command must be re-applied after every Supervisor upgrade):
```
sed -i '/MAX_LOGFILE_SIZE_BYTES =/ s/1024 \* 1024 \* 1024/1024 * 1024 * 10/' /usr/lib/vmware-wcp/upgrade/upgrade-ctl.py
```
Both the clean-up of extra upgrade-ctl-cli.log.# files and the above sed command must be performed on each Supervisor control plane VM in the Supervisor cluster.
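
The substitution can be rehearsed on a scratch file before touching the real script. This sketch assumes the constant is written as `1024 * 1024 * 1024` in upgrade-ctl.py, which the replacement text suggests but which should be verified on the VM. The search pattern escapes each asterisk so that sed matches it literally instead of treating it as a repetition operator.

```shell
#!/bin/sh
# Rehearse the size-limit substitution on a scratch file, not the real script.
# ASSUMPTION: the constant is written as "1024 * 1024 * 1024" in the target file.
scratch=$(mktemp)
echo 'MAX_LOGFILE_SIZE_BYTES = 1024 * 1024 * 1024' > "$scratch"

# \* matches a literal asterisk; an unescaped * would act as a repetition operator.
sed -i '/MAX_LOGFILE_SIZE_BYTES =/ s/1024 \* 1024 \* 1024/1024 * 1024 * 10/' "$scratch"

result=$(cat "$scratch")
echo "$result"    # prints: MAX_LOGFILE_SIZE_BYTES = 1024 * 1024 * 10
rm -f "$scratch"
```

On the control plane VM, running `grep -n 'MAX_LOGFILE_SIZE_BYTES' /usr/lib/vmware-wcp/upgrade/upgrade-ctl.py` before and after the change shows whether the substitution took effect.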

journalctl logs - Affects all Supervisor Control Plane VMs

Journal logging contributes to disk usage but is not expected to be a major consumer of disk space. Vacuuming the journal can quickly free some space.
To check the current disk usage of the journal, run the below command:
```
journalctl --disk-usage
```
To remove the oldest archived journal files until the journal uses no more than 500M:
```
journalctl --vacuum-size=500M
```

Stale Replicasets and Stale Images - Affects all Supervisor Control Plane VMs

IMPORTANT: This requires healthy kube-apiserver and ETCD on all Supervisor Control Plane VMs. Perform the previous clean up steps on each VM first.

These stale replicasets and stale images are unused but leftover from previous Supervisor cluster upgrades.

Retrieve the total replicaset count in the Supervisor cluster (expected healthy count would be under 60):
```
kubectl get replicasets -A | wc -l
```
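
Because `wc -l` also counts the header line, the number printed above is one higher than the actual replicaset count. The sketch below uses kubectl's `--no-headers` flag to avoid that, and adds a hypothetical `count_zero_desired` helper that counts replicasets whose DESIRED value is 0. Treating zero desired replicas as a sign of a stale replicaset is an assumption made here for illustration; the clean-up scripts KB referenced in this section remains the authority on what is safe to remove.

```shell
#!/bin/sh
# Count replicasets whose DESIRED column is 0, reading the output of
# `kubectl get replicasets -A --no-headers` on stdin.
# Columns: NAMESPACE NAME DESIRED CURRENT READY AGE
count_zero_desired() {
  awk '$3 == 0 { n++ } END { print n + 0 }'
}

# Example calls on a control plane VM:
# kubectl get replicasets -A --no-headers | wc -l
# kubectl get replicasets -A --no-headers | count_zero_desired
```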
Duplicate stale images with different versions can be found by checking the container image list when connected over SSH to a Supervisor control plane VM:
```
crictl images list
```
The vSphere Supervisor Disk Space Clean Up Scripts KB provides scripts for cleaning up unused images and replicasets to free additional disk space.

ETCD

Confirm the ETCD status and database size. ETCD is full at 2GB:
```
etcdctl member list -w table
```
```
etcdctl --cluster=true endpoint health -w table
```
```
etcdctl --cluster=true endpoint status -w table
```
```
ls -ltrh /var/lib/etcd/member/snap
```
If the above steps show that ETCD is 2GB or more, or if etcdctl returns no output or reports that at least one control plane VM is unhealthy, reach out to VMware by Broadcom Technical Support for assistance, referencing this KB article.
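
When etcdctl does not respond, the size of the database file on disk can still be compared with the 2GB figure. This is a sketch under assumptions: `db_size_status` is a hypothetical helper, 2GB is taken to mean 2 GiB (2147483648 bytes), and the database file is assumed to be the file named `db` inside the snap directory listed above, which is the usual etcd layout. It only reads the file size and changes nothing.

```shell
#!/bin/sh
# Hypothetical helper: report whether a file has reached a size limit in bytes.
# The default limit is 2 GiB, matching the 2GB figure quoted in this article.
db_size_status() {
  file=$1
  limit=${2:-2147483648}
  size=$(wc -c < "$file")
  size=$((size))   # normalize any padding in the wc output
  if [ "$size" -ge "$limit" ]; then
    echo "AT LIMIT: $file is $size bytes (limit $limit)"
  else
    echo "below limit: $file is $size bytes (limit $limit)"
  fi
}

# Example call on a control plane VM (file name is an assumption):
# db_size_status /var/lib/etcd/member/snap/db
```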

Supervisor cluster upgrades can also reduce root disk space usage. If the disk fills up again afterwards, reach out to VMware by Broadcom Technical Support for assistance, referencing this KB article.

Additional Information

Log rotation and disk space improvements will be available in vSphere 9, vSphere 8.0 U3g and higher versions.