Root disk usage has reached 100% on one or more Supervisor Cluster Control Plane VMs in a vSphere Supervisor environment, causing the root filesystem to run out of space and triggering disk pressure issues.
The following error message may be seen listed for one or more Supervisor control plane nodes when viewing the Config Status of the Supervisor under the Workload Management section of the vSphere Client:
System error occurred on Master node with identifier ################################. Details: Log forwarding sync update failed: Command '['/usr/bin/kubectl', '--kubeconfig', '/etc/kubernetes/admin.conf', 'get', 'configmap', 'fluentbit-config-system', '--namespace', 'vmware-system-logging', '--ignore-not-found=true', '-o', 'json']' returned non-zero exit status 1..
When connected over SSH to a Supervisor Control Plane VM, the root disk usage shows 100%:
df -h /dev/root
Filesystem Size Used Avail Use% Mounted on
/dev/root ##G ##G ##G 100% /
Many system processes will fail and continue to crash while any Supervisor Control Plane VM is at full root disk usage.
This includes the system service which assigns the floating IP address (FIP) to one of the Supervisor Control Plane VMs.
vSphere Supervisor
This issue can occur regardless of whether the environment is managed by Tanzu Mission Control (TMC).
High root disk usage on Supervisor Control Plane VMs can have a variety of causes.
VMware by Broadcom Engineering is aware of the issue and is working on fixes to be included in vSphere 8.0 U3g, vSphere 9.0 and higher versions for the below known issues:
If the root disk space in a Supervisor control plane VM reaches 100%, multiple system critical services will fail.
The root disk on every Supervisor control plane VM must be cleaned up to restore healthy operation of the Supervisor cluster.
Ideally, root disk space should be below 80%.
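As a rough illustration of the 80% guideline, the sketch below reads the root filesystem usage from df and prints a verdict. The function names and the THRESHOLD variable are hypothetical helpers for this example and are not part of any VMware tooling.

```shell
#!/bin/sh
# Sketch: report root filesystem usage and compare it with the 80% guideline.
# THRESHOLD and the function names are illustrative only.
THRESHOLD=80

# Read `df -P` output on stdin and print the Capacity column without the % sign.
parse_use_pct() {
    awk 'NR > 1 { gsub(/%/, "", $5); pct = $5 } END { print pct }'
}

# Print a one-line verdict for a usage percentage.
root_usage_verdict() {
    if [ "$1" -ge "$THRESHOLD" ]; then
        echo "root disk at $1%: clean-up needed"
    else
        echo "root disk at $1%: below ${THRESHOLD}%"
    fi
}

root_usage_verdict "$(df -P / | parse_use_pct)"
```

The -P flag keeps the df output on one line per filesystem, which makes the fifth column safe to parse.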
__________________________________________________________________________________________________________________________
Some steps we can take to look further into this:
Note: If these steps do not help, please reach out to VMware by Broadcom Technical Support for assistance, referencing this KB article.
The following command lists the largest files on the node. Replace # in head -n # with the number of files to show, for example head -n 10 for the ten largest files:
find / -path /proc -prune -o -type f -exec du -Sh {} + | sort -rh | head -n #
This will tell us what is currently taking up most of the room on the disk.
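A per-directory summary can complement the per-file listing when many small files add up. The following is a minimal sketch, assuming a du that supports the -d option; top_dirs is a hypothetical helper name and /var/log is only an example starting point.

```shell
#!/bin/sh
# Sketch: summarize disk usage per directory, largest first (sizes in KiB).
# top_dirs is a hypothetical helper name.
top_dirs() {
    dir=$1
    count=${2:-10}
    # -x stays on one filesystem, -k reports KiB, -d 1 lists direct children only.
    du -xk -d 1 "$dir" 2>/dev/null | sort -rn | head -n "$count"
}

top_dirs /var/log 10
```

The first line is normally the total for the starting directory itself, followed by its largest subdirectories.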
Note: This KB article focuses primarily on log clean-up. The latest log files should not be deleted but, if need be, echoed empty.
For example, where example.log is an example of a log file to be echoed empty:
echo > example.log
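The echo form above leaves a single newline byte in the file, which is harmless. If a file of exactly zero bytes is preferred, a redirect of the shell no-op does the same job. This is a small sketch only; empty_log is a hypothetical helper name and example.log remains a placeholder.

```shell
#!/bin/sh
# Sketch: empty a log file in place without deleting it.
# empty_log is a hypothetical helper name.
empty_log() {
    # ':' is the shell no-op; redirecting it truncates the file to 0 bytes
    # while keeping the same file (and inode) for any process still writing to it.
    : > "$1"
}

# Usage (example.log is a placeholder, as in the text above):
# empty_log example.log
```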
Some remediation steps we can take which should be performed for each Supervisor Control Plane VM in the Supervisor cluster:
ls -ltrh /var/log/vmware/audit
Check that it is appropriate to clean up older .log.gz files in this directory.
ls -ltrh /var/lib/vmware/wcp/backup
Check that any older backup files can be cleaned up in this directory.
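To make the review of older files in these two directories easier, the sketch below lists files past a chosen age without deleting anything. The helper name list_older_than and the 30-day cutoff are assumptions for illustration, not a VMware recommendation.

```shell
#!/bin/sh
# Sketch: list files older than a given number of days for review before removal.
# list_older_than and the 30-day cutoff are illustrative assumptions.
list_older_than() {
    dir=$1
    pattern=$2
    days=$3
    # -mtime +N matches files last modified more than N days ago; nothing is deleted here.
    find "$dir" -maxdepth 1 -type f -name "$pattern" -mtime +"$days" -exec ls -lh {} +
}

# Review only; remove files manually after checking the list.
# list_older_than /var/log/vmware/audit '*.log.gz' 30
# list_older_than /var/lib/vmware/wcp/backup '*' 30
```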
upgrade-ctl-cli logs - Affects all Supervisor Control Plane VMs
There is a known bug where upgrade-ctl-cli.log log rotation is not working as expected, repeatedly filling up multiple upgrade-ctl-cli.log files (appended with a number) up to 1GB each.
ls -ltrh /var/log/vmware/upgrade-ctl-cli*
Older upgrade-ctl-cli.log.# files (where # is a number) can be deleted to further reduce disk usage. Do not delete the latest log file.
The below command can be run to limit the size of the log file to 10MB (IMPORTANT: This sed command must be re-applied after any Supervisor upgrade until a release containing the built-in fix is installed):
sed -i '/MAX_LOGFILE_SIZE_BYTES =/ s/1024 \* 1024 \* 1024/1024 \* 1024 \* 10/' /usr/lib/vmware-wcp/upgrade/upgrade-ctl.py
Both the clean-up of extra upgrade-ctl-cli.log.# files and the above sed command must be performed on each Supervisor control plane VM in the Supervisor cluster.
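Before editing upgrade-ctl.py in place, it may be useful to preview what the substitution does. The sketch below runs the same sed expression without -i, so nothing is modified. preview_log_limit is a hypothetical helper name, and the exact form of the MAX_LOGFILE_SIZE_BYTES line in the file is inferred from the sed pattern rather than confirmed.

```shell
#!/bin/sh
# Sketch: preview the size-limit substitution without modifying any file.
# The sed expression is the one from the step above, run without -i.
preview_log_limit() {
    sed '/MAX_LOGFILE_SIZE_BYTES =/ s/1024 \* 1024 \* 1024/1024 \* 1024 \* 10/'
}

# Show only the affected line as it would look after the change:
# preview_log_limit < /usr/lib/vmware-wcp/upgrade/upgrade-ctl.py | grep 'MAX_LOGFILE_SIZE_BYTES ='
# After applying the real command, confirm the new value:
# grep 'MAX_LOGFILE_SIZE_BYTES =' /usr/lib/vmware-wcp/upgrade/upgrade-ctl.py
```

If the grep after the real command still shows 1024 * 1024 * 1024, the pattern did not match and the limit was not changed.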
journalctl logs - Affects all Supervisor Control Plane VMs
Journal logs contribute to disk usage but are not expected to greatly impact disk space. It may help to trim the archived journal files to quickly free some space:
journalctl --disk-usage
journalctl --vacuum-size=500M
Stale Replicasets and Stale Images - Affects all Supervisor Control Plane VMs
Note: This requires healthy kube-apiserver and ETCD in the Supervisor cluster.
These stale replicasets and stale images are unused but leftover from previous Supervisor cluster upgrades.
Retrieve the total replicaset count in the Supervisor cluster (expected healthy count would be under 60):
kubectl get replicasets -A | wc -l
Duplicate stale images with different versions can be found by checking the container image list while connected over SSH to a Supervisor control plane VM:
crictl images list
The vSphere Supervisor Disk Space Clean Up Scripts KB provides scripts for cleaning up unused images and replicasets to further help with disk space.
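Note that piping kubectl output to wc -l also counts the header line, so the reported number is one higher than the actual replicaset count. The sketch below counts data rows only. The second helper counts rows whose DESIRED column is 0; treating those as clean-up candidates is an assumption for illustration and should be verified against the clean-up scripts KB before anything is removed. Both helper names are hypothetical.

```shell
#!/bin/sh
# Sketch: count ReplicaSets from `kubectl get replicasets -A` output on stdin.

# Count data rows, skipping the header line.
count_replicasets() {
    awk 'NR > 1 { n++ } END { print n + 0 }'
}

# Count rows whose DESIRED column (third field when -A is used) is 0.
# Assumption: these are candidates for review, not confirmed stale.
count_zero_desired() {
    awk 'NR > 1 && $3 == 0 { n++ } END { print n + 0 }'
}

# Usage on a Supervisor control plane VM:
# kubectl get replicasets -A | count_replicasets
# kubectl get replicasets -A | count_zero_desired
```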
ETCD
Confirm the ETCD status and database size. ETCD is full at 2GB:
etcdctl member list -w table
etcdctl --cluster=true endpoint health -w table
etcdctl --cluster=true endpoint status -w table
ls -ltrh /var/lib/etcd/member/snap
If the above steps show that ETCD is 2GB or more, or if etcdctl returns no output or reports that at least one control plane VM is unhealthy, reach out to VMware by Broadcom Technical Support for assistance, referencing this KB article.
Supervisor cluster upgrades can also reduce root disk space usage. If the disk space fills back up afterwards, also reach out to VMware by Broadcom Technical Support for assistance, referencing this KB article.
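As a quick cross-check of the database size, the sketch below compares the on-disk size of the ETCD database file with the 2GB limit. It assumes the database file is named db inside the snap directory listed above, and that 2GB means 2 * 1024^3 bytes. The function and variable names are hypothetical.

```shell
#!/bin/sh
# Sketch: compare the on-disk ETCD database size with the 2GB limit.
# DB_FILE assumes the database file is named "db" in the snap directory listed above.
DB_FILE=/var/lib/etcd/member/snap/db
LIMIT_BYTES=$((2 * 1024 * 1024 * 1024))

# Print "full" at or above the limit, otherwise the share of the limit in use.
etcd_size_status() {
    if [ "$1" -ge "$LIMIT_BYTES" ]; then
        echo "full"
    else
        echo "$(( $1 * 100 / LIMIT_BYTES ))% of limit"
    fi
}

# Only runs the check when the file exists on this node.
if [ -f "$DB_FILE" ]; then
    etcd_size_status "$(stat -c %s "$DB_FILE")"
fi
```

The DB SIZE column from etcdctl endpoint status remains the authoritative value; this check is only a convenience when etcdctl does not respond.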
Log rotation and disk space improvements will be available in vSphere 9, vSphere 8.0 U3g and higher versions.