vSphere Supervisor Root Disk Space Full at 100%

Article ID: 383369

Products

  • VMware vSphere Kubernetes Service
  • Tanzu Kubernetes Runtime
  • VMware vSphere 7.0 with Tanzu
  • vSphere with Tanzu

Issue/Introduction

Root disk usage has reached 100% on one or more Supervisor Cluster Control Plane VMs in a vSphere Supervisor environment, leaving no free space on the root filesystem and causing disk pressure issues.

The following error message may be listed for one or more Supervisor control plane nodes when viewing the Config Status of the Supervisor under the Workload Management section of the vSphere Client:

  • System error occurred on Master node with identifier ################################. Details: Log forwarding sync update failed: Command '['/usr/bin/kubectl', '--kubeconfig', '/etc/kubernetes/admin.conf', 'get', 'configmap', 'fluentbit-config-system', '--namespace', 'vmware-system-logging', '--ignore-not-found=true', '-o', 'json']' returned non-zero exit status 1..

When connected over SSH to a Supervisor Control Plane VM, root disk usage shows 100%:

  • See "How to SSH into Supervisor Control Plane VMs" in Troubleshooting vSphere with Tanzu (TKGS) Supervisor Control Plane VMs
    • The floating IP (FIP) address output by the decryptK8Pwd python script may not be reachable due to disk space issues bringing down critical system processes.
    • Use the management network (eth0) IP address directly assigned to a Supervisor Control Plane VM instead of the floating IP address (FIP).

  • Confirm the current root disk usage:
    df -h /dev/root
    
    Filesystem   Size  Used Avail Use% Mounted on
    /dev/root     ##G   ##G   ##G 100% /

Many system processes will fail and continue to crash while any Supervisor Control Plane VM is at full root disk usage.

This includes the system service which assigns the floating IP address (FIP) to one of the Supervisor Control Plane VMs.

Environment

vSphere Supervisor

This issue can occur regardless of whether the environment is managed by Tanzu Mission Control (TMC).

Cause

Root disk usage on the Supervisor control plane VMs can grow for several reasons:

  • Log Accumulation: /var/log
  • ETCD Snapshots and Data
  • Container/Pod Logs: /var/log/pods
  • Leftover unused images and replicasets built up over time from previous Supervisor cluster upgrades

VMware by Broadcom Engineering is aware of these issues and is working on fixes and improvements to be included in vSphere 8.0 U3g, vSphere 9.0 and higher versions, covering the following:

  • Failed log rotation of /var/log/vmware/upgrade-ctl-cli.log* files, which leaves multiple 1GB files, each suffixed with a number
  • Unused images built up over time and left over from multiple Supervisor cluster upgrades
  • Unused replicasets built up over time and left over from multiple Supervisor cluster upgrades
  • Further reduction of the disk space used by system journal logging and other logging system services
  • Increased overall disk space of each Supervisor control plane VM

Resolution

If the root disk space in a Supervisor control plane VM reaches 100%, multiple system critical services will fail.

Root disk space must be cleaned up on every Supervisor control plane VM to restore healthy operation of the Supervisor cluster.

Ideally, root disk usage should stay below 80%.
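
The 80% guideline can be checked quickly on each control plane VM. The following is a minimal sketch, not part of the official procedure: the function names usage_pct and check_root are hypothetical, and it assumes the POSIX output format of df -P, where the fifth column is the usage percentage.

```shell
#!/bin/sh
# usage_pct (hypothetical helper): read `df -P` output on stdin and print
# the usage percentage of the last data row as a bare integer.
usage_pct() {
    awk 'NR > 1 { gsub(/%/, "", $5); pct = $5 } END { print pct }'
}

# check_root (hypothetical helper): return 0 if root usage is below the
# threshold (default 80, per the guideline above), 1 otherwise.
check_root() {
    threshold="${1:-80}"
    pct="$(df -P / | usage_pct)"
    if [ "$pct" -ge "$threshold" ]; then
        echo "WARNING: root filesystem at ${pct}% (threshold ${threshold}%)"
        return 1
    fi
    echo "OK: root filesystem at ${pct}%"
}

# Report the current state; `|| true` keeps the script going on a warning.
check_root 80 || true
```

This only reads disk usage and changes nothing on the node.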

 

WARNING: Deleting files without Support's advice can lead to further issues in or potential irrecoverable destruction of the environment. 

__________________________________________________________________________________________________________________________

 

Steps to investigate what is consuming disk space:

Note: If these steps do not help, please reach out to VMware by Broadcom Technical Support for assistance, referencing this KB article.

  • Run the following command to list the largest files on the node. Replace # in head -n # with the number of files to show, for example head -n 10 for the ten largest files:

    find / -path /proc -prune -o -type f -exec du -Sh {} + | sort -rh | head -n #

    This will tell us what is currently taking up most of the room on the disk.

    Note: This KB article focuses primarily on log clean-up. Do not delete the latest (active) log file; if it must be cleared, empty it in place instead.
    For example, where example.log is the log file to be emptied:

    echo > example.log
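
Emptying a log in place matters because a process that still has the file open keeps a deleted file's blocks allocated until it closes the file, whereas truncating the existing file releases the space while keeping the same file (same inode). The sketch below demonstrates this on a temporary file; empty_log is a hypothetical helper name. It uses ": >" rather than "echo >", which behaves the same except that echo leaves a single newline byte in the file.

```shell
#!/bin/sh
# empty_log (hypothetical helper): truncate a file to zero bytes without
# removing it, so any process holding it open keeps a valid file.
empty_log() {
    : > "$1"
}

# Demonstrate on a throwaway file rather than on a real log.
demo="$(mktemp)"
printf 'line 1\nline 2\n' > "$demo"

inode_before="$(ls -i "$demo" | awk '{ print $1 }')"
empty_log "$demo"
inode_after="$(ls -i "$demo" | awk '{ print $1 }')"
size_after="$(wc -c < "$demo" | tr -d ' ')"

if [ "$inode_before" = "$inode_after" ]; then same=yes; else same=no; fi
echo "size after: ${size_after} bytes; same inode: ${same}"

rm -f "$demo"
```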

 

Remediation steps, to be performed on each Supervisor Control Plane VM in the Supervisor cluster:

  • Audit Logs - Affects all Supervisor Control Plane VMs
    If there has been downtime due to disk space issues for some time, the audit logs for kube-apiserver will fill up rapidly with expected errors of the kube-apiserver being unreachable. The kube-apiserver is unreachable in this scenario because disk space issues have brought down critical system processes such as kube-apiserver and etcd.

    ls -ltrh /var/log/vmware/audit

    Check that it is appropriate to clean up older .log.gz files in this directory.

  • Backup Files - Affects all Supervisor Control Plane VMs
    Backup files created when a vCenter backup that includes the Supervisor cluster is taken through VAMI are stored in this directory.

    ls -ltrh /var/lib/vmware/wcp/backup

    Check whether older backup files in this directory can be cleaned up.

  • upgrade-ctl-cli.logs - Affects all Supervisor Control Plane VMs

    There is a known bug where upgrade-ctl-cli.log log rotation is not working as expected, repeatedly filling up multiple upgrade-ctl-cli.log files (appended with a number) up to 1GB each.

    ls -ltrh /var/log/vmware/upgrade-ctl-cli*

    Older rotated files (upgrade-ctl-cli.log.#, where # is a number) can be cleaned up to reduce disk space usage. Do not delete the latest log file.

    The below command can be run to limit the size of the log file to 10MB (IMPORTANT: This sed command must be re-applied after any Supervisor upgrade until a version containing the built-in fix is installed):

    sed -i '/MAX_LOGFILE_SIZE_BYTES =/ s/1024 \* 1024 \* 1024/1024 \* 1024 \* 10/' /usr/lib/vmware-wcp/upgrade/upgrade-ctl.py

    Both the clean-up of extra upgrade-ctl-cli.log.# files and the above sed command must be performed on each Supervisor control plane VM in the Supervisor cluster.
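
Because the sed change has to be re-applied after upgrades, it can help to check whether a node still carries the 1GB setting. The following is a sketch only: needs_limit is a hypothetical helper, and the sample line is an assumed shape inferred from the sed pattern above, not copied from the real upgrade-ctl.py. The sed expression is run against a temporary file so nothing on the node is modified.

```shell
#!/bin/sh
# needs_limit (hypothetical helper): succeed if the file still defines
# MAX_LOGFILE_SIZE_BYTES with the 1GB expression targeted by the sed command.
needs_limit() {
    grep 'MAX_LOGFILE_SIZE_BYTES =' "$1" | grep -q '1024 \* 1024 \* 1024'
}

# Assumed line shape, for illustration only.
sample="$(mktemp)"
echo 'MAX_LOGFILE_SIZE_BYTES = 1024 * 1024 * 1024' > "$sample"

if needs_limit "$sample"; then before=yes; else before=no; fi

# Same expression as the KB command, without -i, applied to the sample file.
patched="$(sed '/MAX_LOGFILE_SIZE_BYTES =/ s/1024 \* 1024 \* 1024/1024 \* 1024 \* 10/' "$sample")"
printf '%s\n' "$patched" > "$sample"

if needs_limit "$sample"; then after=yes; else after=no; fi
echo "needs limit before: ${before}; after: ${after}"
echo "patched line: ${patched}"

rm -f "$sample"
```

On a control plane VM, needs_limit /usr/lib/vmware-wcp/upgrade/upgrade-ctl.py would indicate whether the sed command still needs to be run on that node.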

     

  • journalctl logs - Affects all Supervisor Control Plane VMs

    Journalctl logging can affect disk usage, but it is not expected to greatly impact disk space. It may help to vacuum the journal to quickly free some space.

    To check the current disk usage of journalctl logging:

    journalctl --disk-usage

    To remove the oldest archived journal files until the journal uses no more than 500M:

    journalctl --vacuum-size=500M

  • Stale Replicasets and Stale Images - Affects all Supervisor Control Plane VMs

    Note: This requires healthy kube-apiserver and ETCD in the Supervisor cluster.

    These stale replicasets and stale images are unused but leftover from previous Supervisor cluster upgrades.

    Retrieve the total replicaset count in the Supervisor cluster (expected healthy count would be under 60):

    kubectl get replicasets -A | wc -l

    Duplicate stale images with different versions can be found by checking the container images list while connected over SSH to a Supervisor control plane VM:

    crictl images list

    The KB article "vSphere Supervisor Disk Space Clean Up Scripts" provides scripts for cleaning up unused images and replicasets to further help with disk space.
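
Before running any clean-up script, it can be useful to see which replicasets look unused. The sketch below is read-only and rests on an assumption this article does not state: that a stale replicaset is one whose DESIRED, CURRENT and READY counts are all 0. The stale_rs helper and the sample rows are hypothetical. Note also that kubectl get replicasets -A | wc -l counts the header row, so adding --no-headers gives the exact count.

```shell
#!/bin/sh
# stale_rs (hypothetical helper): read `kubectl get replicasets -A --no-headers`
# rows on stdin (NAMESPACE NAME DESIRED CURRENT READY AGE) and print
# namespace/name for rows where DESIRED, CURRENT and READY are all 0.
stale_rs() {
    awk '$3 == 0 && $4 == 0 && $5 == 0 { print $1 "/" $2 }'
}

# Made-up sample rows standing in for real kubectl output.
sample='kube-system coredns-5d78c9869d 0 0 0 200d
kube-system coredns-7f9c69c78c 3 3 3 10d
vmware-system-example example-6b7f4d 0 0 0 90d'

total="$(printf '%s\n' "$sample" | wc -l | tr -d ' ')"
stale="$(printf '%s\n' "$sample" | stale_rs)"

echo "total replicasets: ${total}"
echo "scaled to zero:"
printf '%s\n' "$stale"
```

Against a live cluster the same filter would be used as: kubectl get replicasets -A --no-headers | stale_rs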

  • ETCD

    Confirm ETCD status and database size. ETCD is considered full at 2GB:

    • etcdctl member list -w table
    • etcdctl --cluster=true endpoint health -w table
    • etcdctl --cluster=true endpoint status -w table
    • ls -ltrh /var/lib/etcd/member/snap

    If the above steps show that ETCD is 2GB or larger, or if etcdctl returns no output or reports at least one control plane VM as unhealthy, reach out to VMware by Broadcom Technical Support for assistance, referencing this KB article.
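
The 2GB check can also be approximated from the size of the database file when etcdctl is not responding. This is a sketch under stated assumptions: db_over_limit is a hypothetical helper, 2GB is taken to mean 2 GiB (2147483648 bytes), and the database file is assumed to be named db inside the snap directory listed above, as in a standard etcd data layout. The file size on disk is only an approximation of what etcdctl endpoint status reports.

```shell
#!/bin/sh
# db_over_limit FILE [LIMIT_BYTES] (hypothetical helper): succeed if FILE is
# at or above the limit. The default limit assumes 2GB means 2 GiB.
db_over_limit() {
    limit="${2:-2147483648}"
    size="$(wc -c < "$1" | tr -d ' ')"
    [ "$size" -ge "$limit" ]
}

# Demonstrate with a 10-byte throwaway file and explicit small limits.
demo="$(mktemp)"
printf '0123456789' > "$demo"

if db_over_limit "$demo" 5; then small_limit=over; else small_limit=under; fi
if db_over_limit "$demo" 100; then large_limit=over; else large_limit=under; fi
echo "limit 5 bytes: ${small_limit}; limit 100 bytes: ${large_limit}"

rm -f "$demo"
```

On a control plane VM the call would be db_over_limit /var/lib/etcd/member/snap/db, which reads the file size only and does not modify ETCD.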


Supervisor cluster upgrades can also reduce root disk space usage. If the disk fills back up afterwards, reach out to VMware by Broadcom Technical Support for assistance, referencing this KB article.

Additional Information

Log rotation and disk space improvements will be available in vSphere 9, vSphere 8.0 U3g and higher versions.