High Filesystem Utilization on TKG Worker Node due to stale backup processes

Article ID: 422705

Updated On:

Products

VMware Telco Cloud Automation

Issue/Introduction

  • TKG worker nodes report high filesystem utilization alerts

  • Kubernetes garbage collection fails to reclaim space, and the following error messages are observed in the kubelet logs:
    "Disk usage on image filesystem is over the high threshold, trying to free bytes down to the low threshold"
    "Image garbage collection failed once. Stats initialization may not have completed yet"
    "failed to garbage collect required amount of images"
  • Restarting the containerd and kubelet services does not reduce disk usage

Environment

TCA 3.2
TKG 2.5.2

Cause

The root cause is a stale or hung backup process (external to VMware components) that retains open file handles on deleted temporary files.
Although the backup application has technically "deleted" the files (often located in /tmp/cbur/), the OS cannot reclaim the disk space because the process holding them open is still running or hung. On Linux, the blocks of an unlinked file are released only when the last file descriptor referencing it is closed.
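This behavior can be reproduced safely outside of any backup software. The sketch below (illustrative only; the file name and sizes are arbitrary, not from this KB) creates a file, keeps it open with a background process, deletes it, and shows that the open descriptor still appears as "(deleted)" under /proc:

```shell
# Minimal demonstration: disk space held by a deleted-but-still-open file.
# The temp file name and 10 MiB size are arbitrary choices for illustration.
tmpfile=$(mktemp /tmp/stale_demo.XXXXXX)
dd if=/dev/zero of="$tmpfile" bs=1M count=10 status=none   # allocate some space
tail -f "$tmpfile" >/dev/null 2>&1 &                       # holder process keeps an open fd
holder=$!
sleep 1
rm "$tmpfile"                                              # unlink the file; blocks stay allocated
deleted=$(find /proc/$holder/fd -ls 2>/dev/null | grep '(deleted)')
echo "$deleted"
kill "$holder"                                             # closing the last fd frees the space
```

Until the holder process exits, `df` continues to count the file's blocks as used even though `ls` no longer shows the file.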

Resolution

Engage the backup solution vendor to investigate why the backup jobs are failing or hanging on the worker nodes.

Additional Information

To resolve the immediate disk space issue, the stale process must be terminated.

  1. Log in to the affected worker node as root.

  2. Run the following command to identify deleted files that are still being held open by a process:

    find /proc/*/fd -ls 2>/dev/null | grep "(deleted)"

    Look for output similar to: /tmp/cbur/20XX11272XXXXX_LOCAL_xxxx_admincli_volume.tar.gz (deleted)

  3. Note the Process ID (PID), which appears as the numeric component of the /proc/<PID>/fd path in the output above.

  4. Terminate the stale process to release the file handles:

    kill -9 <PID>

  5. Verify that filesystem utilization has dropped to expected levels:

    df -h
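
Steps 2 and 3 above can be sketched as a small helper that prints the PIDs holding deleted files under a given directory. This is not part of the KB procedure, just a convenience wrapper around the same `find` command; the function name is hypothetical, and the output should be reviewed before terminating any process:

```shell
# Hypothetical helper (not from the KB): print unique PIDs whose open file
# descriptors point at deleted files under the directory given as $1,
# e.g. find_deleted_holders /tmp/cbur
find_deleted_holders() {
  dir="$1"
  find /proc/[0-9]*/fd -ls 2>/dev/null \
    | grep '(deleted)' \
    | grep "$dir" \
    | sed -n 's|.*/proc/\([0-9]*\)/fd/.*|\1|p' \
    | sort -u
}
```

Each PID printed can then be inspected (for example via `ps -fp <PID>`) and, if confirmed to be the stale backup process, terminated as in step 4.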