During disk pressure events on a worker node (when system disk usage exceeds 85%), the Kubelet garbage collector activates and removes images from the file system until disk pressure subsides. In some cases, this may result in core images, such as CoreDNS and the Metrics Server, being deleted. If either of these pods is later scheduled to this worker node, it may fail to start because its image is missing (ImagePullBackOff).
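The threshold behavior described above can be sketched roughly as follows. This is an illustrative approximation, not the Kubelet's actual code: the function name and parameters are hypothetical, and the 80% low threshold is the Kubelet's default for image garbage collection rather than a value stated in the paragraph above.

```python
def bytes_to_free(capacity, available, high_pct=85, low_pct=80):
    """Estimate how many bytes image GC would try to free.

    capacity and available are in bytes for the image filesystem.
    high_pct and low_pct are assumed defaults (85 and 80).
    """
    # Usage as an integer percentage of the filesystem capacity.
    usage_pct = 100 - (available * 100) // capacity
    if usage_pct < high_pct:
        # Below the high threshold: no image garbage collection.
        return 0
    # Free enough to bring usage back down to the low threshold.
    return capacity * (100 - low_pct) // 100 - available
```

For example, on a 100 GB image filesystem with 15 GB available, usage is 85%, so the sketch returns 5 GB to free in order to get back to 80% usage.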
Each worker node includes a job called disk-pressure-watch, located at /var/vcap/jobs/disk-pressure-watch, which is responsible for detecting disk pressure and reloading core images onto the node. The job monitors the disk pressure status on the worker node and, when disk pressure is detected (status is True), triggers the script that reloads the images.
You will notice that the disk-pressure-watch log (/var/vcap/sys/log/disk-pressure-watch/disk-pressure-watch.stdout.log) never reports pressure:
Sleeping until DiskPressure condition occurs on XXX.XXX.XXX.XXX
This happens even though the Kubelet garbage collector has run, as shown in the Kubelet log (/var/vcap/sys/log/kubelet/kubelet.stderr.log):
I1014 10:30:08.095150 8836 image_gc_manager.go:323] "Disk usage on image filesystem is over the high threshold, trying to free bytes down to the low threshold" usage=85 highThreshold=85 amountToFree=4356579328 lowThreshold=80
I1014 10:30:08.110869 8836 image_gc_manager.go:400] "Removing image to free bytes" imageID="sha256:a96c6437238723728dhajs8dd082b360d8e054d13c44df622d9197df63efea8e" size=296235462
I1014 10:30:12.855969 8836 image_gc_manager.go:400] "Removing image to free bytes" imageID="sha256:3f51440822fd2ab948ff047650c955a0ejfdh7ydsjh28idha92j960807558270" size=155378306
I1014 10:30:12.930358 8836 image_gc_manager.go:400] "Removing image to free bytes" imageID="sha256:c0a9e718128026a3859f940d36a40e87w8hfcwc9hccw9w9dwjcchw979w8ew151" size=208980302
I1014 10:30:13.038632 8836 image_gc_manager.go:400] "Removing image to free bytes" imageID="sha256:9a3c49cecfc89a09cd8d613adusdhdjhsdw8d78sdh3ie09dwdw208e2f1b91a81" size=145797095
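To see at a glance which images were removed and how much space was reclaimed, log lines like the ones above can be summarized with a small script. This is a minimal sketch: the function name is hypothetical, and the regular expressions assume the exact message wording shown in this log excerpt, which may differ in other Kubelet versions.

```python
import re

# Patterns follow the kubelet log lines quoted above (an assumption about format).
THRESHOLD_RE = re.compile(
    r"usage=(\d+) highThreshold=(\d+) amountToFree=(\d+) lowThreshold=(\d+)"
)
REMOVED_RE = re.compile(
    r'"Removing image to free bytes" imageID="([^"]+)" size=(\d+)'
)


def summarize_image_gc(lines):
    """Collect image GC threshold events and removed images from log lines."""
    summary = {"gc_runs": [], "removed": []}
    for line in lines:
        match = THRESHOLD_RE.search(line)
        if match:
            usage, high, to_free, low = map(int, match.groups())
            summary["gc_runs"].append(
                {"usage": usage, "high": high, "low": low, "amount_to_free": to_free}
            )
            continue
        match = REMOVED_RE.search(line)
        if match:
            # Store (image ID, size in bytes) for each removed image.
            summary["removed"].append((match.group(1), int(match.group(2))))
    # Total bytes reported as removed across all matched lines.
    summary["bytes_removed"] = sum(size for _, size in summary["removed"])
    return summary
```

The function accepts any iterable of lines, so an open file handle on kubelet.stderr.log can be passed directly.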
Without imagefs.available defined in the Kubelet configuration, disk pressure is never marked as True on the worker node, meaning the node's status doesn't reflect storage issues. However, the Kubelet's garbage collection still automatically clears up space when needed, mitigating storage issues even without a disk-pressure event being reported.
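One way to confirm what the node actually reports is to read the DiskPressure condition from the Node object, for example from the output of `kubectl get node <node-name> -o json`. The helper below is a minimal sketch with a hypothetical function name; it only parses JSON that has already been captured and does not contact the cluster.

```python
import json


def disk_pressure_status(node_json):
    """Return the DiskPressure condition status ("True", "False", "Unknown")
    from a Node object serialized as JSON, or None if it is absent."""
    node = json.loads(node_json)
    for condition in node.get("status", {}).get("conditions", []):
        if condition.get("type") == "DiskPressure":
            return condition.get("status")
    return None
```

In the scenario described here, this would be expected to return "False" even while the Kubelet log shows images being removed.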
To resolve this issue, update the plan associated with the cluster to include the imagefs.available configuration flag in Kubelet Customization. Follow these steps:
1. In the plan's Kubelet Customization settings, add the "imagefs.available=15%" flag.
2. Run "tkgi upgrade-cluster service_instance_xxx" for each cluster after applying the configuration, without using the Upgrade All Clusters errand.
References: