Core Images Deleted by Garbage Collector Are Not Reloaded in TKGI Air-Gapped Environment

Article ID: 380917

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

During disk pressure events on a worker node (when system disk usage exceeds 85%), the Kubelet garbage collector activates and removes images from the file system until disk usage drops back below the threshold. In some cases, this results in core images, such as CoreDNS and the Metrics Server, being deleted. If a pod that uses one of these images is later scheduled to this worker node, it may fail to start because the image is missing (ImagePullBackOff).

Each worker node includes a job called disk-pressure-watch, located at /var/vcap/jobs/disk-pressure-watch, which is responsible for detecting disk pressure and reloading core images onto the node. The job monitors the disk pressure status of the worker node and, when disk pressure is detected (status is True), triggers the reload of the core images.

In this scenario, the disk-pressure-watch log (/var/vcap/sys/log/disk-pressure-watch/disk-pressure-watch.stdout.log) never reports disk pressure:

Sleeping until DiskPressure condition occurs on XXX.XXX.XXX.XXX

This happens even though the Kubelet log (/var/vcap/sys/log/kubelet/kubelet.stderr.log) shows that the garbage collector has run:

I1014 10:30:08.095150    8836 image_gc_manager.go:323] "Disk usage on image filesystem is over the high threshold, trying to free bytes down to the low threshold" usage=85 highThreshold=85 amountToFree=4356579328 lowThreshold=80
I1014 10:30:08.110869    8836 image_gc_manager.go:400] "Removing image to free bytes" imageID="sha256:a96c6437238723728dhajs8dd082b360d8e054d13c44df622d9197df63efea8e" size=296235462
I1014 10:30:12.855969    8836 image_gc_manager.go:400] "Removing image to free bytes" imageID="sha256:3f51440822fd2ab948ff047650c955a0ejfdh7ydsjh28idha92j960807558270" size=155378306
I1014 10:30:12.930358    8836 image_gc_manager.go:400] "Removing image to free bytes" imageID="sha256:c0a9e718128026a3859f940d36a40e87w8hfcwc9hccw9w9dwjcchw979w8ew151" size=208980302
I1014 10:30:13.038632    8836 image_gc_manager.go:400] "Removing image to free bytes" imageID="sha256:9a3c49cecfc89a09cd8d613adusdhdjhsdw8d78sdh3ie09dwdw208e2f1b91a81" size=145797095
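To confirm how much the garbage collector removed, the "Removing image to free bytes" entries can be extracted from the Kubelet log. The following is a minimal Python sketch, not part of TKGI: the function name removed_images and the SAMPLE_LOG variable are illustrative, and the regular expression assumes the log format shown in the excerpt above.

```python
import re

# Sample lines copied from the kubelet.stderr.log excerpt in this article.
# On a worker node, the text would instead be read from
# /var/vcap/sys/log/kubelet/kubelet.stderr.log.
SAMPLE_LOG = '''
I1014 10:30:08.095150    8836 image_gc_manager.go:323] "Disk usage on image filesystem is over the high threshold, trying to free bytes down to the low threshold" usage=85 highThreshold=85 amountToFree=4356579328 lowThreshold=80
I1014 10:30:08.110869    8836 image_gc_manager.go:400] "Removing image to free bytes" imageID="sha256:a96c6437238723728dhajs8dd082b360d8e054d13c44df622d9197df63efea8e" size=296235462
I1014 10:30:12.855969    8836 image_gc_manager.go:400] "Removing image to free bytes" imageID="sha256:3f51440822fd2ab948ff047650c955a0ejfdh7ydsjh28idha92j960807558270" size=155378306
I1014 10:30:12.930358    8836 image_gc_manager.go:400] "Removing image to free bytes" imageID="sha256:c0a9e718128026a3859f940d36a40e87w8hfcwc9hccw9w9dwjcchw979w8ew151" size=208980302
I1014 10:30:13.038632    8836 image_gc_manager.go:400] "Removing image to free bytes" imageID="sha256:9a3c49cecfc89a09cd8d613adusdhdjhsdw8d78sdh3ie09dwdw208e2f1b91a81" size=145797095
'''

# Matches only the image-removal entries (assumed format, taken from the excerpt).
REMOVED_IMAGE = re.compile(
    r'"Removing image to free bytes" imageID="(?P<image_id>[^"]+)" size=(?P<size>\d+)'
)


def removed_images(log_text):
    """Return a list of (image_id, size_in_bytes) for each removed image."""
    return [
        (match.group("image_id"), int(match.group("size")))
        for match in REMOVED_IMAGE.finditer(log_text)
    ]


if __name__ == "__main__":
    images = removed_images(SAMPLE_LOG)
    for image_id, size in images:
        print(f"{size:>12} bytes  {image_id}")
    total = sum(size for _, size in images)
    print(f"{len(images)} images removed, {total} bytes freed")
```

The image IDs in the list can then be compared against the images present on a healthy worker node to see whether core images such as CoreDNS or the Metrics Server were among those deleted.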

 

 

Cause

When imagefs.available is not defined in the Kubelet configuration, disk pressure is never marked as True on the worker node, so the node's status does not reflect the storage problem. The Kubelet garbage collector runs independently of that status: it still removes images once disk usage reaches its high threshold, which frees space without a disk pressure event ever being reported. Because no disk pressure is reported, disk-pressure-watch is never triggered and the deleted core images are not reloaded.
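The mismatch between the two mechanisms can be illustrated with a simplified model. This Python sketch is not the Kubelet's code: the function names are hypothetical, it considers only the image filesystem, and it assumes that garbage collection starts when usage reaches the high threshold (85 in the log above) and that disk pressure is reported only when an imagefs.available threshold is configured and the available percentage falls below it.

```python
from typing import Optional


def image_gc_triggers(usage_percent: int, high_threshold: int = 85) -> bool:
    """Simplified model: image GC starts once usage reaches the high threshold."""
    return usage_percent >= high_threshold


def disk_pressure_reported(
    usage_percent: int, imagefs_available_threshold: Optional[int]
) -> bool:
    """Simplified model: disk pressure is reported only when an
    imagefs.available threshold exists and free space drops below it."""
    if imagefs_available_threshold is None:
        # No imagefs.available configured: this signal never fires.
        return False
    available_percent = 100 - usage_percent
    return available_percent < imagefs_available_threshold


if __name__ == "__main__":
    usage = 90  # hypothetical image filesystem usage in percent
    for threshold in (None, 15):
        print(
            f"imagefs.available={threshold}: "
            f"gc={image_gc_triggers(usage)}, "
            f"disk_pressure={disk_pressure_reported(usage, threshold)}"
        )
```

In this model, at 90% usage garbage collection runs in both cases, but disk pressure is reported only when the 15% threshold is configured, which is the condition disk-pressure-watch waits for.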

Resolution

To resolve this issue, update the plan associated with the cluster so that the Kubelet Customization includes the imagefs.available configuration flag. Follow these steps:

  1. Open the Ops Manager dashboard and select the TKGI tile.
  2. Locate the plan you would like to update.
  3. Under "Kubelet Customization - eviction hard", add the "imagefs.available=15%" flag.
  4. Review the pending changes and apply them to the TKGI tile. To activate this configuration, you also need to upgrade the clusters where you would like the changes to take effect. You can do this in either of two ways:
    1. Select the Upgrade All Clusters errand when applying the changes in Ops Manager.
    2. Manually run "tkgi upgrade-cluster service_instance_xxx" for each cluster after applying the configuration, without using the Upgrade All Clusters errand.
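
After the clusters have been upgraded, the disk pressure status of a worker node can be checked from the node object's status conditions. The following Python sketch is illustrative only: the function name disk_pressure_status and the SAMPLE_NODE data are hypothetical, and it assumes the node JSON has been obtained with "kubectl get node <node-name> -o json".

```python
import json


def disk_pressure_status(node: dict) -> str:
    """Return the status of the DiskPressure condition ("True", "False",
    or "Unknown") from a node object as returned by kubectl in JSON form."""
    for condition in node.get("status", {}).get("conditions", []):
        if condition.get("type") == "DiskPressure":
            return condition.get("status", "Unknown")
    return "Unknown"


# Hypothetical, heavily trimmed node object used only for illustration.
SAMPLE_NODE = json.loads("""
{
  "kind": "Node",
  "metadata": {"name": "example-worker"},
  "status": {
    "conditions": [
      {"type": "MemoryPressure", "status": "False"},
      {"type": "DiskPressure", "status": "True"},
      {"type": "Ready", "status": "True"}
    ]
  }
}
""")

if __name__ == "__main__":
    name = SAMPLE_NODE["metadata"]["name"]
    print(f"{name}: DiskPressure={disk_pressure_status(SAMPLE_NODE)}")
```

A status of "True" during a period of high disk usage indicates that the node now reports disk pressure, which is the condition disk-pressure-watch waits for before reloading the core images.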

References: 

Kubelet Node-pressure Eviction

TKGi plan configuration guide