Core Images Deleted by Garbage Collector Are Not Reloaded in TKGI Air-Gapped Environment



Article ID: 380917


Updated On:

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

During disk pressure events on a worker node (when system disk usage exceeds 85%), the Kubelet image garbage collector activates and removes images from the file system until usage drops back below the low threshold (80% by default). In some cases this can delete core images, such as CoreDNS and the Metrics Server. If either of these pods is later scheduled onto that worker node, it may fail to start because the image is missing (ImagePullBackOff).
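Affected pods can usually be spotted from their status. A quick check, assuming kubectl access to the cluster (the node name below is a placeholder):

# List pods on the affected worker that are failing to pull their images
kubectl get pods --all-namespaces --field-selector spec.nodeName=<worker-node-name> | grep -E 'ImagePullBackOff|ErrImagePull'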

Each worker node includes a job called disk-pressure-watch, located at /var/vcap/jobs/disk-pressure-watch, which is responsible for detecting disk pressure and reloading core images onto the node. The script monitors the node's disk pressure status and, when the DiskPressure condition becomes True, reloads the core images.
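The condition the script waits for can also be read directly from the node object. A minimal check, assuming kubectl access (the node name is a placeholder):

# Print the DiskPressure condition status reported by the node (True/False)
kubectl get node <worker-node-name> -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}'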

You will notice that disk-pressure-watch (/var/vcap/sys/log/disk-pressure-watch/disk-pressure-watch.stdout.log) never reports disk pressure:

Sleeping until DiskPressure condition occurs on XXX.XXX.XXX.XXX

Even though the Kubelet garbage collector (/var/vcap/sys/log/kubelet/kubelet.stderr.log) has already run:

I1014 10:30:08.095150    8836 image_gc_manager.go:323] "Disk usage on image filesystem is over the high threshold, trying to free bytes down to the low threshold" usage=85 highThreshold=85 amountToFree=4356579328 lowThreshold=80
I1014 10:30:08.110869    8836 image_gc_manager.go:400] "Removing image to free bytes" imageID="sha256:a96c6437238723728dhajs8dd082b360d8e054d13c44df622d9197df63efea8e" size=296235462
I1014 10:30:12.855969    8836 image_gc_manager.go:400] "Removing image to free bytes" imageID="sha256:3f51440822fd2ab948ff047650c955a0ejfdh7ydsjh28idha92j960807558270" size=155378306
I1014 10:30:12.930358    8836 image_gc_manager.go:400] "Removing image to free bytes" imageID="sha256:c0a9e718128026a3859f940d36a40e87w8hfcwc9hccw9w9dwjcchw979w8ew151" size=208980302
I1014 10:30:13.038632    8836 image_gc_manager.go:400] "Removing image to free bytes" imageID="sha256:9a3c49cecfc89a09cd8d613adusdhdjhsdw8d78sdh3ie09dwdw208e2f1b91a81" size=145797095
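To confirm whether the core images were actually removed, the images still present on the worker can be listed from the node itself. A sketch, assuming shell access to the worker VM and a crictl binary configured for the node's container runtime:

# List the core images still present on the worker; no output indicates the
# garbage collector has removed them
crictl images | grep -E 'coredns|metrics-server'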


Cause

Kubelet runs two components that monitor disk usage; both use a threshold of 85% usage by default:

  • The image garbage collector manager calculates the exact disk usage rate (used bytes / total bytes) and starts removing images once it exceeds the high threshold.
  • The eviction manager watches the imagefs.available signal (15% available by default, i.e. 85% used) and reports DiskPressure only when available disk space drops below that 15%.

It is therefore possible for the image garbage collector to be triggered and start removing old images while the eviction manager never raises the DiskPressure signal. As a result, disk-pressure-watch cannot detect a DiskPressure=True event and does not reload the deleted system component images.
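For reference, the two mechanisms map to the following kubelet settings (upstream defaults shown; values on a given node may differ):

# Image garbage collector thresholds (percentage of disk used)
--image-gc-high-threshold=85    # start deleting images above this usage
--image-gc-low-threshold=80     # keep deleting until usage falls below this
# Eviction manager signal (percentage of disk still available)
--eviction-hard=imagefs.available<15%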

Resolution

As a quick workaround, you can reload the missing images by running /var/vcap/jobs/load-images/bin/post_start on the affected worker VM.
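For example, the script can be run from a host with BOSH CLI access; the deployment and instance names below are placeholders for your environment:

# Run the image reload script on the affected worker
bosh -d service-instance_<cluster-uuid> ssh worker/<index> -c 'sudo /var/vcap/jobs/load-images/bin/post_start'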

To resolve this permanently, update the plan associated with the cluster to include the imagefs.available flag in the Kubelet Customization settings. Follow these steps:

  1. Open the Ops Manager dashboard and select the TKGI tile.
  2. Locate the plan you want to update.
  3. Under "Kubelet Customization - eviction hard", add the "imagefs.available=XX%" flag, with XX larger than 15 (for example 20), so that the eviction manager is triggered before the image garbage collector.
  4. Review the pending changes and apply them to the TKGI tile. For this configuration to take effect, you also need to upgrade the clusters where you want the change applied. You can do this in either of two ways:
     
    1. Select the Upgrade All Clusters errand when applying the changes in Ops Manager.
    2. Manually run "tkgi upgrade-cluster <cluster-name>" for each cluster after applying the configuration, without using the Upgrade All Clusters errand, as shown in the example below.
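For example, to upgrade and then verify a single cluster (the cluster name is a placeholder):

# Upgrade one cluster so the new eviction setting is rolled out to its worker nodes
tkgi upgrade-cluster <cluster-name>
# Check the cluster status after the upgrade completes
tkgi cluster <cluster-name>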
