Logs for the application show errors like: "Caused by: java.io.IOException: No space left on device"
The workload <example-workload> showed 14 of its 15 pods running, with a single pod failing repeatedly (see screenshot below). The failing pod was scheduled to the same node each time, and logging indicates that it continued to fail on that node until the deployment was restarted, at which point all pods came up successfully.
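To confirm which node the failing pod keeps landing on, the pod-to-node mapping and the pod's events can be checked with kubectl. This is a minimal sketch; the namespace, label selector, and pod name are placeholders for the actual workload.

    # List the workload's pods along with the node each one is scheduled on
    kubectl get pods -n <namespace> -l app=<example-workload> -o wide

    # Inspect the failing pod's events (e.g. image pull or disk-related errors)
    kubectl describe pod <failing-pod-name> -n <namespace>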
Tanzu Kubernetes Grid Integrated Edition (TKGI) v1.20
TKGI places the imagefs and containerfs in the same location, /var/vcap/store, as described in the Kubernetes documentation. It is important to note that the imagefs and containerfs are NOT the same as Persistent Volumes, which store a pod's persistent data. The imagefs and containerfs are not meant to hold persistent data and exist separately on each node.
Upon connecting to the worker node identified in the error above, we checked the consumed disk space on /var/vcap/store, which reported 84% used.
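For illustration, the check above can be reproduced by connecting to the affected worker VM with the BOSH CLI and inspecting /var/vcap/store; the deployment and instance names below are placeholders, assuming standard BOSH-managed TKGI workers.

    # SSH to the affected worker VM (deployment and instance names are environment-specific)
    bosh -d <service-instance-deployment> ssh worker/<instance-id>

    # Check overall usage of the filesystem backing the imagefs/containerfs
    df -h /var/vcap/store

    # Break down which directories under /var/vcap/store consume the space
    sudo du -sh /var/vcap/store/* | sort -h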
At 85% usage the node's hard eviction threshold will be met, so the kubelet will not allow a new image to be created on the imagefs (on /var/vcap/store) if doing so would push the node past that threshold. See the Kubernetes documentation on node-pressure eviction for the configured thresholds.
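The disk pressure can also be confirmed from the Kubernetes side. The sketch below checks the node's conditions and any eviction-related events; the node name is a placeholder. The 85% figure matches the default hard eviction signal imagefs.available<15%, unless the kubelet thresholds have been overridden.

    # Check whether the node is reporting DiskPressure
    kubectl describe node <worker-node-name> | grep -A 10 "Conditions:"

    # Look for disk- or eviction-related events recorded against the node
    kubectl get events --all-namespaces --field-selector involvedObject.name=<worker-node-name>

    # Default kubelet hard eviction signals, for reference:
    #   nodefs.available  < 10%
    #   imagefs.available < 15%   (i.e. roughly 85% used)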
Review the app workload to determine whether the files being stored are expected, and whether there is any way to garbage collect them from within the container to manage disk space consumption.
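One way to approach this review, sketched below, is to inspect disk consumption from inside the running container and, if the data is disposable, cap the pod's scratch space with an ephemeral-storage limit so the kubelet evicts the pod when it exceeds the limit rather than letting it fill the node. The paths, names, and the 2Gi value are illustrative assumptions.

    # Inspect what the container is writing to its writable layer / emptyDir volumes
    kubectl exec -n <namespace> <failing-pod-name> -- df -h
    kubectl exec -n <namespace> <failing-pod-name> -- du -sh /tmp /var/log 2>/dev/null

    # Optionally cap the container's ephemeral storage (illustrative value)
    kubectl -n <namespace> patch deployment <example-workload> --type=strategic -p '
    {"spec":{"template":{"spec":{"containers":[
      {"name":"<container-name>","resources":{"limits":{"ephemeral-storage":"2Gi"}}}
    ]}}}}'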
Alternatively, it may be possible to taint the node so that other pods cannot be scheduled on it, and add a matching toleration to the specific app workload so that only that workload runs on the node in question. This would prevent other pods from hitting image pull failures on that node. Adjusting the app workload is preferable to the taint-and-toleration approach, as taints and tolerations require a special node configuration that must be maintained.
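If the taint-and-toleration route is pursued anyway, a minimal sketch is shown below; the taint key/value, node name, namespace, and deployment name are placeholders, and a nodeSelector or node affinity would also be needed if the workload must be pinned to that specific node.

    # Taint the node so pods without a matching toleration are not scheduled there
    kubectl taint nodes <worker-node-name> dedicated=<example-workload>:NoSchedule

    # Add a matching toleration to the app workload's pod template
    kubectl -n <namespace> patch deployment <example-workload> --type=strategic -p '
    {"spec":{"template":{"spec":{"tolerations":[
      {"key":"dedicated","operator":"Equal","value":"<example-workload>","effect":"NoSchedule"}
    ]}}}}'

    # To remove the taint later:
    kubectl taint nodes <worker-node-name> dedicated=<example-workload>:NoSchedule-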