Logs for the application show errors like: "Caused by: java.io.IOException: No space left on device"
The workload <example-workload> showed 14 of its 15 pods running, with a single pod failing repeatedly (see screenshot below). The failing pod was scheduled to the same node each time, and logging indicates that it continued to fail on that node until the deployment was restarted, at which point all pods came up successfully.
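To confirm which node the failing pod keeps landing on, the pod-to-node mapping and the pod's events can be checked with kubectl. This is a minimal sketch; the namespace, label selector, and pod name are placeholders for the actual workload.

    # List the workload's pods along with the node each one is scheduled on
    kubectl get pods -n <namespace> -l app=<example-workload> -o wide

    # Inspect the failing pod's events (e.g. image pull or disk-related errors)
    kubectl describe pod <failing-pod-name> -n <namespace>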
Tanzu Kubernetes Grid Integrated Edition (TKGI) v1.20
TKGI places the imagefs and containerfs in the same location, /var/vcap/store, as described in the Kubernetes documentation. It is important to note that the imagefs and containerfs are NOT the same as Persistent Volumes, which store a pod's persistent data. The imagefs and containerfs are not meant to hold persistent data and exist separately on each node.
Upon connecting to the worker node identified in the error above, we checked the consumed disk space on /var/vcap/store, which reported 84% used.
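For illustration, the check above can be reproduced by connecting to the affected worker VM with the BOSH CLI and inspecting /var/vcap/store; the deployment and instance names below are placeholders, assuming standard BOSH-managed TKGI workers.

    # SSH to the affected worker VM (deployment and instance names are environment-specific)
    bosh -d <service-instance-deployment> ssh worker/<instance-id>

    # Check overall usage of the filesystem backing the imagefs/containerfs
    df -h /var/vcap/store

    # Break down which directories under /var/vcap/store consume the space
    sudo du -sh /var/vcap/store/* | sort -h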
At 85% usage the node's hard eviction threshold will be met, so the kubelet will not allow a new image to be created on the imagefs (on /var/vcap/store) if doing so would push the node past that threshold. See the Kubernetes documentation on node-pressure eviction for the configured thresholds.
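The disk pressure can also be confirmed from the Kubernetes side. The sketch below checks the node's conditions and any eviction-related events; the node name is a placeholder. The 85% figure matches the default hard eviction signal imagefs.available<15%, unless the kubelet thresholds have been overridden.

    # Check whether the node is reporting DiskPressure
    kubectl describe node <worker-node-name> | grep -A 10 "Conditions:"

    # Look for disk- or eviction-related events recorded against the node
    kubectl get events --all-namespaces --field-selector involvedObject.name=<worker-node-name>

    # Default kubelet hard eviction signals, for reference:
    #   nodefs.available  < 10%
    #   imagefs.available < 15%   (i.e. roughly 85% used)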
Review the app workload to determine whether the files being stored are expected, and whether there is any way to garbage collect them from within the container to manage disk space consumption.
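One way to approach this review, sketched below, is to inspect disk consumption from inside the running container and, if the data is disposable, cap the pod's scratch space with an ephemeral-storage limit so the kubelet evicts the pod when it exceeds the limit rather than letting it fill the node. The paths, names, and the 2Gi value are illustrative assumptions.

    # Inspect what the container is writing to its writable layer / emptyDir volumes
    kubectl exec -n <namespace> <failing-pod-name> -- df -h
    kubectl exec -n <namespace> <failing-pod-name> -- du -sh /tmp /var/log 2>/dev/null

    # Optionally cap the container's ephemeral storage (illustrative value)
    kubectl -n <namespace> patch deployment <example-workload> --type=strategic -p '
    {"spec":{"template":{"spec":{"containers":[
      {"name":"<container-name>","resources":{"limits":{"ephemeral-storage":"2Gi"}}}
    ]}}}}'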
Alternatively, it may be possible to taint the node so that other pods cannot be scheduled on it, and add a matching toleration to the specific app workload so that only that workload runs on the node in question. This would prevent other pods from hitting image pull failures on that node. Adjusting the app workload is preferable to the taint-and-toleration approach, as taints and tolerations require a special node configuration that must be maintained.
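If the taint-and-toleration route is pursued anyway, a minimal sketch is shown below; the taint key/value, node name, namespace, and deployment name are placeholders, and a nodeSelector or node affinity would also be needed if the workload must be pinned to that specific node.

    # Taint the node so pods without a matching toleration are not scheduled there
    kubectl taint nodes <worker-node-name> dedicated=<example-workload>:NoSchedule

    # Add a matching toleration to the app workload's pod template
    kubectl -n <namespace> patch deployment <example-workload> --type=strategic -p '
    {"spec":{"template":{"spec":{"tolerations":[
      {"key":"dedicated","operator":"Equal","value":"<example-workload>","effect":"NoSchedule"}
    ]}}}}'

    # To remove the taint later:
    kubectl taint nodes <worker-node-name> dedicated=<example-workload>:NoSchedule-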