"unresponsive agent" reported on TKGI VMs due to /dev/sda1 root partition reaching 100% consumption

Article ID: 367884


Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

  • When viewing BOSH VMs with the bosh -d service-instance_<INSTANCE_ID> vms command, some VMs report the "unresponsive agent" state.
  • Running kubectl get nodes against the TKGI cluster shows the associated node in the "Ready,SchedulingDisabled" state.
  • After SSHing into the problem node with bosh -d service-instance_<INSTANCE_ID> ssh worker/<WORKER_ID>, df -h shows the /dev/sda1 filesystem mounted on / at more than 63% usage.
    • Nodes climbing past 85% usage may become degraded.
    • Severely degraded nodes may not be reachable over SSH at all.
  • Problem nodes contain pods configured to use /var, or custom directories on /, for logging or other operations.
  • Rebooting the nodes corrects the condition temporarily.
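As a quick triage sketch for a node that is still reachable, the usage figure reported by df can be parsed and compared against the 85% degradation point noted above (the helper name and threshold handling are illustrative, not from this article):

```shell
#!/bin/sh
# Illustrative helper (not from the KB): read `df -P /` output on stdin and
# print the Use% value for the root filesystem as a bare number.
# With -P, df prints one portable-format line per filesystem; field 5 is Use%.
root_usage_pct() {
  awk 'NR==2 { gsub(/%/, "", $5); print $5 }'
}

# Example: warn when / crosses the 85% degradation point described above.
usage=$(df -P / | root_usage_pct)
if [ "$usage" -ge 85 ]; then
  echo "WARNING: / is ${usage}% used"
fi
```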

Cause

This condition has been observed in environments where deployments or pods are configured with mounts pointed at the /var directory, or at the / root directory itself, for logging or other operations. TKGI cluster node VMs are built with three attached disks (sda, sdb, and sdc), each carrying a single partition. The partitions are mounted as follows:

NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sda      8:0    0    5G  0 disk
└─sda1   8:1    0    5G  0 part /home
                                /
sdb      8:16   0   50G  0 disk
└─sdb1   8:17   0   50G  0 part /var/vcap/data        ---------> This directory will contain different folders/files depending on node customizations
                                /var/tmp
                                /tmp
                                /opt
                                /var/opt
                                /var/log
sdc      8:32   0  100G  0 disk
└─sdc1   8:33   0  100G  0 part /var/vcap/store


If pods require hostPath or emptyDir mounts for logging or other persistent operations, these should be placed under the /var/vcap/store directory, which is backed by the large persistent disk.
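Before pointing a hostPath at a directory, df can confirm which device actually backs it (the helper name below is illustrative, not from this article):

```shell
#!/bin/sh
# Illustrative helper (not from the KB): print the block device backing a path,
# so a planned hostPath can be verified to land on the persistent disk (sdc1)
# rather than the small root partition (sda1).
# With -P, df prints one portable-format line for the filesystem; field 1 is
# the device name.
mount_source() {
  df -P "$1" | awk 'NR==2 { print $1 }'
}

# Example usage on a TKGI worker node:
#   mount_source /var/vcap/store    (expected: /dev/sdc1)
#   mount_source /apps/data         (expected: /dev/sda1, the root partition)
```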

Examples of pod mounts that might cause this condition are:

volumeMounts:
- mountPath: /var/lib/example
  name: log-folder

volumes:
- hostPath:
    path: /apps/data/example        # lands on the 5G root partition (/dev/sda1)
    type: ""
  name: log-folder


Example of the same pod mounts in the correct location:

volumeMounts:
- mountPath: /var/lib/example
  name: log-folder
volumes:
- hostPath:
    path: /var/vcap/store/apps/data/example    # lands on the 100G persistent disk (/dev/sdc1)
    type: ""
  name: log-folder
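Put together, a complete pod manifest using the corrected hostPath might look like the following sketch (the pod name, image, and command are hypothetical placeholders, not from this article):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-logger              # hypothetical name
spec:
  containers:
  - name: app
    image: busybox                  # placeholder image
    command: ["sh", "-c", "while true; do date >> /var/lib/example/app.log; sleep 60; done"]
    volumeMounts:
    - mountPath: /var/lib/example
      name: log-folder
  volumes:
  - name: log-folder
    hostPath:
      path: /var/vcap/store/apps/data/example   # on the persistent disk, not /
      type: ""
```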

Resolution

The resolution is application-specific. Deployments or pods that store data on the root (/) or /var partition must be modified to point at /var/vcap/store for persistent data instead.
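To identify which workloads need this change, hostPath volumes can be audited from kubectl output. The helper below is a sketch (its name is illustrative, not from this article) that assumes kubectl access to the cluster and that jq is installed:

```shell
#!/bin/sh
# Illustrative helper (not from the KB): reads `kubectl get pods -A -o json`
# on stdin and prints pods whose hostPath volumes fall outside /var/vcap/store.
offending_hostpaths() {
  jq -r '.items[]
    | { ns: .metadata.namespace, pod: .metadata.name,
        paths: [ .spec.volumes[]? | .hostPath.path // empty ] }
    | select(any(.paths[]; startswith("/var/vcap/store") | not))
    | "\(.ns)/\(.pod): \(.paths | join(", "))"'
}

# Usage: kubectl get pods -A -o json | offending_hostpaths
```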

If application changes cannot be made immediately, work around the failure by rebooting the problem nodes until the applications can be permanently updated.