"unresponsive agent" reported on TKGI VMs due to /dev/sda1 root partition reaching 100% consumption

Article ID: 367884


Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

  • When viewing BOSH VMs with the bosh -d service-instance_<INSTANCE_ID> vms command, some VMs report the "unresponsive agent" state.
  • Running kubectl get nodes against the TKGI cluster shows the associated node in the "Ready,SchedulingDisabled" state.
  • After SSHing into the problem node with bosh -d service-instance_<INSTANCE_ID> ssh worker/<WORKER_ID>, df -h shows the /dev/sda1 filesystem mounted on / at more than 63% usage.
    • Nodes climbing past 85% usage may become degraded.
    • Severely degraded nodes may not be reachable over SSH at all.
  • Problem nodes contain pods configured to use /var, or custom directories on /, for logging or other operations.
  • Rebooting the nodes corrects the condition temporarily.
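As a quick triage sketch for a node that is still reachable, the usage figure reported by df can be parsed and compared against the 85% degradation point noted above (the helper name and threshold handling are illustrative, not from this article):

```shell
#!/bin/sh
# Illustrative helper (not from the KB): read `df -P /` output on stdin and
# print the Use% value for the root filesystem as a bare number.
# With -P, df prints one portable-format line per filesystem; field 5 is Use%.
root_usage_pct() {
  awk 'NR==2 { gsub(/%/, "", $5); print $5 }'
}

# Example: warn when / crosses the 85% degradation point described above.
usage=$(df -P / | root_usage_pct)
if [ "$usage" -ge 85 ]; then
  echo "WARNING: / is ${usage}% used"
fi
```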

Cause

This condition has been observed in environments where deployments or pods are configured with mounts pointed at the /var directory, or at the / root directory itself, for logging or other operations. TKGI cluster node VMs are built with three attached disks (sda, sdb, and sdc), each carrying a single partition. The partitions are mounted as follows:

NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sda      8:0    0    5G  0 disk
└─sda1   8:1    0    5G  0 part /home
                                /
sdb      8:16   0   50G  0 disk
└─sdb1   8:17   0   50G  0 part /var/vcap/data        ---------> This directory will contain different folders/files depending on node customizations
                                /var/tmp
                                /tmp
                                /opt
                                /var/opt
                                /var/log
sdc      8:32   0  100G  0 disk
└─sdc1   8:33   0  100G  0 part /var/vcap/store


If pods require hostPath or emptyDir mounts for logging or other persistent operations, these should be placed under the /var/vcap/store directory, which is backed by the large persistent disk.
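Before pointing a hostPath at a directory, df can confirm which device actually backs it (the helper name below is illustrative, not from this article):

```shell
#!/bin/sh
# Illustrative helper (not from the KB): print the block device backing a path,
# so a planned hostPath can be verified to land on the persistent disk (sdc1)
# rather than the small root partition (sda1).
# With -P, df prints one portable-format line for the filesystem; field 1 is
# the device name.
mount_source() {
  df -P "$1" | awk 'NR==2 { print $1 }'
}

# Example usage on a TKGI worker node:
#   mount_source /var/vcap/store    (expected: /dev/sdc1)
#   mount_source /apps/data         (expected: /dev/sda1, the root partition)
```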

Examples of pod mounts that might cause this condition are:

volumeMounts:
- mountPath: /var/lib/example
  name: log-folder

volumes:
- hostPath:
    path: /apps/data/example        # lands on the 5G root partition (/dev/sda1)
    type: ""
  name: log-folder


Example of the same pod mounts in the correct location:

volumeMounts:
- mountPath: /var/lib/example
  name: log-folder
volumes:
- hostPath:
    path: /var/vcap/store/apps/data/example    # lands on the 100G persistent disk (/dev/sdc1)
    type: ""
  name: log-folder
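Put together, a complete pod manifest using the corrected hostPath might look like the following sketch (the pod name, image, and command are hypothetical placeholders, not from this article):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-logger              # hypothetical name
spec:
  containers:
  - name: app
    image: busybox                  # placeholder image
    command: ["sh", "-c", "while true; do date >> /var/lib/example/app.log; sleep 60; done"]
    volumeMounts:
    - mountPath: /var/lib/example
      name: log-folder
  volumes:
  - name: log-folder
    hostPath:
      path: /var/vcap/store/apps/data/example   # on the persistent disk, not /
      type: ""
```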

Resolution

The resolution is application-specific. Deployments or pods that store data on the root (/) or /var partition must be modified to point at /var/vcap/store for persistent data instead.
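To identify which workloads need this change, hostPath volumes can be audited from kubectl output. The helper below is a sketch (its name is illustrative, not from this article) that assumes kubectl access to the cluster and that jq is installed:

```shell
#!/bin/sh
# Illustrative helper (not from the KB): reads `kubectl get pods -A -o json`
# on stdin and prints pods whose hostPath volumes fall outside /var/vcap/store.
offending_hostpaths() {
  jq -r '.items[]
    | { ns: .metadata.namespace, pod: .metadata.name,
        paths: [ .spec.volumes[]? | .hostPath.path // empty ] }
    | select(any(.paths[]; startswith("/var/vcap/store") | not))
    | "\(.ns)/\(.pod): \(.paths | join(", "))"'
}

# Usage: kubectl get pods -A -o json | offending_hostpaths
```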

If application changes cannot be made immediately, work around the failure by rebooting the problem nodes until the applications can be permanently updated.