PCF Diego Cells at 100% inode Utilization

Article ID: 297590

Updated On:

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

Symptoms:

This issue affects Diego cells that were not deployed by Elastic Runtime, for example cells in Isolation Segments or in an OSS deployment.

Running df -i reports inode usage at or near 100%.
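As a minimal sketch of how this check could be scripted, the function below filters df -i style output and prints any filesystem at or above a threshold. The function name high_inode_usage and the 90% default are illustrative assumptions, not part of any product tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<mount point> <IUse%>" for every filesystem whose
# inode usage is at or above the threshold (default 90, an arbitrary choice).
# Expects `df -iP` style output on stdin:
#   Filesystem Inodes IUsed IFree IUse% Mounted on
high_inode_usage() {
  local threshold="${1:-90}"
  awk -v t="$threshold" 'NR > 1 {
      use = $5
      sub(/%$/, "", use)                       # strip the trailing percent sign
      if (use ~ /^[0-9]+$/ && use + 0 >= t)    # skip rows that report "-"
          print $6, $5
  }'
}

# Usage on a Diego cell:
#   df -iP | high_inode_usage 90
```

Mount points that contain spaces are not handled by this sketch.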

The Diego deployment manifest should set cleanup_process_dirs_on_wait: true under garden:

/var/tempest/workspaces/default/deployments/cf-b726f387316441065827.yml:
  garden:
    cleanup_process_dirs_on_wait: true

 

The garden start script should pass the --cleanup-process-dirs-on-wait flag:

/var/vcap/data/jobs/garden/4456fe41ab6291aefe82ef966103d435676f45ca/bin/garden_ctl:
      --cleanup-process-dirs-on-wait \

 

The running gdn process should show the --cleanup-process-dirs-on-wait flag:

ps -ef | grep -i gdn
root      514382  514381  2 Nov18 ?        14:24:19 /var/vcap/packages/guardian/bin/gdn server --skip-setup --bind- ...  --cleanup-process-dirs-on-wait

 

If the flag is not set, update the deployment manifest to include cleanup_process_dirs_on_wait: true.
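The process check can be turned into a yes/no test along these lines. This is a sketch only: the function name gdn_has_cleanup_flag is hypothetical, and it assumes the process appears as "gdn server" in ps output, as in the sample line above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if a "gdn server" command line read from stdin
# carries the --cleanup-process-dirs-on-wait flag.
# Two separate greps are used so that neither grep's own command line
# (which may show up in ps output) matches both patterns at once.
gdn_has_cleanup_flag() {
  grep -- 'gdn server' | grep -q -- '--cleanup-process-dirs-on-wait'
}

# Usage on a Diego cell:
#   ps -ef | gdn_has_cleanup_flag && echo "flag set" || echo "flag missing"
```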

Error Message:

Application crashes with the following error:

runc exec: exit status 1: exec failed: open /var/vcap/data/garden/depot/... .../.pidfile: No space left on device

 

Environment


Cause

The garden boolean cleanup_process_dirs_on_wait was introduced in garden-runc-release v1.5.0: https://github.com/cloudfoundry/garden-runc-release/tree/v1.5.0. It defaults to false unless explicitly set in the deployment. While it is disabled, stale directories are left behind, and these eventually exhaust the available inodes.

Note: Versions of Elastic Runtime lower than 1.10.12 do not have this boolean because they use a garden release older than 1.5.0, so those systems are not affected by this problem. Refer to the release notes for the Garden versions packaged with ERT: https://docs.pivotal.io/pivotalcf/1-10/pcf-release-notes/runtime-rn.html
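To see where the inodes are going, a rough per-directory count can help. The sketch below is an assumption-laden illustration: the function name inode_count_by_subdir is hypothetical, and the default path is simply the depot prefix that appears in the error message above. It counts filesystem entries under each immediate subdirectory and lists the largest first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: approximate the number of filesystem entries under each
# immediate subdirectory of a root, largest first.
# The default root is taken from the path in the error message; adjust as needed.
inode_count_by_subdir() {
  local root="${1:-/var/vcap/data/garden/depot}"
  local d
  for d in "$root"/*/; do
    [ -d "$d" ] || continue
    # find prints the directory itself plus everything below it;
    # -xdev keeps the walk on one filesystem.
    printf '%s %s\n' "$(find "$d" -xdev | wc -l | tr -d ' ')" "${d%/}"
  done | sort -rn
}

# Usage on a Diego cell (may take a while on a full disk):
#   inode_count_by_subdir /var/vcap/data/garden/depot | head
```

Hard-linked files are counted once per path, so the totals are an approximation rather than an exact inode count.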

 

Resolution

Update the deployment manifest to set the boolean cleanup_process_dirs_on_wait to true.

For example:
vi /var/tempest/workspaces/default/deployments/p-isolation-segment-XXXX.yml
garden:
  cleanup_process_dirs_on_wait: true

Note: The deployment manifest may vary depending on which type of manifest deployed garden. Check every manifest that contains garden and verify that cleanup_process_dirs_on_wait is set to true.
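Checking every manifest by hand is error-prone, so a text-level scan along these lines may help. This is a sketch under stated assumptions: the function name manifests_missing_cleanup is hypothetical, the default directory is the Ops Manager path used in the examples above, and the match is a plain grep rather than a YAML-aware check.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list manifests that contain a "garden:" key but no
# "cleanup_process_dirs_on_wait: true" line anywhere in the file.
# The default directory is an assumption based on the paths in this article.
manifests_missing_cleanup() {
  local dir="${1:-/var/tempest/workspaces/default/deployments}"
  local f
  for f in "$dir"/*.yml; do
    [ -f "$f" ] || continue
    # Skip manifests that do not mention garden at all.
    grep -Eq '^[[:space:]]*garden:' "$f" || continue
    # Report the file if the property is absent or not set to true.
    grep -Eq '^[[:space:]]*cleanup_process_dirs_on_wait:[[:space:]]*true[[:space:]]*$' "$f" \
      || printf '%s\n' "$f"
  done
}

# Usage on the Ops Manager VM:
#   manifests_missing_cleanup
```

Because the scan is per file, a manifest with several garden blocks is reported only when none of them sets the property; such manifests still need a manual review.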

Once the boolean value is set, run `bosh deploy <deployment name>` to apply the change.

Another option is to run bosh recreate on the Diego cells periodically until the fix is available.

Please note that any configuration change made in Ops Manager will overwrite manual changes to the deployment files.

This issue is fixed in 2.0.x and later releases of PCF Isolation Segment.