BOSH process manager fails to start BOSH job when no space left on tmpfs /var/vcap/data/sys/run

Article ID: 298467

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

You are upgrading Small Footprint VMware Tanzu Application Service for VMs (TAS for VMs) on Xenial stemcell 621.113 / 456.150 or lower. During Apply Changes, you receive the following error:
Task 3712 | 10:01:35 | Updating instance control: control/2af16d75-24de-4b13-a12d-29ab7ca66247 (0) (canary) (00:08:48)
L Error: 'control/2af16d75-24de-4b13-a12d-29ab7ca66247 (0)' is not running after update. Review logs for failed jobs: bbs, leadership-election, loggregator_agent, loggr-syslog-agent, metrics-discovery-registrar, metrics-agent, loggr-forwarder-agent, loggr-udp-forwarder, loggr-syslog-binding-cache, prom_scraper, loggr-system-metric-scraper, locket, route_registrar, policy-server, policy-server-internal, silk-controller, uaa, statsd_injector, cloud_controller_ng, ccng_monit_http_healthcheck, cloud_controller_worker_local_1, cloud_controller_worker_local_2, nginx_cc, routing-api, cloud_controller_clock, cloud_controller_worker_1, cc_deployment_updater, auctioneer, cc_uploader, file_server, ssh_proxy, tps_watcher, loggregator_trafficcontroller, reverse_log_proxy, reverse_log_proxy_gateway, doppler, credhub, bosh-system-metrics-forwarder, log-cache, log-cache-cf-auth-proxy, log-cache-gateway, log-cache-nozzle, service-discovery-controller, metric_registrar_orchestrator, metric_registrar_log_worker, metric_registrar_endpoint_worker, bosh-dns, bosh-dns-resolvconf, bosh-dns-healthcheck, system-metrics-agent
Exit code 1

The tmpfs partition mounted at /var/vcap/data/sys/run, which is 1 MB in size, is full.
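
To confirm that this partition is full, its usage can be checked on the affected VM. The following is a minimal sketch, assuming a root shell on the VM; the function name run_tmpfs_usage_pct is illustrative and is not part of BOSH or bpm:

```shell
# Print the used-space percentage (without the % sign) of the filesystem
# that holds the given directory. Defaults to the bpm runtime tmpfs.
run_tmpfs_usage_pct() {
  dir="${1:-/var/vcap/data/sys/run}"
  # -P forces one output line per filesystem, so field 5 of line 2 is the capacity.
  df -Pk "$dir" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

On an affected VM, calling run_tmpfs_usage_pct with no argument would be expected to print a value at or near 100. Running mount | grep /var/vcap/data/sys/run should additionally show the size= option of the tmpfs.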


Cause

When BOSH process manager (bpm) creates a container for a BOSH job, it stores the container state in the following paths:

  • bpm 1.1.8+: /var/vcap/data/sys/run/bpm-runc
  • bpm 1.1.7-: /var/vcap/data/bpm/runc

The BOSH agent in certain older Xenial stemcell versions mounts /var/vcap/data/sys/run as a 1 MB tmpfs. In Small Footprint TAS for VMs, 50 or more BOSH jobs can be colocated on a single VM; the control VM, for example, runs 51 jobs. The container state files for that many jobs exhaust the 1 MB of space under /var/vcap/data/sys/run.
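
The space budget behind this failure can be worked out directly. The sketch below divides the tmpfs size evenly across the colocated jobs. The even split is a simplifying assumption, since this article does not state how much state each container actually writes, and the function name per_job_budget_kb is hypothetical:

```shell
# Print the whole number of KB available per job if a tmpfs of SIZE_MB
# megabytes were shared evenly by JOB_COUNT jobs (integer division).
per_job_budget_kb() {
  size_mb="$1"
  job_count="$2"
  echo $(( size_mb * 1024 / job_count ))
}

per_job_budget_kb 1 51    # 1 MB tmpfs, 51 jobs on the control VM: prints 20
per_job_budget_kb 16 51   # 16 MB tmpfs, same job count: prints 321
```

With a 1 MB tmpfs, each of the 51 jobs has roughly 20 KB for its state and PID files; a 16 MB tmpfs raises that to roughly 321 KB per job.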

In older Small Footprint TAS for VMs versions (with bpm 1.1.7 or lower), the container state is stored in /var/vcap/data/bpm/runc, which is on the ephemeral disk rather than the tmpfs. Therefore, this issue does not occur with those bpm versions.

When this issue occurs, the BOSH job's log may also show the error "no space left on device":
time="2021-10-27T05:02:36Z" level=error msg="write /var/vcap/sys/run/bpm/metrics-agent/.metrics-agent.pid: no space left on device"
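
To see which entries consume the space, the contents of the tmpfs can be ranked by size. This is a minimal sketch, assuming a root shell on the affected VM; largest_entries is an illustrative helper name, not a BOSH or bpm command:

```shell
# List the largest entries directly under a directory, biggest first,
# with sizes in KB. Defaults to the bpm runtime tmpfs and the top 10 entries.
largest_entries() {
  dir="${1:-/var/vcap/data/sys/run}"
  limit="${2:-10}"
  du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$limit"
}
```

On a VM running bpm 1.1.8 or later, the bpm-runc directory would be expected to appear near the top of the list.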


Environment

Product Version: 2.9
OS: linux

Resolution

Workaround

To temporarily work around this issue, enlarge the tmpfs mounted at /var/vcap/data/sys/run.

  1. Run this command to stop all services: monit stop all
  2. Run this command: umount /var/vcap/data/sys/run
  3. Run this command: mount -t tmpfs -o rw,relatime,size=16m tmpfs /var/vcap/data/sys/run
  4. Run this command to start all services: monit start all
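
The four steps above can be wrapped in a single function so that they run in order and stop at the first failure. This is a sketch, not an official script: the function name resize_run_tmpfs and the RUN dry-run variable are illustrative, and the commands themselves must be run as root on the affected VM.

```shell
# Run the workaround steps in order, stopping at the first failure.
# Set RUN=echo to print the commands instead of executing them.
resize_run_tmpfs() {
  target="${1:-/var/vcap/data/sys/run}"
  size="${2:-16m}"
  ${RUN:-} monit stop all &&
    ${RUN:-} umount "$target" &&
    ${RUN:-} mount -t tmpfs -o "rw,relatime,size=$size" tmpfs "$target" &&
    ${RUN:-} monit start all
}
```

Running RUN=echo resize_run_tmpfs prints the commands for review, and resize_run_tmpfs without RUN executes them. If umount reports that the target is busy, services may still be shutting down; monit summary shows their current state.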


Resolution

The following BOSH agent versions, with their corresponding stemcell versions, set up a 16 MB tmpfs at /var/vcap/data/sys/run. Updating the stemcell permanently resolves the problem.

  • BOSH agent 2.268.21+ (Xenial stemcell 621.115+)
  • BOSH agent 2.234.11+ (Xenial stemcell 456.152+)
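
A stemcell version can be checked against the two thresholds listed above with a small helper. This is a sketch under stated assumptions: stemcell_tmpfs_status is a hypothetical name, the version string is supplied by the operator (for example, as reported by bosh stemcells), and only the 621 and 456 lines named in this article are classified.

```shell
# Classify a Xenial stemcell version against the fixed versions listed above.
# Prints "fixed", "needs-upgrade", or "unknown" (for input not covered here).
stemcell_tmpfs_status() {
  case "$1" in
    [0-9]*.[0-9]*) ;;                 # expect MAJOR.MINOR
    *) echo unknown; return 0 ;;
  esac
  major=${1%%.*}
  rest=${1#*.}
  minor=${rest%%[!0-9]*}              # keep only the leading digits of MINOR
  case "$major" in
    621) min_fixed=115 ;;
    456) min_fixed=152 ;;
    *) echo unknown; return 0 ;;
  esac
  if [ "$minor" -ge "$min_fixed" ]; then
    echo fixed
  else
    echo needs-upgrade
  fi
}
```

For example, stemcell_tmpfs_status 621.113 prints needs-upgrade, and stemcell_tmpfs_status 456.152 prints fixed.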


References

  • Stemcell (Linux) Release Notes
  • The BOSH agent code for setting up the tmpfs: https://github.com/cloudfoundry/bosh-agent/blob/v2.268.21/platform/linux_platform.go#L805-L835