bpm fails to start a BOSH job because of stale runtime (runc) container state.
A BOSH job fails to start with an error like the following:
# trace from Ops Manager Apply Changes or BOSH deploy
Task 1106304 | 19:21:52 | Error: 'diego_database/ed739413-54fc-488c-b69f-ab69789daeaa (0)' is not running after update. Review logs for failed jobs: bbs
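To identify and reach the failing VM, the usual BOSH CLI commands can be used (the environment alias and deployment name below are placeholders; the instance name is the one from the task error above):

# list instances whose jobs are failing
bosh -e <env> -d <deployment> instances --failing

# ssh to the failing instance to inspect its logs
bosh -e <env> -d <deployment> ssh diego_database/ed739413-54fc-488c-b69f-ab69789daeaa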
Check the job logs under /var/vcap/sys/log/<job>/ on the failing VM. For example, you may see these errors:
# /var/vcap/sys/log/bbs/bpm.log
{"timestamp":"2019-02-13T22:36:06.255829878Z","level":"error","source":"bpm","message":"bpm.start.failed-getting-job","data":{"error":"exit status 1","job":"bbs","process":"bbs","session":"1"}}

# /var/vcap/sys/log/bbs/bbs.stderr.log
container with id exists: bpm-bbs
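If it is not obvious which job is affected, searching the job logs for the runc error from the excerpt above narrows it down:

# list the log files that mention the stale-container error
grep -rl 'container with id exists' /var/vcap/sys/log/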
This error indicates stale runc container state left over from a previous run. runc creates a directory <CONTAINER_ID> in the root directory of container state, and creates a state file state.json in it, when running a new container; it removes the <CONTAINER_ID> directory and the state.json file in it from the root directory of container state when deleting a managed container. If that cleanup never happened (for example, because the VM rebooted abruptly or the runc process was killed), the leftover state prevents runc from creating a new container with the same ID, so bpm cannot start the job.

To recover, first stop the failing job with monit:

monit stop bbs
Next, check the bpm version, because it determines where bpm keeps runc's container state:

/var/vcap/packages/bpm/bin/bpm --version
bpm 1.1.8 and later use /var/vcap/sys/run/bpm-runc as the root directory for storage of container state when calling runc; bpm 1.1.7 and earlier use /var/vcap/data/bpm/runc instead. Set RUNCROOT accordingly for the following steps:

# for bpm 1.1.7 or lower:
export RUNCROOT=/var/vcap/data/bpm/runc

# for bpm 1.1.8 or higher:
export RUNCROOT=/var/vcap/sys/run/bpm-runc
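Before deleting anything, it is worth confirming that stale state for the job is actually present under the chosen root (bpm-bbs and state.json are the names taken from the error and the cause description above):

# confirm stale state exists for the failing job
ls -l ${RUNCROOT}
ls -l ${RUNCROOT}/bpm-bbs        # should contain runc's state.json for the stale container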
Remove the stale directory <CONTAINER_ID> from the root directory for container state, either by deleting the <CONTAINER_ID> directory from the file system or by asking runc to delete the container. bpm uses bpm-<job> as the container ID when calling runc. For example:

rm -rf ${RUNCROOT}/bpm-bbs

# or:
/var/vcap/packages/bpm/bin/runc --root ${RUNCROOT} delete --force bpm-bbs

# verify the stale directory has been removed
ls -l ${RUNCROOT}
Finally, restart the job:

monit start bbs
Verify that the job is running again:

~# /var/vcap/packages/bpm/bin/bpm list
Name                    Pid    Status
bbs                     3716   running
locket                  23061  running
policy-server           23141  running
policy-server-internal  23185  running
route_registrar         23101  running
silk-controller         23224  running

/var/vcap/packages/bpm/bin/runc --root ${RUNCROOT} list
ID                                        PID    STATUS   BUNDLE                                                                      CREATED                          OWNER
MJRHG---                                  3716   running  /var/vcap/data/bpm/bundles/bbs/bbs                                          2019-02-05T15:00:32.876607659Z   root
NRXWG23FOQ------                          23061  running  /var/vcap/data/bpm/bundles/locket/locket                                    2019-01-28T23:34:19.510136787Z   root
OBXWY2LDPEWXGZLSOZSXE---                  23141  running  /var/vcap/data/bpm/bundles/policy-server/policy-server                      2019-01-28T23:34:21.646283109Z   root
OBXWY2LDPEWXGZLSOZSXELLJNZ2GK4TOMFWA----  23185  running  /var/vcap/data/bpm/bundles/policy-server-internal/policy-server-internal   2019-01-28T23:34:22.719821309Z   root
OJXXK5DFL5ZGKZ3JON2HEYLS                  23101  running  /var/vcap/data/bpm/bundles/route_registrar/route_registrar                 2019-01-28T23:34:20.574210052Z   root
ONUWY2ZNMNXW45DSN5WGYZLS                  23224  running  /var/vcap/data/bpm/bundles/silk-controller/silk-controller                 2019-01-28T23:34:23.808821282Z   root
Note: Starting the container manually will sometimes reveal more relevant errors.
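For example (a debugging sketch, not an official procedure: the bpm command is the same one monit invokes in the monit config shown below, and the runc bundle path is the one reported by runc list above):

# re-run the job start by hand, then check /var/vcap/sys/log/bbs/ again
/var/vcap/jobs/bpm/bin/bpm start bbs

# or run the container in the foreground with runc itself, using the bundle
# path from the `runc list` output above, so errors print to the terminal
/var/vcap/packages/bpm/bin/runc --root ${RUNCROOT} run --bundle /var/vcap/data/bpm/bundles/bbs/bbs bpm-bbs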
Since this commit (bpm 1.0.3+), bpm uses bpm-<job> as the container ID when calling runc. Earlier versions of bpm used unreadable strings such as "MJRHG---" as container IDs, which made it difficult to link runc containers to bpm jobs. This should not be a problem in practice, as recent TAS/TKGI versions ship bpm 1.1+.
Below are example monit configs for BOSH jobs: one where bpm is involved as the intermediate management layer (TAS), and one where the job is driven directly by its own control script (TKGI):
# from TAS diego_database instance
diego_database/231dd4bb-3291-433f-b5ba-1dcdf6e9790d:~# cat /var/vcap/jobs/bbs/monit
check process bbs
  with pidfile /var/vcap/sys/run/bpm/bbs/bbs.pid
  start program "/var/vcap/jobs/bpm/bin/bpm start bbs"
  stop program "/var/vcap/jobs/bpm/bin/bpm stop bbs"
  group vcap
# from TKGI kubernetes worker node
worker/c3e543ee-8167-4957-bc45-d8f0e991a329:~# cat /var/vcap/jobs/kubelet/monit
check process kubelet
  with pidfile /var/vcap/sys/run/kubernetes/kubelet.pid
  start program "/var/vcap/jobs/kubelet/bin/kubelet_ctl start" with timeout 120 seconds
  stop program "/var/vcap/jobs/kubelet/bin/kubelet_ctl stop"
  group vcap
  depends on docker