bpm fails to start a BOSH job because of stale runtime (runc) container state.
A BOSH job fails to start with an error like the following:
# trace from Ops Manager Apply Changes or BOSH deploy
Task 1106304 | 19:21:52 | Error: 'diego_database/ed739413-54fc-488c-b69f-ab69789daeaa (0)' is not running after update. Review logs for failed jobs: bbs
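To identify and reach the failing VM, the usual BOSH CLI commands can be used (the environment alias and deployment name below are placeholders; the instance name is the one from the task error above):

# list instances whose jobs are failing
bosh -e <env> -d <deployment> instances --failing

# ssh to the failing instance to inspect its logs
bosh -e <env> -d <deployment> ssh diego_database/ed739413-54fc-488c-b69f-ab69789daeaa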
Check the job logs under /var/vcap/sys/log/<job>/ on the failing VM. For example, you may see these errors:
# /var/vcap/sys/log/bbs/bpm.log
{"timestamp":"2019-02-13T22:36:06.255829878Z","level":"error","source":"bpm","message":"bpm.start.failed-getting-job","data":{"error":"exit status 1","job":"bbs","process":"bbs","session":"1"}}

# /var/vcap/sys/log/bbs/bbs.stderr.log
container with id exists: bpm-bbs
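If it is not obvious which job is affected, searching the job logs for the runc error from the excerpt above narrows it down:

# list the log files that mention the stale-container error
grep -rl 'container with id exists' /var/vcap/sys/log/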
This error indicates stale runc container state left over from a previous run. runc creates a directory <CONTAINER_ID> in the root directory of container state, and creates a state file state.json in it, when running a new container; it removes the <CONTAINER_ID> directory and the state.json file in it from the root directory of container state when deleting a managed container. If that cleanup never happened (for example, because the VM rebooted abruptly or the runc process was killed), the leftover state prevents runc from creating a new container with the same ID, so bpm cannot start the job.

To recover, first stop the failing job with monit:

monit stop bbs
Next, check the bpm version, because it determines where bpm keeps runc's container state:

/var/vcap/packages/bpm/bin/bpm --version
bpm 1.1.8 and later use /var/vcap/sys/run/bpm-runc as the root directory for storage of container state when calling runc; bpm 1.1.7 and earlier use /var/vcap/data/bpm/runc instead. Set RUNCROOT accordingly for the following steps:

# for bpm 1.1.7 or lower:
export RUNCROOT=/var/vcap/data/bpm/runc

# for bpm 1.1.8 or higher:
export RUNCROOT=/var/vcap/sys/run/bpm-runc
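Before deleting anything, it is worth confirming that stale state for the job is actually present under the chosen root (bpm-bbs and state.json are the names taken from the error and the cause description above):

# confirm stale state exists for the failing job
ls -l ${RUNCROOT}
ls -l ${RUNCROOT}/bpm-bbs        # should contain runc's state.json for the stale container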
Remove the stale directory <CONTAINER_ID> from the root directory for container state, either by deleting the <CONTAINER_ID> directory from the file system or by asking runc to delete the container. bpm uses bpm-<job> as the container ID when calling runc. For example:

rm -rf ${RUNCROOT}/bpm-bbs

# or:
/var/vcap/packages/bpm/bin/runc --root ${RUNCROOT} delete --force bpm-bbs

# verify the stale directory has been removed
ls -l ${RUNCROOT}
Finally, restart the job:

monit start bbs
Verify that the job is running again:

~# /var/vcap/packages/bpm/bin/bpm list
Name                    Pid    Status
bbs                     3716   running
locket                  23061  running
policy-server           23141  running
policy-server-internal  23185  running
route_registrar         23101  running
silk-controller         23224  running

/var/vcap/packages/bpm/bin/runc --root ${RUNCROOT} list
ID                                        PID    STATUS   BUNDLE                                                                      CREATED                          OWNER
MJRHG---                                  3716   running  /var/vcap/data/bpm/bundles/bbs/bbs                                          2019-02-05T15:00:32.876607659Z   root
NRXWG23FOQ------                          23061  running  /var/vcap/data/bpm/bundles/locket/locket                                    2019-01-28T23:34:19.510136787Z   root
OBXWY2LDPEWXGZLSOZSXE---                  23141  running  /var/vcap/data/bpm/bundles/policy-server/policy-server                      2019-01-28T23:34:21.646283109Z   root
OBXWY2LDPEWXGZLSOZSXELLJNZ2GK4TOMFWA----  23185  running  /var/vcap/data/bpm/bundles/policy-server-internal/policy-server-internal   2019-01-28T23:34:22.719821309Z   root
OJXXK5DFL5ZGKZ3JON2HEYLS                  23101  running  /var/vcap/data/bpm/bundles/route_registrar/route_registrar                 2019-01-28T23:34:20.574210052Z   root
ONUWY2ZNMNXW45DSN5WGYZLS                  23224  running  /var/vcap/data/bpm/bundles/silk-controller/silk-controller                 2019-01-28T23:34:23.808821282Z   root
Note: Starting the container manually will sometimes reveal more relevant errors.
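For example (a debugging sketch, not an official procedure: the bpm command is the same one monit invokes in the monit config shown below, and the runc bundle path is the one reported by runc list above):

# re-run the job start by hand, then check /var/vcap/sys/log/bbs/ again
/var/vcap/jobs/bpm/bin/bpm start bbs

# or run the container in the foreground with runc itself, using the bundle
# path from the `runc list` output above, so errors print to the terminal
/var/vcap/packages/bpm/bin/runc --root ${RUNCROOT} run --bundle /var/vcap/data/bpm/bundles/bbs/bbs bpm-bbs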
Since this commit (bpm 1.0.3+), bpm uses bpm-<job> as the container ID when calling runc. Earlier versions of bpm used unreadable strings such as "MJRHG---" as container IDs, which made it difficult to link runc containers to bpm jobs. This should not be a problem in practice, as recent TAS/TKGI versions ship bpm 1.1+.
Below are example monit configs for BOSH jobs: one where bpm is involved as the intermediate management layer (TAS), and one where the job is driven directly by its own control script (TKGI):
# from TAS diego_database instance
diego_database/231dd4bb-3291-433f-b5ba-1dcdf6e9790d:~# cat /var/vcap/jobs/bbs/monit
check process bbs
  with pidfile /var/vcap/sys/run/bpm/bbs/bbs.pid
  start program "/var/vcap/jobs/bpm/bin/bpm start bbs"
  stop program "/var/vcap/jobs/bpm/bin/bpm stop bbs"
  group vcap
# from TKGI kubernetes worker node
worker/c3e543ee-8167-4957-bc45-d8f0e991a329:~# cat /var/vcap/jobs/kubelet/monit
check process kubelet
  with pidfile /var/vcap/sys/run/kubernetes/kubelet.pid
  start program "/var/vcap/jobs/kubelet/bin/kubelet_ctl start" with timeout 120 seconds
  stop program "/var/vcap/jobs/kubelet/bin/kubelet_ctl stop"
  group vcap
  depends on docker