Deployment fails because monit reports job as failed

Article ID: 293607

Products

Operations Manager

Issue/Introduction

Symptoms:

The deployment log reports the following error:

Error: 'cloud_controller/6632bf71-7493-4383-a3f9-9401bafb4710 (1)' is not running after update. Review logs for failed jobs: cloud_controller_ng

The output of "monit summary" shows the cloud_controller_ng job as "Execution Failed":

:~# monit summary
The Monit daemon 5.2.5 uptime: 11d 0h 24m

Process 'consul_agent'              running
Process 'cloud_controller_ng'       Execution Failed
Process 'cloud_controller_worker_local_1' running
Process 'cloud_controller_worker_local_2' running
Process 'nginx_cc'                  running
Process 'routing-api'               running
Process 'metron_agent'              running
Process 'route_registrar'           running
Process 'statsd_injector'           running
Process 'blackbox'                  running
Process 'bosh-dns'                  running
System 'system_localhost'           running
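On a VM with many jobs it can help to filter the summary down to the processes that monit does not consider running. The following is a minimal sketch, assuming the summary format shown above; the function name failed_monit_processes is hypothetical and not part of monit or BOSH.

```shell
#!/bin/sh
# Hypothetical helper: reads "monit summary" output on stdin and prints
# the name of every Process entry whose status is not "running".
failed_monit_processes() {
    # Split each line on single quotes: $2 is the process name,
    # $3 is the padded status text that follows the closing quote.
    awk -F"'" '/^Process / {
        status = $3
        gsub(/^[ \t]+|[ \t]+$/, "", status)   # trim surrounding whitespace
        if (status != "running") print $2
    }'
}
```

Usage on the affected VM would be "monit summary | failed_monit_processes". The "System" line is ignored because only lines starting with "Process" are inspected.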

According to "/var/vcap/monit/job/0009_cloud_controller_ng.monitrc", monit tracks the cloud_controller_ng process through the PID file "/var/vcap/sys/run/cloud_controller_ng/cloud_controller_ng.pid":

~# cat /var/vcap/sys/run/cloud_controller_ng/cloud_controller_ng.pid
32516

Running ps shows that the process with that PID is running, and the cloud_controller_ng logs do not report any errors.

~# ps -ef | egrep 32516
root     21852 18512  0 15:51 pts/1    00:00:00 egrep --color=auto 32516
vcap     32516     1  1 May25 ?        02:52:36 ruby /var/vcap/packages/cloud_controller_ng/cloud_controller_ng/bin/cloud_controller -c /var/vcap/jobs/cloud_controller_ng/config/cloud_controller_ng.yml
vcap     32518 32516  0 May25 ?        00:00:00 bash /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_ctl start
vcap     32519 32516  0 May25 ?        00:00:00 bash /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_ctl start
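The manual check above (read the PID file, then confirm the PID is alive) can be combined into one step. This is only a sketch: the function name pidfile_is_alive and its return codes are illustrative assumptions, and it should be run as root, as in the examples above, because "kill -0" can fail with a permission error against another user's process.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 if the PID stored in the given PID file
# belongs to a live process, 1 if it does not, and 2 if the file is
# missing or does not contain a number.
pidfile_is_alive() {
    pidfile="$1"
    # Without a readable PID file nothing can be confirmed.
    [ -r "$pidfile" ] || return 2
    pid=$(cat "$pidfile")
    # Reject empty or non-numeric content.
    case "$pid" in
        ''|*[!0-9]*) return 2 ;;
    esac
    # Signal 0 sends nothing; it only tests that the process exists.
    kill -0 "$pid" 2>/dev/null
}
```

For example: pidfile_is_alive /var/vcap/sys/run/cloud_controller_ng/cloud_controller_ng.pid && echo "process is running". If this succeeds while monit still shows "Execution Failed", the symptoms match this article.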

Environment

Cause

In some cases monit reports a process as "Execution Failed" even though the process is healthy. This is a known race condition that can occur under certain circumstances that cannot be avoided.

Resolution

There is currently no fix for this issue. However, users can work around it by making monit re-check or restart the affected process, using one of the two options below.

The first option is to unmonitor the job and then monitor it again. This forces monit to check whether the PID is up.

monit unmonitor cloud_controller_ng
monit monitor cloud_controller_ng

The second option is to restart the process so that monit is re-synced.

monit restart cloud_controller_ng
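The two options can also be chained, trying the less disruptive one first. The sketch below is an illustration rather than a supported procedure: the function name resync_monit_job and the default 30-second wait are assumptions, and the wait may need adjusting to monit's polling interval on the VM.

```shell
#!/bin/sh
# Hypothetical wrapper: try unmonitor/monitor first, and restart the job
# only if monit still does not report it as running afterwards.
resync_monit_job() {
    job="$1"
    wait_secs="${2:-30}"   # assumed wait; give monit time to re-check the PID

    # Option 1: make monit re-check the PID.
    monit unmonitor "$job"
    monit monitor "$job"
    sleep "$wait_secs"

    # Look for this job's summary line and test whether it ends in "running".
    if monit summary | grep -F "'$job'" | grep -Eq 'running[[:space:]]*$'; then
        echo "$job: monit reports running"
        return 0
    fi

    # Option 2: restart the process through monit.
    echo "$job: still not running, restarting"
    monit restart "$job"
}
```

For example: resync_monit_job cloud_controller_ng. Note that the restart in option 2 briefly interrupts the job, which is why it is only used as a fallback here.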

A future release of the Linux stemcell changes the way processes are reloaded, which should help prevent this type of issue in the future.
