BOSH Director health flapping due to count-cores service errors

Article ID: 293864

Updated On:

Products

Operations Manager

Issue/Introduction

BOSH Director health is flapping. The issue can cause monitoring systems/tools to show the BOSH Director VM repeatedly cycling between healthy and unhealthy states.

This behavior suggests that a service on the BOSH Director is in an unhealthy state or restart loop. To check for any services that are throwing errors, review the services running on the BOSH Director VM by examining its monit log, located at /var/vcap/monit/monit.log.
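
As a quick diagnostic sketch, assuming shell access to the BOSH Director VM (for example, via the credentials available in Ops Manager), the following commands show the state of all monit-managed jobs and let you watch the log for restart cycles:

# List the state of every monit-managed job on the Director
sudo monit summary

# Watch the monit log for repeating error/restart cycles
sudo tail -f /var/vcap/monit/monit.log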

  • Take note of the repeating errors observed in the monit.log on the Director VM:
[UTC Aug 13 04:49:05] info     : 'count-cores' trying to restart
[UTC Aug 13 04:49:05] info     : 'count-cores' start: /var/vcap/jobs/bpm/bin/bpm
[UTC Aug 13 04:49:16] info     : 'count-cores' process is running with pid 416
[UTC Aug 13 04:49:36] error    : 'count-cores' process is not running
[UTC Aug 13 04:49:36] info     : 'count-cores' trying to restart
[UTC Aug 13 04:49:36] info     : 'count-cores' start: /var/vcap/jobs/bpm/bin/bpm
[UTC Aug 13 04:49:47] info     : 'count-cores' process is running with pid 829
[UTC Aug 13 04:50:17] error    : 'count-cores' process is not running
[UTC Aug 13 04:50:17] info     : 'count-cores' trying to restart
[UTC Aug 13 04:50:17] info     : 'count-cores' start: /var/vcap/jobs/bpm/bin/bpm
[UTC Aug 13 04:50:28] info     : 'count-cores' process is running with pid 1246
[UTC Aug 13 04:50:58] error    : 'count-cores' process is not running


Further investigation will show an error similar to the one below in /var/vcap/sys/log/count-cores/count-cores.stderr.log on the BOSH Director VM:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x23 pc=0xa1a336]
 
goroutine 1 [running]:
github.com/pivotal/count-cores/internal/vsphere.(*client).GetVMs(0xc000014388, {0xfdzf58, 0xc001102c020})
/var/vcap/data/compile/count-cores-cli/internal/vsphere/client.go:96 +0x6b6
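
To confirm the same panic on your own Director, a minimal check (again assuming shell access) is to search the stderr log directly:

# Show the panic and the first few lines of the stack trace, if present
sudo grep -A 4 "panic:" /var/vcap/sys/log/count-cores/count-cores.stderr.log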



Environment

Product Version: Other

Resolution

Reason:
This is a bug in the new "count-cores" feature introduced in Opsman v2.10.58. At a high level, count-cores collects information about virtual machines from the underlying infrastructure and then panics while parsing it.

Note:

  • Versions below v2.10.58 are not impacted, as the count-cores feature is not present in them.
  • A patch has been cut in count-cores release v0.3.7 and is available starting in Opsman v3.0.15 and later.


Workarounds:

  1. Ignore the job failures. It does no harm for the job to exit and be restarted by monit.
  2. Manually patch the file /var/vcap/jobs/count-cores/config/bpm.yml on the BOSH Director to set the "-repeat-interval" to an extremely large value, such as a 10-year equivalent expressed in minutes ("5259600m"). After saving the new interval, restart the job for the change to take effect with the command "monit restart count-cores". This temporary workaround postpones the count-cores feature and prevents the Director health from flapping; see the sketch after the notes below.

   *Note 1: If the Director is redeployed and the setting is reverted, simply re-apply this workaround.

   *Note 2: If the above workarounds do not work for your version of Opsman, upgrading to Opsman v3.0.15 or later will be necessary.
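
Below is a minimal sketch of workaround 2, assuming shell access to the BOSH Director. The exact contents of bpm.yml (executable path, other arguments) vary by release, so only the value following "-repeat-interval" should be changed:

# Edit the bpm config for the count-cores job
sudo vi /var/vcap/jobs/count-cores/config/bpm.yml

# In the args list of the count-cores process, set the repeat interval
# to roughly 10 years expressed in minutes, for example:
#   args:
#   - -repeat-interval
#   - 5259600m

# Restart the job so the new interval takes effect
sudo monit restart count-cores

# Confirm the job settles into a "running" state
sudo monit summary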