Bosh health flapping

This issue can cause monitoring systems and tools to repeatedly show the Bosh Director VM in an unhealthy state.
A flapping health status suggests that a service on the Bosh Director is unhealthy or stuck in a restart loop. To find services that are throwing errors, review the services running on the Bosh Director VM by checking its monit log, located at /var/vcap/monit/monit.log.
[UTC Aug 13 04:49:05] info : 'count-cores' trying to restart
[UTC Aug 13 04:49:05] info : 'count-cores' start: /var/vcap/jobs/bpm/bin/bpm
[UTC Aug 13 04:49:16] info : 'count-cores' process is running with pid 416
[UTC Aug 13 04:49:36] error : 'count-cores' process is not running
[UTC Aug 13 04:49:36] info : 'count-cores' trying to restart
[UTC Aug 13 04:49:36] info : 'count-cores' start: /var/vcap/jobs/bpm/bin/bpm
[UTC Aug 13 04:49:47] info : 'count-cores' process is running with pid 829
[UTC Aug 13 04:50:17] error : 'count-cores' process is not running
[UTC Aug 13 04:50:17] info : 'count-cores' trying to restart
[UTC Aug 13 04:50:17] info : 'count-cores' start: /var/vcap/jobs/bpm/bin/bpm
[UTC Aug 13 04:50:28] info : 'count-cores' process is running with pid 1246
[UTC Aug 13 04:50:58] error : 'count-cores' process is not running
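On a busy director the restart loop can be hard to spot by eye. The following is a minimal Python sketch, not an official tool, that counts "process is not running" errors per process in monit log text. It assumes the entry format shown in the excerpt above. The function names and the default threshold of 3 are illustrative assumptions, not part of monit or Bosh.

```python
import re
from collections import Counter

# Matches entries in the format shown in the excerpt, for example:
# [UTC Aug 13 04:49:36] error : 'count-cores' process is not running
# finditer is used so entries are found whether or not each one sits on its own line.
ENTRY = re.compile(
    r"\[(?P<ts>[^\]]+)\]\s+(?P<level>\w+)\s*:\s+'(?P<name>[^']+)'\s+(?P<msg>[^\[]*)"
)


def count_failures(log_text):
    """Count 'process is not running' errors per monit-managed process."""
    failures = Counter()
    for entry in ENTRY.finditer(log_text):
        is_error = entry.group("level") == "error"
        is_down = entry.group("msg").strip() == "process is not running"
        if is_error and is_down:
            failures[entry.group("name")] += 1
    return failures


def flapping_processes(log_text, threshold=3):
    """Return names of processes that failed at least `threshold` times.

    The threshold is an arbitrary assumption; tune it for your environment.
    """
    counts = count_failures(log_text)
    return sorted(name for name, n in counts.items() if n >= threshold)
```

To use it, read the contents of /var/vcap/monit/monit.log into a string on the director VM and pass that string to flapping_processes. A process that keeps reappearing in the result is a likely source of the flapping.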
Further investigation shows an error similar to the following in /var/vcap/sys/log/count-cores/count-cores.stderr.log on the Bosh Director VM:
Panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x23 pc=0xa1a336]

goroutine 1 [running]:
github.com/pivotal/count-cores/internal/vsphere.(*client).GetVMs(0xc000014388, {0xfdzf58, 0xc001102c020})
    /var/vcap/data/compile/count-cores-cli/internal/vsphere/client.go:96 +0x6b6
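To check whether a stderr log shows this particular failure rather than some other crash, the two distinctive parts of the trace can be matched together. The sketch below is a hedged Python illustration. The helper name and the choice of these two substrings as the signature are assumptions based only on the trace shown above.

```python
# Substrings taken from the trace shown in this article.
NIL_DEREFERENCE = "invalid memory address or nil pointer dereference"
VSPHERE_FRAME = "count-cores/internal/vsphere.(*client).GetVMs"


def matches_known_panic(stderr_text):
    """Return True if the text contains both the nil-pointer panic message
    and the vsphere GetVMs stack frame seen in this issue."""
    return NIL_DEREFERENCE in stderr_text and VSPHERE_FRAME in stderr_text
```

Pass it the contents of count-cores.stderr.log. A True result suggests the same crash described here, while False means the log should be read more closely before applying the workaround.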
Reason:
This is a bug in the "count-cores" feature that was introduced in Opsman v2.10.58. At a high level, count-cores retrieves information about virtual machines from the infrastructure and then fails with an error while parsing it.
Note:
Workarounds:
*Note 1: If the director is redeployed and the setting is reverted as a result, re-apply this workaround.
*Note 2: If the above workarounds do not work for your version of Opsman, upgrade Opsman to v3.0.15 or later.