If the BOSH Director is down or returns a non-successful HTTP status, the Health Monitor attempts to log an error message. That error should be rescued and logged as a failure to connect to the BOSH Director.
However, BOSH Director versions prior to 271.9.0 had an issue where logging the error message itself raised an unhandled error, which caused the process to exit (see this commit).
Therefore, if you observe the symptoms below, consider upgrading to OpsManager v2.10.16 (which ships with BOSH Director 271.9.0) or later.
- The health_monitor process keeps shutting down and shows a "Connection failed" status when you run the "monit summary" command.
$ cat health_monitor_2022-07-15T00.log| grep "HealthMonitor shutting down" | wc -l
50
$ cat health_monitor_2022-07-15T00.log| grep "HealthMonitor shutting down" | awk '{print $2}' | sort -n
[2022-07-15T00:02:01.986314
[2022-07-15T00:02:33.713786
[2022-07-15T00:04:59.442516
...
[2022-07-15T00:29:08.416517
[2022-07-15T00:29:08.424290
[2022-07-15T00:31:41.679941
[2022-07-15T00:31:41.689279
- In /var/vcap/sys/log/health_monitor/health_monitor.log, you see the following FATAL errors.
F, [2022-07-15T00:02:01.986037 #8] FATAL : undefined method `uri' for #<EventMachine::HttpClient:0x00007ff828143048>
F, [2022-07-15T00:02:01.986215 #8] FATAL : /var/vcap/data/packages/health_monitor/ebd720e04124ece241fa14883bec6443d5934a5d/gem_home/ruby/2.6.0/gems/bosh-monitor-0.0.0/lib/bosh/monitor/director.rb:36:in `get_deployment_instances'
F, [2022-07-15T00:02:33.713363 #8] FATAL : undefined method `uri' for #<EventMachine::HttpClient:0x0000564f7cae7d70>
F, [2022-07-15T00:02:33.713723 #8] FATAL : /var/vcap/data/packages/health_monitor/ebd720e04124ece241fa14883bec6443d5934a5d/gem_home/ruby/2.6.0/gems/bosh-monitor-0.0.0/lib/bosh/monitor/director.rb:36:in `get_deployment_instances'
- In monit.log, you see error messages similar to the following.
[UTC Jul 15 00:14:11] info : 'health_monitor' trying to restart
[UTC Jul 15 00:14:12] info : 'health_monitor' start: /var/vcap/jobs/bpm/bin/bpm
[UTC Jul 15 00:14:24] info : 'health_monitor' process is running with pid 26678
[UTC Jul 15 00:14:24] error : HTTP: error receiving data -- Resource temporarily unavailable
[UTC Jul 15 00:14:25] error : 'health_monitor' failed protocol test [HTTP] at INET[localhost:25923/healthz] via TCP
[UTC Jul 15 00:14:25] info : 'health_monitor' exec: /var/vcap/jobs/bpm/bin/bpm
[UTC Jul 15 00:14:36] error : 'health_monitor' process is not running
- Restarting the process with the monit command did not resolve the issue. After the BOSH Director was rebooted, the process returned to a normal running state.
- You are using OpsManager v2.10.15 or an earlier version.
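
The log checks above can be collected into a few small shell helpers. This is a minimal sketch, not an official diagnostic tool: the function names (count_shutdowns, count_uri_fatals, version_lt) are hypothetical, the search strings are copied from the log excerpts above, and version_lt assumes a sort that supports -V (GNU coreutils).

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking the symptoms described above.
# All function names are illustrative and not part of BOSH or monit.

# Print how many lines of the given log file record a Health Monitor shutdown.
count_shutdowns() {
  grep -cF "HealthMonitor shutting down" "$1"
}

# Print how many lines of the given log file contain the
# "undefined method `uri'" error signature shown in the FATAL excerpts.
count_uri_fatals() {
  grep -cF "undefined method \`uri' for #<EventMachine::HttpClient" "$1"
}

# Succeed (exit status 0) when version $1 is strictly older than version $2.
# Assumes `sort -V` (version sort) is available.
version_lt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}
```

For example, after sourcing the file, count_uri_fatals /var/vcap/sys/log/health_monitor/health_monitor.log prints the number of matching FATAL lines, and version_lt "$director_version" 271.9.0 succeeds when the Director is older than the fixed release. The director_version value is assumed to be supplied by hand; the sketch does not query the Director for it.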