HealthMonitor keeps shutting down and "monit restart" is not able to solve the issue


Article ID: 293850


Updated On:

Products

Operations Manager

Issue/Introduction

If the BOSH Director is down or returns a non-successful HTTP status, the Health Monitor attempts to log an error message. The error should be rescued and logged as a failure to connect to the BOSH Director.

However, BOSH Director versions prior to 271.9.0 had an issue where logging the error message itself raises an unhandled error, which causes the process to exit (see this commit).

Therefore, if you observe the symptoms below, consider upgrading to OpsManager v2.10.16 or later (v2.10.16 ships with BOSH Director 271.9.0).

The health_monitor process keeps shutting down and shows a "Connection failed" status when you run the "monit summary" command.

$ cat health_monitor_2022-07-15T00.log| grep "HealthMonitor shutting down" | wc -l
      50
 
$ cat health_monitor_2022-07-15T00.log| grep "HealthMonitor shutting down" | awk '{print $2}' | sort -n
[2022-07-15T00:02:01.986314
[2022-07-15T00:02:33.713786
[2022-07-15T00:04:59.442516
...
[2022-07-15T00:29:08.416517
[2022-07-15T00:29:08.424290
[2022-07-15T00:31:41.679941
[2022-07-15T00:31:41.689279

In /var/vcap/sys/log/health_monitor/health_monitor.log, you will see FATAL errors like the following.

F, [2022-07-15T00:02:01.986037 #8] FATAL : undefined method `uri' for #<EventMachine::HttpClient:0x00007ff828143048>
F, [2022-07-15T00:02:01.986215 #8] FATAL : /var/vcap/data/packages/health_monitor/ebd720e04124ece241fa14883bec6443d5934a5d/gem_home/ruby/2.6.0/gems/bosh-monitor-0.0.0/lib/bosh/monitor/director.rb:36:in `get_deployment_instances'
F, [2022-07-15T00:02:33.713363 #8] FATAL : undefined method `uri' for #<EventMachine::HttpClient:0x0000564f7cae7d70>
F, [2022-07-15T00:02:33.713723 #8] FATAL : /var/vcap/data/packages/health_monitor/ebd720e04124ece241fa14883bec6443d5934a5d/gem_home/ruby/2.6.0/gems/bosh-monitor-0.0.0/lib/bosh/monitor/director.rb:36:in `get_deployment_instances'
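
To confirm that a log contains this specific failure rather than an unrelated shutdown, you can count the lines that carry the error signature shown above. The following is a minimal sketch; the function name count_uri_fatals is hypothetical and is not part of BOSH or OpsManager, and it assumes the log lines have the same format as the excerpt.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of BOSH): count the lines in a Health Monitor
# log that contain the "undefined method `uri'" FATAL signature.
count_uri_fatals() {
  local log_file="$1"
  # -c prints the number of matching lines.
  # -F treats the pattern as a fixed string, so no regex escaping is needed.
  grep -cF "undefined method \`uri' for #<EventMachine::HttpClient" "$log_file"
}
```

For example, run `count_uri_fatals /var/vcap/sys/log/health_monitor/health_monitor.log` on the BOSH Director VM. A non-zero count alongside repeated "HealthMonitor shutting down" entries is consistent with the issue described in this article.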

In monit.log, you will see error messages similar to the following.

[UTC Jul 15 00:14:11] info : 'health_monitor' trying to restart
[UTC Jul 15 00:14:12] info : 'health_monitor' start: /var/vcap/jobs/bpm/bin/bpm
[UTC Jul 15 00:14:24] info : 'health_monitor' process is running with pid 26678
[UTC Jul 15 00:14:24] error : HTTP: error receiving data -- Resource temporarily unavailable
[UTC Jul 15 00:14:25] error : 'health_monitor' failed protocol test [HTTP] at INET[localhost:25923/healthz] via TCP
[UTC Jul 15 00:14:25] info : 'health_monitor' exec: /var/vcap/jobs/bpm/bin/bpm
[UTC Jul 15 00:14:36] error : 'health_monitor' process is not running

You tried restarting the process with the monit command, but that did not resolve the issue. After the BOSH Director was rebooted, the process returned to a normal running state.
You are using OpsManager v2.10.15 or earlier versions.

Environment

Product Version: 2.10

Resolution

Upgrade to OpsManager v2.10.16 or later (v2.10.16 ships with BOSH Director 271.9.0).
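
If you already know the BOSH Director version of your environment, a comparison against 271.9.0 can be scripted. The sketch below is illustrative only: the function names version_lt and needs_upgrade are hypothetical, the version string must be supplied by you, and it assumes `sort -V` (version sort, as provided by GNU coreutils) is available. It only compares version numbers and does not account for fixes that may have been backported to other release lines.

```shell
#!/usr/bin/env bash

# Succeeds (exit 0) when version $1 is strictly lower than version $2.
# Relies on `sort -V`, which orders dotted version strings numerically.
version_lt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Hypothetical check: prints "affected" when the given BOSH Director version
# is lower than 271.9.0 (the version named in this article), else "fixed".
needs_upgrade() {
  if version_lt "$1" "271.9.0"; then
    echo "affected"
  else
    echo "fixed"
  fi
}
```

For example, `needs_upgrade 271.8.0` prints "affected", while `needs_upgrade 271.9.0` prints "fixed".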
