VMs recreated by BOSH health_monitor after BOSH Director deployment

Article ID: 369864

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

VMs are randomly recreated during a BOSH Director upgrade when upgrading Tanzu Operations Manager.

Recreated VMs have been reported unresponsive with the following message in the BOSH task logs:

Applying problem resolutions: VM for 'tsdb/abcde-1234-567-fghi-1234567890 (0)' with cloud ID 'vm-abcde-fghijk-lmno-qurs-uwxyz1243456' is not responding. (unresponsive_agent 107456): Recreate VM and wait for processes to start (00:04:04)

This occurs on random VMs, from Cloud Controllers to Diego Cells to Healthwatch VMs.

Cause

NATS (the messaging system) on the BOSH Director VM cannot get sufficient CPU immediately after OS initialization, so it misses heartbeat messages from some BOSH-deployed VMs.
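
One way to corroborate this (a rough sketch only; it assumes SSH access to the Director VM shortly after it boots) is to snapshot job states and CPU usage while the Director's processes are still starting:

sudo /var/vcap/bosh/bin/monit summary   # confirm which jobs (including nats) are still starting
top -b -n 1 | head -n 20                # snapshot the busiest processes; heavy CPU use by
                                        # starting jobs can starve NATS of cycles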

Resolution

After the BOSH Director VM is redeployed, health_monitor waits a grace period of 30 seconds before it starts checking heartbeats, to give NATS time to start and stabilize. With a high number of VMs/agents this period can be insufficient, because NATS cannot handle that volume of processing while the Director's jobs are still starting (every process consumes high CPU when starting). Increasing the 30-second grace period avoids the problem. The setting can be overridden through the Operations Manager API, for example to 90 seconds:

{
  "overrides": [
    {
      "section": "instance_groups",
      "data": {
        "hm": {
          "intervals": {
            "poll_grace_period": 90
          }
        }
      }
    }
  ]
}

To make this change, you first need to enable advanced mode, which is done by setting the "locked" field to false. Below are the om commands for the full process; a consolidated script sketch follows the steps.

  1. om curl -p /api/v0/staged/infrastructure/locked -x PUT -d '{ "locked": false}'
  2. You can check that advanced mode is actually enabled with
    om curl -p /api/v0/staged/infrastructure/locked
    
    >>>> output
    {
      "locked": false,
      "advanced_mode": true
    }
  3. Change the grace period to a higher value. 90 or 120 seconds is a good starting point; increase it if the problem persists.
    om curl -p /api/v0/staged/director/overrides -x PUT -d '{"overrides":[{"section":"instance_groups","data":{"hm":{"intervals":{"poll_grace_period":90}}}}]}'
  4. You can check that the change took effect by running
    om curl -p /api/v0/staged/director/overrides
    
    >>>> output
    {
      "overrides": [
        {
          "section": "instance_groups",
          "data": {
            "hm": {
              "intervals": {
                "poll_grace_period": 90
              }
            }
          }
        }
      ]
    }
  5. After deploying the BOSH Director, you can SSH into the Director VM and check that poll_grace_period in /var/vcap/jobs/health_monitor/config/health_monitor.yml has changed to the new value.
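
For convenience, here is a consolidated sketch of steps 1-4 as a single shell script. It assumes om is installed and already targeted and authenticated against your Operations Manager; the grace period value is illustrative and can be passed as an argument.

#!/usr/bin/env bash
# Sketch only: assumes om is targeted/authenticated against Operations Manager.
set -euo pipefail

GRACE_PERIOD="${1:-90}"   # seconds; raise further if VMs are still recreated

# 1. Enable advanced mode by unlocking the staged infrastructure settings.
om curl -p /api/v0/staged/infrastructure/locked -x PUT -d '{ "locked": false }'

# 2. Confirm advanced mode is enabled ("advanced_mode": true in the response).
om curl -p /api/v0/staged/infrastructure/locked

# 3. Stage the health_monitor poll_grace_period override.
om curl -p /api/v0/staged/director/overrides -x PUT \
  -d "{\"overrides\":[{\"section\":\"instance_groups\",\"data\":{\"hm\":{\"intervals\":{\"poll_grace_period\":${GRACE_PERIOD}}}}}]}"

# 4. Verify the staged override before applying changes in Operations Manager.
om curl -p /api/v0/staged/director/overrides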

Engineering is working to include this setting in the BOSH Director configuration dashboard.