BOSH commands hang indefinitely or time out

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

Symptoms:

BOSH commands appear to hang indefinitely or time out. Running the following command reveals hundreds or thousands of scan and fix tasks:

bosh -e director tasks --no-filter
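
To gauge how large the backlog is, the task listing can be piped through a small filter. The following is a minimal sketch, not an official tool: the helper name count_scan_and_fix is hypothetical, and it assumes the task description in the listing contains the text "scan and fix".

```shell
# Hypothetical helper: count the lines of a task listing (read from stdin)
# that mention "scan and fix". Assumes the description column contains that text.
count_scan_and_fix() {
  # grep -c prints 0 and exits non-zero when nothing matches;
  # "|| true" keeps the exit status clean in that case.
  grep -c -i 'scan and fix' || true
}

# Usage, with the same environment alias as the command above:
#   bosh -e director tasks --no-filter | count_scan_and_fix
```

A count in the hundreds or thousands matches the symptom described here.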

Cause

This scenario typically occurs during a BOSH deployment that generates many tasks while, at the same time, a BOSH agent is intermittently skipping heartbeats.

When the BOSH agent misses a heartbeat, the health monitor (which runs on the BOSH Director) triggers the creation of a scan and fix task. By the time the scan and fix task executes, the agent has successfully sent a heartbeat again, so the task skips resurrection of the instance.

This cycle can repeat hundreds of times, causing the Director task queue to build up. When many long-running deployment tasks are executing at the same time, this can cause a race condition in which the task queue grows too large.

Resolution

Note: Running bosh stop, start, restart, or recreate may result in undesirable behavior if deployment changes are in progress. When troubleshooting this type of issue, avoid these commands. Instead, use the IaaS or BOSH CCK to perform these troubleshooting actions.

SSH into the BOSH Director VM and stop the health monitor. This prevents the health monitor from creating new tasks. The in-flight scan and fix tasks should time out after about 10 minutes.
```
monit stop health_monitor
```
Restart the BOSH Director process to force all queued scan and fix tasks to cancel.
```
monit restart director
```
Next, identify which VM is triggering the scan and fix tasks. This is not always apparent, because bosh vms may report the agent as responsive most of the time and as unresponsive only intermittently.
Check the output of bosh vms to see whether you can quickly identify unresponsive agents.
```
bosh -e director vms --details
```
If the output of bosh vms does not reveal the VM (for example, because the deployment has many VMs), SSH into the BOSH Director and review the health monitor logs. Search the log for warnings like the following.
```
WARN : (Resurrector) notifying director to recreate unresponsive VM: cf-300c3738aa8b3ad21fca router/07a82795-821f-4466-9509-f19ac2caf927
```
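When the log contains many of these warnings, it can help to tally them per instance. The following is a minimal sketch that assumes each warning has the same shape as the sample line above. The helper name unresponsive_instances is hypothetical, and the log path in the usage comment is an assumption that may differ on your Director.

```shell
# Hypothetical helper: read health monitor log lines from stdin and print
# "<count> <deployment> <instance>" for each instance the Resurrector tried
# to recreate, most frequent first.
unresponsive_instances() {
  awk -F'unresponsive VM: ' '
    /notifying director to recreate unresponsive VM: / { count[$2]++ }
    END { for (k in count) print count[k], k }
  ' | sort -rn
}

# Usage on the BOSH Director (log path is an assumption; adjust as needed):
#   unresponsive_instances < /var/vcap/sys/log/health_monitor/health_monitor.log
```

Instances near the top of the output are the most likely sources of the repeated scan and fix tasks.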
Once you have identified the instance or instances causing the issue, attempt to SSH into them and gather any logs that could help determine why the agent is behaving this way. This is not always possible, but in some cases operators can SSH directly into the VM. The symptoms could be the result of network, CPU, or memory resource constraints. Collect the following directories and command output when possible.
- /var/vcap/sys/log
- /var/vcap/bosh/logs
- df -h
- free
- ps aux
- ps -ef
- uptime
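The items above can be gathered in one pass with a short script. The following is a minimal sketch to run on the affected VM. The function names and the default output directory are hypothetical, and the script simply skips any log directory that does not exist.

```shell
# Hypothetical helper: run a command and save its output (including errors)
# to a file, without aborting if the command is missing or fails.
capture() {
  out_file="$1"; shift
  "$@" > "$out_file" 2>&1 || true
}

# Hypothetical helper: collect the listed diagnostics into one directory.
collect_diagnostics() {
  out_dir="${1:-/tmp/agent-diagnostics}"   # default path is an assumption
  mkdir -p "$out_dir" || return 1

  capture "$out_dir/df.txt"     df -h
  capture "$out_dir/free.txt"   free
  capture "$out_dir/ps_aux.txt" ps aux
  capture "$out_dir/ps_ef.txt"  ps -ef
  capture "$out_dir/uptime.txt" uptime

  # Archive each log directory from the list above if it is present.
  for dir in /var/vcap/sys/log /var/vcap/bosh/logs; do
    if [ -d "$dir" ]; then
      name=$(printf '%s' "$dir" | tr '/' '_')
      tar -czf "$out_dir/logs${name}.tar.gz" "$dir" 2>/dev/null || true
    fi
  done
  return 0
}

# Usage:
#   collect_diagnostics /tmp/agent-diagnostics
```

Reading the log directories may require root privileges on the VM.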
Rebooting the identified virtual machines may resolve the agent's state in some cases. Try this first, because it may restore SSH access to the instance, which both resolves the state and provides a means of collecting the needed logs.
If a reboot does not resolve the issue, power off the virtual machine from the IaaS interface. This ensures that the BOSH agent remains unresponsive and stops the health monitor from generating new scan and fix tasks.
If the VMs are left powered off because the reboot was not successful, BOSH CCK can then be used to recreate the virtual machine instances.