lsof -p <bosh agent pid>
shows an established connection from the agent instance to the director on port 25777, for example:
bosh-agent 2707 root 6u IPv4 248913 0t0 TCP mysql-proxy-0.node.dc1.cf.internal:44934->192.168.10.41:25777 (ESTABLISHED)
This TCP session should close immediately after the director returns the HTTP response. If the connection lingers for an extended period of time, the agent is likely hung.
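The check above can be scripted. The following is a minimal sketch, not part of BOSH: director_connections is a hypothetical helper name, and it assumes lsof prints the endpoint pair as local->remote followed by the state in parentheses, as in the sample line above.

```shell
# Hypothetical helper (not shipped with BOSH): read lsof output on stdin and
# print only ESTABLISHED TCP connections whose REMOTE port is the director
# port (default 25777). Returns non-zero when no such connection is found.
director_connections() {
    port="${1:-25777}"
    # "->" precedes the remote endpoint, so a local port 25777 does not match.
    grep -E -e "->[^ ]*:${port} \(ESTABLISHED\)"
}

# Example usage (run as root on the affected VM; <bosh agent pid> as above):
#   lsof -nP -a -p <bosh agent pid> -iTCP | director_connections
```

Running the usage line twice, a minute or more apart, and seeing the same local port both times suggests the connection is lingering rather than being opened and closed per request.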
The BOSH agent hangs when sending an API request to the director. For some reason, the HTTP request sent by the agent never gets a response, and the agent waits indefinitely, which blocks agent heartbeats from being sent to the director.
From the OS perspective on the agent VM, the kernel still reports the TCP socket to port 25777 as established. On the director side, however, the session is no longer established, so the HTTP transport waits indefinitely for the TCP session to close or for a response to be returned.
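The indefinite wait happens because nothing on the client side bounds the request. As a generic illustration only (this is not the agent's code or its fix), a script that talks to the director can impose its own deadline. The sketch below assumes the GNU coreutils timeout command is available; bounded_request is a hypothetical wrapper name.

```shell
# Hypothetical wrapper: run a command, but give up after a deadline so that a
# half-open TCP session cannot block the caller forever.
# Assumes GNU coreutils `timeout`; it exits non-zero when the deadline is hit.
bounded_request() {
    deadline="$1"   # seconds to wait before giving up
    shift
    timeout "$deadline" "$@"
}

# Example usage: allow any command at most 30 seconds, e.g.
#   bounded_request 30 <command that contacts the director>
```

A wrapper like this does not repair a hung agent; it only keeps a caller's own checks from stalling in the same way.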
This is fixed in the latest stemcell release. The BOSH agent now times out aggressively, after 5 minutes, on NATS connection failure.
As a workaround, SSH directly to the affected VMs from the Operations (Ops) Manager VM and kill the BOSH agent process as follows:
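ssh vcap@
sudo -i
ps aux | grep agent
ps -ef | grep bosh-agent | grep -v grep
root 866 857 0 Jul04 ? 00:26:38 /var/vcap/bosh/bin/bosh-agent -P ubuntu -C /var/vcap/bosh/agent.json
kill -9 866
In the ps -ef output, the second column (866 here) is the agent's PID and the third (857) is its parent's PID; kill the agent's PID.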
The BOSH agent is then recovered automatically and starts under a new PID.
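To avoid picking the wrong column by eye, the PID lookup in the workaround can be sketched as a small helper. This is an illustration under assumptions: agent_pid is a hypothetical name, and the parsing relies on the standard ps -ef column layout (UID, PID, PPID, C, STIME, TTY, TIME, CMD) shown in the sample output above.

```shell
# Hypothetical helper: read `ps -ef` output on stdin and print the PID
# (second column) of every process whose executable is bosh-agent.
# Matching on the command column ($8) also skips the `grep bosh-agent` line.
agent_pid() {
    awk '$8 ~ /\/bosh-agent$/ { print $2 }'
}

# Example usage (as root on the affected VM):
#   pid="$(ps -ef | agent_pid)"
#   [ -n "$pid" ] && kill -9 $pid
```

Printing the PID first and confirming it against the ps output before running kill is the safer order on a production VM.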