"Cannot connect to the monit daemon" error during BOSH tasks.

"Cannot connect to the monit daemon" error during BOSH tasks.


Article ID: 293621


Products

Operations Manager

Issue/Introduction

Intermittently, during BOSH tasks such as "bosh deploy", you might encounter failures such as "Cannot connect to the monit daemon". An example snippet is shown below:

 
Task 2924 | 01:14:03 | Error: Action Failed get_task: Task 4e2d3e05-be15-471b-5220-ec67584b08a9 result: Stopping Monitored Services: Stop all services: Running command: 'monit stop -g vcap', stdout: '', stderr: 'monit-actual: > Cannot connect to the monit daemon. Did you start it with http support?
monit-actual: Cannot connect to the monit daemon. Did you start it with http support?
': exit status 1
 

Background

  • The error was caused by a TCP RESET ("Connection Refused") being dropped by the current firewall rule "...` ! --cgroup xxx -j DROP`". Rather than closing the socket, the dropped RESET left the socket lingering in a `LAST-ACK` state for approximately 106-108 seconds. When "monit" is heavily active, that lingering socket could be randomly selected by a subsequent monit call, forcing a timeout (see the check after this list).
  • The RESET was being dropped because it was delivered by the kernel, and the kernel does not belong to any cgroup, correct or otherwise, because the kernel is not a process.
  • The kernel delivers the packet because the original sender, which did belong to the correct cgroup, has already exited, leaving the kernel responsible for delivering the RESET. When the associated iptables rule is checked, the RESET packet has no associated cgroup, so it matches the `! --cgroup` condition and is dropped, leaving a dangling connection.
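
One way to confirm this on an affected VM is to look for TCP sockets stuck in `LAST-ACK` against the monit daemon port (2822, the same port referenced by the workaround rule below). This is a suggested diagnostic, not part of the original fix:

# Show TCP sockets in LAST-ACK involving the monit port 2822;
# lingering entries here while monit is active are consistent with this issue.
ss -tan state last-ack '( dport = :2822 or sport = :2822 )'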

 

Workaround 

 

The fix is included in the Stemcell itself. As a workaround, you can apply the following iptables command to any VM within the deployment using the Stemcell (including the BOSH Director VM):
 

iptables -t mangle -I POSTROUTING -d 127.0.0.1/32 -p tcp -m tcp --dport 2822 -m state --state ESTABLISHED,RELATED -j ACCEPT


To SSH into the BOSH Director VM:

- Locate and copy the Director's password in Operations Manager under the BOSH Director tile >> Credentials >> VM Credentials.
- Run ssh -o StrictHostKeyChecking=no vcap@<bosh vm ip> to SSH in as the 'vcap' user.
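
To apply the workaround across all VMs in a deployment in one pass, one option is the BOSH CLI's "bosh ssh" with a command argument; the deployment name below is a placeholder:

# Run the workaround on every instance of a deployment (replace <deployment-name>)
bosh -d <deployment-name> ssh -c "sudo iptables -t mangle -I POSTROUTING -d 127.0.0.1/32 -p tcp -m tcp --dport 2822 -m state --state ESTABLISHED,RELATED -j ACCEPT"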

 


This command resolves the issue by inserting a rule that accepts packets belonging to connections that are already established, so the kernel's RESET is no longer dropped and the socket closes normally.
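
To verify the rule is in place, you can list the mangle table's POSTROUTING chain; the new ACCEPT rule should appear at the top, since it was inserted with `-I`:

# List the mangle POSTROUTING chain with rule numbers
iptables -t mangle -L POSTROUTING -n --line-numbers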

 

Note: It is highly likely that you will need to re-run the command after any recreate of the BOSH VM. The rule added with the iptables command is not stored on the VM's persistent disk (it lands on the root or ephemeral disk at best), and only persistent disks survive a recreate of the VM.
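
If you expect to re-apply the rule often, a sketch of an idempotent form (using `iptables -C` to check whether the rule already exists before inserting it) is:

# Insert the workaround rule only if it is not already present (-C checks for an exact match)
iptables -t mangle -C POSTROUTING -d 127.0.0.1/32 -p tcp -m tcp --dport 2822 -m state --state ESTABLISHED,RELATED -j ACCEPT 2>/dev/null \
  || iptables -t mangle -I POSTROUTING -d 127.0.0.1/32 -p tcp -m tcp --dport 2822 -m state --state ESTABLISHED,RELATED -j ACCEPT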




Environment

Operations Manager: 2.10.47-build.562

Resolution

Engineering has included the fix in the Stemcell. If you face this issue, you need to be using a Jammy v1.64+ or Xenial v621.364+ stemcell. You should be able to upload the fixed stemcell within Operations Manager and assign it to the tile having monit issues, including the BOSH Director.