Excessive BOSH Director Health Monitor alerts after upgrading Ops Manager from 3.0.34 to 3.0.35

Article ID: 383448

Products

Operations Manager

VMware Tanzu Kubernetes Grid Integrated (TKGi)

Issue/Introduction

After upgrading Ops Manager from 3.0.34 to 3.0.35 and completing a successful "Apply Changes" for the BOSH Director tile and all other tiles, you might receive a large number of alerts from the BOSH Director Health Monitor, similar to the following:

---
agent e8fbaf23-####-####-####-e126c1d54b7e [] e8fbaf23-####-####-####-e126c1d54b7e is not a part of any deployment
Severity: 2
Summary: e8fbaf23-####-####-####-e126c1d54b7e is not a part of any deployment
Time: 2024-12-03 01:48:23 UTC
---
Health monitor failed to connect to director
Severity: 3
Summary: Unable to send get /deployments to director: Invalid URI: https://<BOSH Director IP>:25555/info
Time: 2024-12-03 01:48:31 UTC
---
Health monitor failed to connect to director
Severity: 3
Summary: Unable to send get /configs to director: Invalid URI: https://<BOSH Director IP>:25555/info
Time: 2024-12-03 01:48:31 UTC
---

 

The same messages also appear in the BOSH Director audit log, /var/vcap/sys/log/director/audit.log:

I, [2024-12-03T02:10:25.888945 #23] []  INFO -- DirectorAudit: {"id":4592293,"parent_id":null,"user":"health_monitor","timestamp":"2024-12-03 02:10:25 UTC","action":"create","object_type":"alert","object_name":"a8b76583-####-####-####-50c7e7a01196","error":null,"task":null,"deployment":null,"instance":null,"context_json":"{\"message\":\"b1ac3f7b-####-####-####-de861367c926 is not a part of any deployment. Alert @ 2024-12-03 02:10:25 UTC, severity 2: b1ac3f7b-####-####-####-de861367c926 is not a part of any deployment\"}"}

I, [2024-12-03T02:10:31.314463 #20] []  INFO -- DirectorAudit: {"id":4592294,"parent_id":null,"user":"health_monitor","timestamp":"2024-12-03 02:10:31 UTC","action":"create","object_type":"alert","object_name":"4620a48e-####-####-####-ef49b6a644da","error":null,"task":null,"deployment":null,"instance":null,"context_json":"{\"message\":\"Health monitor failed to connect to director. Alert @ 2024-12-03 02:10:31 UTC, severity 3: Unable to send get /deployments to director: Invalid URI: https://<BOSH Director IP>:25555/info\"}"}

I, [2024-12-03T02:10:31.316049 #23] []  INFO -- DirectorAudit: {"id":4592295,"parent_id":null,"user":"health_monitor","timestamp":"2024-12-03 02:10:31 UTC","action":"create","object_type":"alert","object_name":"3b841131-####-####-####-4857b99f7c20","error":null,"task":null,"deployment":null,"instance":null,"context_json":"{\"message\":\"Health monitor failed to connect to director. Alert @ 2024-12-03 02:10:31 UTC, severity 3: Unable to send get /configs to director: Invalid URI: https://<BOSH Director IP>:25555/info\"}"}
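To gauge how many of these alerts an environment is producing, the audit log entries can be tallied by symptom. The following is a minimal Python sketch, not an official tool. It assumes each entry sits on a single line in the format shown above, with a JSON payload after the "DirectorAudit:" tag. The function names count_health_monitor_alerts and classify, and the bucket labels, are hypothetical and chosen only for illustration.

```python
import json
import re
from collections import Counter

# Captures the JSON payload that follows the "DirectorAudit:" tag on a log line.
AUDIT_RE = re.compile(r"DirectorAudit:\s*(\{.*\})\s*$")


def classify(message):
    """Bucket an alert message into one of the symptoms described in this article."""
    if "is not a part of any deployment" in message:
        return "unmanaged_agent"
    if "Health monitor failed to connect to director" in message:
        return "director_connection_failure"
    return "other"


def count_health_monitor_alerts(lines):
    """Count alerts created by the health_monitor user, grouped by symptom."""
    counts = Counter()
    for line in lines:
        match = AUDIT_RE.search(line)
        if not match:
            continue
        try:
            event = json.loads(match.group(1))
            # context_json is itself a JSON document stored as a string.
            context = json.loads(event.get("context_json") or "{}")
        except json.JSONDecodeError:
            continue  # skip truncated or wrapped entries
        if event.get("user") != "health_monitor":
            continue
        if event.get("object_type") != "alert":
            continue
        counts[classify(context.get("message", ""))] += 1
    return counts
```

One way to use it is to open a copy of audit.log taken from the BOSH Director VM and pass the file object to count_health_monitor_alerts. A high and steadily growing count in both buckets is consistent with the behavior described in this article.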

 

This condition might also prevent the BOSH Resurrector from working: it can fail to recreate managed tile VMs that enter an unresponsive state.

Environment

Ops Manager 3.0.35

BOSH Director v280.1.120

Cause

The problem is caused by a defect introduced in BOSH Director with the adoption of the Async::HTTP module: the Async::HTTP::Endpoint.parse() function is called with a parameter that is not a string.

 

Because the Health Monitor fails to fetch deployment information from the BOSH Director endpoint, it treats all BOSH agents as unmanaged. As a result, some metrics, such as BOSH unresponsive agents (bosh_unresponsive_agents), are missing, and the Healthwatch dashboard shows "No Data" for the "Bosh unresponsive Agents" metric.

Resolution

An open source GitHub issue has been created to track this problem. The fix will be included in future Ops Manager releases that ship BOSH Director v280.1.12 or later. Ops Manager 3.0.36 and later should resolve this issue.