VMs are randomly recreated during the BOSH Director upgrade that runs when upgrading Tanzu Operations Manager.
The recreated VMs are reported as unresponsive, with the following message in the BOSH task logs:
Applying problem resolutions: VM for 'tsdb/abcde-1234-567-fghi-1234567890 (0)' with cloud ID 'vm-abcde-fghijk-lmno-qurs-uwxyz1243456' is not responding. (unresponsive_agent 107456): Recreate VM and wait for processes to start (00:04:04)
This occurs on random VMs, from Cloud Controllers to Diego cells to Healthwatch VMs.
NATS (the messaging system) on the BOSH Director VM cannot get sufficient CPU immediately after OS initialization, so it misses heartbeat messages from some BOSH-deployed VMs.
After the BOSH Director VM is redeployed, health_monitor waits a grace period of 30 seconds before it starts checking heartbeats. This gives NATS time to start and become stable. The period can be insufficient in environments with a high number of VMs/agents, where the BOSH NATS cannot handle that amount of processing while starting (every process consumes a lot of CPU at startup). Increasing the 30-second grace period helps avoid the problem. The value can be overridden through the Operations Manager API. For example, to set it to 90 seconds:
{
  "overrides": [
    {
      "section": "instance_groups",
      "data": {
        "hm": {
          "intervals": {
            "poll_grace_period": 90
          }
        }
      }
    }
  ]
}
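If a value other than 90 is needed, the payload above could be generated rather than typed by hand. The following is a minimal sketch only: build_override_payload is a hypothetical helper name, not part of om or Operations Manager, and it assumes the grace period is given as a whole number of seconds.

```shell
#!/bin/sh
# Hypothetical helper (not part of om or Operations Manager): prints the
# overrides payload shown above for a given grace period in seconds.
build_override_payload() {
  seconds="$1"
  # Accept digits only, with no leading zeros, so the output stays valid JSON.
  case "$seconds" in
    ''|*[!0-9]*|0?*)
      echo "grace period must be a whole number of seconds" >&2
      return 1
      ;;
  esac
  printf '{"overrides":[{"section":"instance_groups","data":{"hm":{"intervals":{"poll_grace_period":%s}}}}]}\n' "$seconds"
}

# Print the payload for a 90-second grace period.
build_override_payload 90
```

The printed string can then be passed as the -d argument of the om curl PUT request to /api/v0/staged/director/overrides.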
To make this change, you first need to enable advanced mode by setting the "locked" field to false. Below are the om commands for the full process.
om curl -p /api/v0/staged/infrastructure/locked -x PUT -d '{ "locked": false}'
om curl -p /api/v0/staged/infrastructure/locked
>>>> output
{
  "locked": false,
  "advanced_mode": true
}
om curl -p /api/v0/staged/director/overrides -x PUT -d '{"overrides":[{"section":"instance_groups","data":{"hm":{"intervals":{"poll_grace_period":90}}}}]}'
om curl -p /api/v0/staged/director/overrides
>>>> output
{
  "overrides": [
    {
      "section": "instance_groups",
      "data": {
        "hm": {
          "intervals": {
            "poll_grace_period": 90
          }
        }
      }
    }
  ]
}
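To confirm the staged value without reading the JSON by eye, the response could be filtered with standard text tools. This is a rough sketch, not an official procedure: staged_grace_period is a hypothetical helper name, it assumes the response has the same shape as the output above, and it relies on tr and sed rather than a real JSON parser.

```shell
#!/bin/sh
# Hypothetical helper: reads an overrides response on stdin and prints the
# numeric value of poll_grace_period (prints nothing if the key is absent).
staged_grace_period() {
  # Strip all whitespace first so the pattern matches pretty-printed JSON too.
  tr -d ' \t\r\n' | sed -n 's/.*"poll_grace_period":\([0-9][0-9]*\).*/\1/p'
}

# Sample input copied from the output above.
staged_grace_period <<'EOF'
{
  "overrides": [
    {
      "section": "instance_groups",
      "data": {
        "hm": {
          "intervals": {
            "poll_grace_period": 90
          }
        }
      }
    }
  ]
}
EOF
```

In practice, the output of om curl -p /api/v0/staged/director/overrides would be piped into the function in place of the sample input.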
Engineering is working to include this setting in the BOSH Director configuration dashboard.