Scan & fix on a Windows instance (TKGi cluster worker node) consistently fails with the following error:
Timed out sending 'get_state' to instance: 'windows-worker/8f50de11-6122-4026-80a7-fe8e02c20a6a', agent-id: 'f995ea7b-e56f-4aec-8239-3e95fea83ab5' after 10 seconds
This error occurs after the "start" action has completed. There appears to be a NATS disconnection around the time the Director sends the "get_state" request.
The bosh task debug logs show that the Director sent the "start" action and received the agent's response successfully. The "get_state" request was also sent, but no response was logged.
D, [2024-02-08T16:14:46.625301 #864064] [task:38567] DEBUG -- DirectorJobRunner: SENT: agent.f995ea7b-e56f-4aec-8239-3e95fea83ab5 {"protocol":3,"method":"start","arguments":[],"reply_to":"director.db42edc4-5926-4f28-90bf-5c85e89d6ee7.f995ea7b-e56f-4aec-8239-3e95fea83ab5.36083e1f-88d2-4b36-a687-2ac481de3028"}
D, [2024-02-08T16:14:52.312620 #864064] [] DEBUG -- DirectorJobRunner: RECEIVED: director.db42edc4-5926-4f28-90bf-5c85e89d6ee7.f995ea7b-e56f-4aec-8239-3e95fea83ab5.36083e1f-88d2-4b36-a687-2ac481de3028 {"value":"started"}
The Director then proceeded to check the state of the Windows instance by sending the "get_state" action to the agent, but it timed out waiting for a response after 10 seconds.
I, [2024-02-08T16:14:52.313544 #864064] [task:38567] INFO -- DirectorJobRunner: Waiting for 10.0 seconds to check windows-worker/8f50de11-6122-4026-80a7-fe8e02c20a6a (0) status
I, [2024-02-08T16:15:02.323652 #864064] [task:38567] INFO -- DirectorJobRunner: Checking if windows-worker/8f50de11-6122-4026-80a7-fe8e02c20a6a (0) has been updated after 10.0 seconds
D, [2024-02-08T16:15:02.323853 #864064] [task:38567] DEBUG -- DirectorJobRunner: SENT: agent.f995ea7b-e56f-4aec-8239-3e95fea83ab5 {"protocol":3,"method":"get_state","arguments":[],"reply_to":"director.db42edc4-5926-4f28-90bf-5c85e89d6ee7.f995ea7b-e56f-4aec-8239-3e95fea83ab5.140d466a-d65d-4f02-8c17-365176aa8ca2"}
D, [2024-02-08T16:15:12.329317 #864064] [task:38567] DEBUG -- DirectorJobRunner: (0.000348s) (conn: 37860) UPDATE "tasks" SET "event_output" = ("event_output" || '{"time":1707408912,"stage":"Applying problem resolutions","tags":[],"total":1,"task":"VM for ''windows-worker/8f50de11-6122-4026-80a7-fe8e02c20a6a (0)'' with cloud ID ''vm-ab81bf1d-724c-44ca-afd0-3f38a181e871'' is not responding. (unresponsive_agent 410): Recreate VM and wait for processes to start","index":1,"state":"failed","progress":100,"data":{"error":"Timed out sending ''get_state'' to instance: ''windows-worker/8f50de11-6122-4026-80a7-fe8e02c20a6a'', agent-id: ''f995ea7b-e56f-4aec-8239-3e95fea83ab5'' after 10 seconds"}}
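For reference, the complete Director debug log of the failing scan & fix task (task 38567 in this example) can be retrieved with the bosh CLI:
$ bosh task 38567 --debug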
The bosh agent logs show that this particular "get_state" request never reached the agent: the agent was disconnected from NATS at that time.
********************
[MBus Handler] 2024/02/08 16:14:52 INFO - Responding
[MBus Handler] 2024/02/08 16:14:52 DEBUG - Payload
********************
{"value":"started"}
********************
[NATS Handler] 2024/02/08 16:14:52 INFO - Sending hm message 'alert'
[NATS Handler] 2024/02/08 16:14:52 DEBUG - Message Payload
********************
{"id":"kubelet","severity":2,"title":"kubelet (172.20.2.8) - pid failed - Start","summary":"exited with code 1","created_at":1707408892}
********************
[NATS Handler] 2024/02/08 16:14:57 DEBUG - Increased reconnect to: 2s
[NATS Handler] 2024/02/08 16:14:57 DEBUG - Nats disconnected with Error: read tcp 172.20.2.8:49671->192.168.40.21:4222: wsarecv: An established connection was aborted by the software in your host machine.
[NATS Handler] 2024/02/08 16:14:57 DEBUG - Attempting to reconnect: true
[Cmd Runner] 2024/02/08 16:14:57 DEBUG - Running command: arp -d 192.168.40.21
[Cmd Runner] 2024/02/08 16:14:57 DEBUG - Stdout:
[Cmd Runner] 2024/02/08 16:14:57 DEBUG - Stderr: The specified entry was not found.^M
[Cmd Runner] 2024/02/08 16:14:57 DEBUG - Successful: true (0)
[NATS Handler] 2024/02/08 16:14:57 DEBUG - Cleaned ip-mac address cache for: 192.168.40.21.
[NATS Handler] 2024/02/08 16:14:57 DEBUG - Waiting to reconnect to nats.. Current attempt: 0, Connected: false
[NATS Handler] 2024/02/08 16:14:59 INFO - Sending hm message 'alert'
[NATS Handler] 2024/02/08 16:14:59 DEBUG - Message Payload
********************
{"id":"kubelet","severity":2,"title":"kubelet (172.20.2.8) - pid failed - Start","summary":"exited with code 1","created_at":1707408899}
********************
[NATS Handler] 2024/02/08 16:14:59 DEBUG - Increased reconnect to: 4s
[NATS Handler] 2024/02/08 16:15:00 INFO - Sending hm message 'alert'
[NATS Handler] 2024/02/08 16:15:00 DEBUG - Message Payload
********************
{"id":"nsx-kube-proxy","severity":2,"title":"nsx-kube-proxy (172.20.2.8) - pid failed - Start","summary":"exited with code 0","created_at":1707408900}
********************
[Cmd Runner] 2024/02/08 16:15:02 DEBUG - Running command: arp -d 192.168.40.21
[Cmd Runner] 2024/02/08 16:15:02 DEBUG - Stdout:
[Cmd Runner] 2024/02/08 16:15:02 DEBUG - Stderr: The specified entry was not found.^M
[Cmd Runner] 2024/02/08 16:15:02 DEBUG - Successful: true (0)
[NATS Handler] 2024/02/08 16:15:02 DEBUG - Cleaned ip-mac address cache for: 192.168.40.21.
[NATS Handler] 2024/02/08 16:15:02 DEBUG - Waiting to reconnect to nats.. Current attempt: 0, Connected: false
[NATS Handler] 2024/02/08 16:15:07 DEBUG - Reconnected to 192.168.40.21:4222
[File System] 2024/02/08 16:15:18 DEBUG - Checking if file exists \var\vcap\monit\stopped
[File System] 2024/02/08 16:15:18 DEBUG - Stat '\var\vcap\monit\stopped'
[agent] 2024/02/08 16:15:18 DEBUG - Building heartbeat
[File System] 2024/02/08 16:15:18 DEBUG - Checking if file exists \var\vcap\bosh\spec.json
From the above example logs, NATS was disconnected for 10 seconds, between 16:14:57 and 16:15:07.
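To check other Windows workers for the same pattern, the agent log can be searched for these disconnect messages around the time of the failed task. A hypothetical example, assuming the agent log files are under C:\var\vcap\bosh\log on the worker (the exact path can vary by stemcell version):
C:\> findstr /C:"Nats disconnected" C:\var\vcap\bosh\log\*.log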
The issue can go unnoticed in the environment because the instance eventually reaches a "running" state even though the scan & fix task failed. However, because the task failed, the "post-start" action is not executed.
This issue affects Tanzu Kubernetes Grid Integrated Edition <= v1.20.
The root cause is that when the "ovsdbserver" job starts, it configures "ovs", which reconfigures the networking in the VM. This networking reconfiguration during the "start" stage causes the NATS connectivity issue logged by the bosh agent.
To address this permanently, the networking reconfiguration by "ovs" must be moved from the "start" stage to the "pre-start" stage of the VM lifecycle. This is planned to be fixed in TKGI v1.21, and to be patched in v1.20.x.
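For context, BOSH runs a job's bin/pre-start script before monit starts the job's processes, while "post-start" only runs after the "start" stage succeeds. The sketch below is a generic illustration of how a release job can move one-time setup into pre-start; it is not the actual TKGI "ovs"/"ovsdbserver" job layout:
# spec of a hypothetical BOSH release job
name: example-job
templates:
  pre-start.erb: bin/pre-start   # one-time networking setup would run here, before "start"
  ctl.erb: bin/ctl               # monit-managed start/stop only launches the process
packages: []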
As a temporary workaround (which won't persist through cluster upgrades/updates), the bosh manifest of the cluster deployment can be modified to increase the "update_watch_time" setting for the windows-worker instances. Raising the lower bound of this range makes the Director wait longer after the "start" action before sending the first "get_state", giving the agent time to reconnect to NATS. The following steps can be followed for this temporary workaround:
1. Download the bosh manifest of the cluster deployment:
$ bosh -d $DEPLOYMENT manifest > cluster-manifest.yaml
2. In cluster-manifest.yaml, locate the "update_watch_time" setting that applies to the windows-worker instances, for example:
update_watch_time: 10000-300000
3. Change the setting accordingly, increasing the lower bound of the range:
update_watch_time: 20000-300000
4. Save the file after changing the above.
5. Deploy the modified manifest:
$ bosh -d $DEPLOYMENT deploy cluster-manifest.yaml
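Once the deploy completes, the change can be confirmed by re-downloading the manifest and checking the setting, for example:
$ bosh -d $DEPLOYMENT manifest | grep update_watch_time
The windows-worker instances should now show the increased range (20000-300000 in this example), giving the agent time to reconnect to NATS before "get_state" is sent during subsequent scan & fix runs.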