Symptoms:
A compute-only cluster is mounting a vSAN datastore.
VMs on a specific host show as inaccessible. They may appear to be running and then become inaccessible when powered off.
vMotion of these VMs to other hosts fails with timeouts and the following error:
Failed waiting for data. Error 195887167. Connection closed by remote host, possibly due to timeout.
2024-11-13T23:11:09.50087Z VMotionStream [175250081:858948631672480427] failed to lookup swapfile: Failure.
Migration to host <##.###.##.###> failed with error Failure (195887105).
2024-11-13T23:11:09.527196Z vMotion migration [175250081:858948631672480427] failed to read stream keepalive: Connection closed by remote host, possibly due to timeout
There are no network partition alerts in Skyline Health
The host can see the vSAN datastore
Running ls against, or attempting to change into, a namespace directory on the datastore returns an I/O error.
The vSAN datastore functions normally on other hosts: VMs that are powered off and removed from inventory on the problem host can be registered and powered on from other hosts.
vmkping from the vSAN-tagged vmkernel port on the problem host cannot reach the vSAN IPs of the other hosts.
Impact:
Virtual machines fail to run on the impacted host
Identification:
The vobd log shows a vmnic coming online, after which heartbeats on the vSAN namespace objects begin failing. Once the vmnic goes offline again, the heartbeats recover.
Example messages (timestamps and details, including the affected vmnic, will vary):
2024-11-12T15:43:03.027Z: [netCorrelator] 133094369261us: [vob.net.dvport.uplink.transition.up] Uplink: vmnic0 is up. Affected dvPort: 4436/50 0c 2e 4f f4 82 1d 8f-c3 fc 56 20 3b 2f ea c4. 2 uplinks up
...
2024-11-12T15:43:13.404Z: [vmfsCorrelator] 1330954078562us: [vob.vmfs.heartbeat.timedout] 628b2dc6-########-####-############ c62f8f62-####-####-####-############
....
2024-11-12T17:02:03.034Z: [netCorrelator] 133094369261us: [vob.net.dvport.uplink.transition.down] Uplink: vmnic0 is down. Affected dvPort: 4436/50 0c 2e 4f f4 82 1d 8f-c3 fc 56 20 3b 2f ea c4. 1 uplinks up. Failed Criteria: 128
....
2024-11-12T17:02:04.458Z: [vmfsCorrelator] 1330954078562us: [vob.vmfs.heartbeat.recovered] Reclaimed heartbeat for volume 628b2dc6-########-####-############ (c62f8f62-####-####-####-############): [Timeout]
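The correlation described above can also be checked mechanically against a longer log. The following Python sketch is an illustration only, not a VMware tool: it assumes vobd lines follow the format of the example messages, and the function names parse and suspect_uplinks, as well as the five-minute correlation window, are hypothetical choices. It flags a vmnic when a heartbeat timeout follows shortly after that vmnic comes up and a heartbeat recovery follows shortly after it goes down.

```python
import re
from datetime import datetime, timedelta

# Assumed vobd line shape, taken from the example messages:
# <timestamp>Z: [<correlator>] <uptime>us: [<vob id>] <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+)Z: "
    r"\[\w+\] \d+us: \[(?P<vob>[\w.]+)\] (?P<msg>.*)$"
)
UPLINK_RE = re.compile(r"Uplink: (?P<nic>vmnic\d+) is (?P<state>up|down)")


def parse(lines):
    """Return (timestamp, kind, vmnic) tuples for the events of interest, in time order."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not look like vobd messages
        ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S.%f")
        vob = m["vob"]
        if vob.startswith("vob.net.dvport.uplink.transition."):
            u = UPLINK_RE.search(m["msg"])
            if u:
                events.append((ts, "uplink_" + u["state"], u["nic"]))
        elif vob == "vob.vmfs.heartbeat.timedout":
            events.append((ts, "hb_timeout", None))
        elif vob == "vob.vmfs.heartbeat.recovered":
            events.append((ts, "hb_recovered", None))
    # Sort on the timestamp only; comparing the None entries would raise TypeError.
    return sorted(events, key=lambda e: e[0])


def suspect_uplinks(events, window=timedelta(minutes=5)):
    """Map vmnic -> [timeouts after it came up, recoveries after it went down].

    Only vmnics with at least one of each are returned. The 5-minute
    window is a hypothetical default, not a vSAN threshold.
    """
    last_up, last_down, hits = {}, {}, {}
    for ts, kind, nic in events:
        if kind == "uplink_up":
            last_up[nic] = ts
        elif kind == "uplink_down":
            last_down[nic] = ts
        elif kind == "hb_timeout":
            for n, up_ts in last_up.items():
                # Count it if n came up recently and has not gone down since.
                still_up = last_down.get(n, datetime.min) < up_ts
                if still_up and ts - up_ts <= window:
                    hits.setdefault(n, [0, 0])[0] += 1
        elif kind == "hb_recovered":
            for n, down_ts in last_down.items():
                # Count it if n went down recently and has not come back up since.
                still_down = last_up.get(n, datetime.min) < down_ts
                if still_down and ts - down_ts <= window:
                    hits.setdefault(n, [0, 0])[1] += 1
    return {n: h for n, h in hits.items() if h[0] and h[1]}
```

On an ESXi host the vobd messages are typically written to /var/log/vobd.log, so one possible use is to open that file and pass the file object to parse, then pass the result to suspect_uplinks. A flagged vmnic is only a candidate; confirm it with vmkping from the vSAN vmkernel port before taking it out of service.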
Cause:
On a compute-only cluster mounting a vSAN datastore, a vmnic backing the vSAN vmkernel port cannot reach the vSAN network, even though other vmnics backing the same vmkernel port can reach it.
Resolution:
Take the problem vmnic offline, or remove it from use on the vSAN vmkernel port or on the host.
It is recommended that the customer work with their network team to determine why the network adapter cannot reach the correct network.