In a vSAN stretched cluster or 2-node configuration, the following symptoms are observed:
Multiple ESXi hosts enter a "Not Responding" state in vCenter Server.
VMs report extreme write latency while backend vSAN latency remains negligible (e.g., ~0.218 ms). This mismatch ("phantom latency") indicates that the delay does not originate at the storage devices.
Multiple VM High Availability (HA) events and guest OS unresponsiveness (stuck I/O) occur.
The CMMDSResolver reports the Witness interface as unhealthy (numAddresses=0).
DOM2PCPrintDescriptor records multiple stuck descriptors coinciding with guest-level unresponsiveness.
vSAN 8.x
This issue is caused by network instability on the link between the data sites and the vSAN Witness site. Network metrics may confirm high packet drop rates (e.g., 30.3% portRxDrops) during maintenance windows or network degradation.
When the Witness node misses heartbeats because of this instability, it partitions from the cluster. If the network condition is intermittent, the Witness node may rejoin and partition again in quick succession (flapping), so that cluster membership repeatedly fluctuates between N and N-1 members within seconds. Each membership change arrives before the cluster has settled from the previous one, which prevents vSAN from stabilizing its object states. As a result, I/O operations queue indefinitely at the Distributed Object Manager (DOM) layer, which leads to stuck descriptors and guest OS unresponsiveness.
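The flapping pattern described above can be illustrated with a minimal Python sketch. It is not a VMware tool: it assumes that membership counts have already been collected as (timestamp in seconds, member count) pairs by some other means, and the function name count_flaps, the sample format, and the 10-second window are hypothetical choices made for illustration only. A flap is counted whenever the cluster drops below its full member count and returns to it within the window.

```python
from typing import List, Tuple


def count_flaps(samples: List[Tuple[float, int]],
                full_count: int,
                window_s: float = 10.0) -> int:
    """Count partition/rejoin cycles that complete within window_s seconds.

    samples    -- (timestamp_seconds, member_count) pairs, sorted by time
                  (hypothetical format; how they are collected is out of scope)
    full_count -- expected number of cluster members (N)
    window_s   -- assumed threshold for treating a cycle as a flap
    """
    flaps = 0
    left_at = None   # time at which membership last dropped below N
    prev = None      # member count seen in the previous sample
    for ts, count in samples:
        if prev is not None:
            if prev == full_count and count < full_count:
                # A member (for example, the Witness) left the cluster.
                left_at = ts
            elif prev < full_count and count == full_count and left_at is not None:
                # Membership is back to N; count it only if the cycle was fast.
                if ts - left_at <= window_s:
                    flaps += 1
                left_at = None
        prev = count
    return flaps


if __name__ == "__main__":
    # Illustrative data: two fast cycles, then one slow outage.
    observed = [(0, 3), (2, 2), (4, 3), (5, 2), (8, 3), (100, 2), (200, 3)]
    print(count_flaps(observed, full_count=3))  # prints 2
```

Several fast cycles in a short period match the flapping behavior described here, whereas a single long outage (the last pair of samples in the example) is an ordinary partition and is not counted.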
Verify that the Witness network link meets VMware's mandatory requirements:
Ensure that traffic on the following essential vSAN ports is not intermittently dropped by security appliances or firewalls during high-load windows:
Check the vmkernel.log on the affected hosts for the following entries to confirm recovery: