vSAN Stretched Cluster Instability and Stuck I/O Due to Witness Node Flapping

Article ID: 438087

Products

VMware vSAN

Issue/Introduction

In a vSAN stretched cluster or 2-node configuration, the following symptoms are observed:

  • Multiple ESXi hosts enter a "Not Responding" state in vCenter Server.

  • Extreme VM-level write latency occurs while backend vSAN latency remains negligible (e.g., ~0.218 ms). The delay therefore does not originate in the storage backend ("phantom" latency).

  • Multiple VM High Availability (HA) events and guest OS unresponsiveness (stuck I/O) occur.

  • The CMMDSResolver reports the Witness interface as unhealthy (numAddresses=0).

  • DOM2PCPrintDescriptor records multiple stuck descriptors coinciding with guest-level unresponsiveness.

Environment

vSAN 8.x

Cause

This issue is caused by network instability on the link between the data sites and the vSAN Witness site. Network metrics may confirm high packet drop rates (e.g., 30.3% portRxDrops) during maintenance windows or network degradation.

When the Witness node misses heartbeats because of this instability, it partitions from the cluster. If the network condition is intermittent, the Witness node may rejoin and partition again in rapid succession (flapping), so that cluster membership repeatedly fluctuates between N and N-1 members within seconds. This rapid cycling forces the cluster into repeated synchronization loops and prevents vSAN from stabilizing its object states. I/O operations then queue indefinitely at the Distributed Object Manager (DOM) layer, which leads to stuck descriptors and guest OS unresponsiveness.
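The flapping pattern described above can be recognized from a series of membership-count observations. The following Python sketch is illustrative only: the function name, the sample format (timestamp, member count), and the thresholds (more than 3 changes within 60 seconds) are assumptions chosen for the example, not values defined by vSAN.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FlapReport:
    transitions: int     # number of membership-count changes seen
    span_seconds: float  # time between the first and last sample
    flapping: bool       # True if the change rate exceeded the threshold


def detect_flapping(samples: List[Tuple[float, int]],
                    max_transitions: int = 3,
                    window_seconds: float = 60.0) -> FlapReport:
    """samples: (timestamp_in_seconds, member_count) pairs, sorted by time.

    Reports flapping when more than max_transitions membership changes
    fall inside any sliding window of window_seconds. Both thresholds
    are hypothetical defaults for this sketch.
    """
    # Timestamps at which the member count differs from the previous sample.
    change_times = [
        t for (_, prev), (t, cur) in zip(samples, samples[1:]) if cur != prev
    ]
    flapping = False
    start = 0
    for end, t in enumerate(change_times):
        # Advance the left edge until the window spans <= window_seconds.
        while t - change_times[start] > window_seconds:
            start += 1
        if end - start + 1 > max_transitions:
            flapping = True
            break
    span = samples[-1][0] - samples[0][0] if samples else 0.0
    return FlapReport(len(change_times), span, flapping)


# Hypothetical observations: membership alternating between N=5 and N-1=4.
report = detect_flapping([(0.0, 5), (2.0, 4), (4.0, 5), (6.0, 4), (8.0, 5)])
print(report.flapping)  # True
```

A single partition followed by a rejoin minutes later does not trip the check; only repeated changes inside the window do, which matches the distinction between a clean Witness outage and flapping.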

Resolution

1. Immediate Mitigation

  • Stabilize Network Path: Ensure the network path to the vSAN Witness node has sufficient bandwidth and 0% packet loss.
  • Isolate Flapping Witness: If network maintenance is ongoing and stability cannot be guaranteed, consider temporarily isolating the Witness node to stop the flapping cycle until the maintenance is complete.

2. Permanent Resolution (Network Requirements)

Verify that the Witness network link meets VMware's mandatory requirements:

  • Latency: Round Trip Time (RTT) must be less than 200 ms.
  • Packet Loss: The link must maintain 0% packet loss.
  • Bandwidth: Ensure the link meets the minimum bandwidth thresholds based on the number of components in your cluster.
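As a rough aid for evaluating probe results against the latency and packet-loss requirements above, the following Python sketch flags violations. It assumes the RTT samples and sent/received packet counts were collected separately (for example, from a ping run toward the Witness); the function name and input format are hypothetical. Bandwidth is not checked because the threshold depends on the component count of the cluster.

```python
from typing import List, Sequence

RTT_LIMIT_MS = 200.0  # RTT requirement stated in this article


def check_witness_link(rtt_ms: Sequence[float],
                       sent: int,
                       received: int) -> List[str]:
    """Return a list of violations; an empty list means the samples pass."""
    if sent <= 0:
        raise ValueError("sent must be a positive packet count")
    problems = []
    lost = sent - received
    if lost > 0:
        pct = 100.0 * lost / sent
        problems.append(f"packet loss {pct:.1f}% (requirement: 0%)")
    if rtt_ms:
        worst = max(rtt_ms)
        if worst >= RTT_LIMIT_MS:
            problems.append(
                f"worst RTT {worst:.1f} ms "
                f"(requirement: < {RTT_LIMIT_MS:.0f} ms)"
            )
    return problems


# Hypothetical measurements, not data from a real cluster.
for issue in check_witness_link([12.0, 250.0], sent=1000, received=697):
    print(issue)
```

The check uses the worst observed RTT rather than the average, since a link that only intermittently exceeds the limit is exactly the condition that causes flapping.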

3. Firewall and Port Audit

Ensure that the following essential vSAN ports are not being intermittently dropped by security appliances or firewalls during high-load windows:

  • UDP 12321: vSAN Heartbeats.
  • TCP 2233: vSAN Transport.

4. Verification

Check the vmkernel.log on the affected hosts for the following entries to confirm recovery:

  • Successful heartbeat re-establishment with the Witness.
  • No new DOM2PCPrintDescriptor "stuck descriptors" entries after the network path has stabilized.
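To track whether the symptom markers are still appearing, a copy of the relevant log can be scanned for them. The Python sketch below matches only on the substrings named in this article (DOM2PCPrintDescriptor and numAddresses=0); the full message layout is not documented here, so no attempt is made to parse individual fields. The function name and the marker labels are hypothetical.

```python
from collections import Counter
from typing import Iterable

# Substrings taken from the symptoms listed in this article. Only substring
# matching is used because the exact message format is not given.
MARKERS = {
    "stuck_descriptor": "DOM2PCPrintDescriptor",
    "witness_unhealthy": "numAddresses=0",
}


def count_markers(lines: Iterable[str]) -> Counter:
    """Count how many lines contain each marker substring."""
    counts = Counter({name: 0 for name in MARKERS})
    for line in lines:
        for name, needle in MARKERS.items():
            if needle in line:
                counts[name] += 1
    return counts


# Usage against a local copy of a log file (the path is a placeholder):
# with open("vmkernel.log", errors="replace") as fh:
#     print(count_markers(fh))
```

Running the scan on log excerpts from before and after the network fix gives a simple comparison: the counts for the later excerpt should drop to zero once the Witness link is stable.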
