vSAN witness randomly disconnects or flaps due to physical NIC packet drops

Article ID: 438041

Products

VMware vSAN

Issue/Introduction

  • A vSAN stretched cluster or 2-node cluster reports partition health errors.
  • The vSAN witness host repeatedly drops and rejoins the cluster membership, causing flapping connections.
  • vmkernel logs report errors such as the following: 
    In(182) vmkernel: cpu33:2099855)CMMDS: CMMDSStateMachineReceiveLoop:1654:########-####-####-####-############: Error receiving from ########-####-####-####-############: Failure
    In(182) vmkernel: cpu33:2099855)CMMDS: CMMDSStateDestroyNode:########-####-####-####-############: Destroying node ########-####-####-####-############: Failed to receive from node
    In(182) vmkernel: cpu33:2099855)CMMDS: LeaderRemoveNodeFromMembership:########-####-####-####-############: Removing node ########-####-####-####-############ (vsanNodeType: witness) from the cluster membership
    In(182) vmkernel: cpu33:2099855)CMMDS: CMMDSClusterDestroyNodeImpl:262: Destroying node ########-####-####-####-############ from the cluster db. Last HB received from node - 17521750521575764
    In(182) vmkernel: cpu33:2099855)CMMDS: CMMDSUtil_PrintArenaEntry:98: ########-####-####-####-############: [1466218441]:Adding a new Membership entry (########-####-####-####-############) with 18 members:
  • The vSAN health service raises a Cluster Partition alarm.
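
To gauge how often the witness is flapping, the removal messages in the vmkernel log can be counted. The following is a minimal Python sketch, not a supported tool: the function name count_removals is hypothetical, and the pattern is based only on the LeaderRemoveNodeFromMembership line shown above, so it may need adjusting if the message format differs on your build.

```python
import re
from collections import Counter

# Pattern derived from the "Removing node ... (vsanNodeType: ...)" message
# in the log excerpt above. The node UUID is captured loosely as any
# non-whitespace run because the excerpt masks its exact form.
REMOVAL = re.compile(
    r"LeaderRemoveNodeFromMembership:.*?Removing node (\S+) "
    r"\(vsanNodeType: (\w+)\) from the cluster membership"
)


def count_removals(lines):
    """Count membership-removal events per (node UUID, node type)."""
    events = Counter()
    for line in lines:
        match = REMOVAL.search(line)
        if match:
            events[(match.group(1), match.group(2))] += 1
    return events
```

As a usage example, assuming the log has been copied off the host to a local file named vmkernel.log, count_removals(open("vmkernel.log", errors="replace")) returns a Counter; repeated entries with node type witness are consistent with the flapping described above.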

Environment

VMware vSAN (All versions)

Cause

The physical network interface card (NIC) on the ESXi host is experiencing ring buffer exhaustion or hardware-level packet drops.
This prevents the delivery of UDP-based CMMDS heartbeats required to maintain vSAN cluster membership, causing the witness host to temporarily drop out of the cluster.
Outdated NIC drivers and firmware frequently contain bugs related to buffer management and packet processing under load.

Resolution

  1. Check for packet drops at the ESXi host level by running the following command over SSH (replace <vmnic_name> with the actual interface, e.g., vmnic0):
     esxcli network nic stats get -n <vmnic_name>

  2. Review the output for increasing values in Receive packets dropped, Transmit packets dropped, Rx/Tx errors, or ring full events. Run the command several times a few minutes apart: a counter that keeps rising indicates ongoing drops, whereas a static non-zero value may reflect an earlier event.

  3. Identify the current NIC driver and firmware version on the host using:
     esxcli network nic get -n <vmnic_name>

  4. Navigate to the VMware Compatibility Guide (VCG) for IO Devices.

  5. Search for the specific NIC model (from vendors such as Broadcom, Intel, or Mellanox) and confirm that the host is running the latest certified driver and firmware combination listed for your ESXi release.

  6. If packet drops or ring full events continue after applying the latest certified firmware and driver, engage the hardware vendor to assess whether the NIC ring buffer sizes can be safely increased or to investigate upstream physical switch port congestion.
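
The counter comparison in steps 1 and 2 can be sketched in Python as follows. This is an illustrative helper, not part of esxcli: the function names and the keyword list are assumptions, and the parser assumes the command prints one "Name: value" pair per line, so verify it against the actual output on your host.

```python
def parse_nic_stats(text):
    """Parse 'Name: integer' lines from saved `esxcli network nic stats get` output."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().rpartition(":")
        value = value.strip()
        # Skip headings and any line whose value is not a plain integer.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


# Lowercase substrings that mark a counter as drop- or error-related.
# Assumed list; extend it to match the counters your NIC driver exposes.
KEYWORDS = ("dropped", "error", "ring full")


def rising_counters(before, after):
    """Return {counter: increase} for drop/error counters that grew between two samples."""
    rising = {}
    for name, new_value in after.items():
        if name not in before:
            continue
        if not any(keyword in name.lower() for keyword in KEYWORDS):
            continue
        delta = new_value - before[name]
        if delta > 0:
            rising[name] = delta
    return rising
```

To use it, save the command output twice a few minutes apart, pass each capture through parse_nic_stats, and hand the two dictionaries to rising_counters. An empty result suggests the drop and error counters are not currently increasing on that interface.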

Ensure that physical switchports connecting to the ESXi host are properly configured and not dropping packets due to flow control mismatches, MTU sizing issues, or oversubscription.

Additional Information

Determining Network/Storage firmware and driver version in ESXi

Download and install async drivers in VMware ESXi