vSAN Cluster Partition and host flapping when RDMA is enabled on unsupported NICs

Article ID: 431690

Updated On:

Products

VMware vSAN

Issue/Introduction

The vSAN cluster experiences degraded performance, host flapping, and "Network Partition" or "vSAN Cluster Partition" health check failures following a cluster-wide reboot.
Standard network diagnostics, such as ICMP pings with 9000 MTU jumbo frames, succeed between all hosts.
Physical NIC counters report no dropped packets, flow control events, or CRC errors.
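
The jumbo-frame ping described above is typically run from the ESXi shell with vmkping. As a minimal sketch, the Python helper below computes the largest ICMP payload that fits in a given MTU without fragmentation (the MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header) and builds the matching command line. The interface name vmk1 and the target address are placeholders, not values taken from this article. A successful result only demonstrates TCP/IP reachability; it does not exercise the RDMA path.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def icmp_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    payload = mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
    if payload <= 0:
        raise ValueError(f"MTU {mtu} is too small for an ICMP echo")
    return payload


def vmkping_command(vmk: str, target_ip: str, mtu: int = 9000) -> str:
    """Build a vmkping command line.

    -I selects the VMkernel interface, -s sets the payload size,
    and -d sets the don't-fragment bit.
    """
    return f"vmkping -I {vmk} -s {icmp_payload_for_mtu(mtu)} -d {target_ip}"


if __name__ == "__main__":
    # vmk1 and 192.0.2.11 are placeholder values for illustration only.
    print(vmkping_command("vmk1", "192.0.2.11"))
```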

Environment

VMware vSAN (7.0 U2 and later)
VMware vSphere Foundation (VVF)
VMware Cloud Foundation (VCF)

Cause

Remote Direct Memory Access (RDMA) is enabled at the vSAN cluster level, but the physical Network Interface Cards (NICs) lack hardware support for the protocol (e.g., RoCE v2).
High network utilization, such as a vSAN resynchronization or other heavy traffic, can cause the unsupported RDMA transport stack to fail even if it was working previously.
This leads to intermittent transport failures and host partitioning, even though underlying Layer 2/3 TCP/IP traffic remains unaffected and standard pings succeed.
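
One way to spot this mismatch on a host is to compare the uplinks that carry vSAN traffic against the RDMA devices the host reports, for example from the output of esxcli rdma device list. The Python sketch below parses such a listing and returns the vSAN uplinks that have no paired RDMA device. The sample text, its column layout, and the uplink names are illustrative assumptions rather than captured output. A device appearing in the list does not by itself confirm that the adapter is supported for vSAN over RDMA; the hardware compatibility guide remains the authority.

```python
import re
from typing import Dict, List

# Illustrative sample only; real output may differ in columns and spacing.
SAMPLE_OUTPUT = """\
Name     Driver      State   MTU   Speed    Paired Uplink  Description
-------  ----------  ------  ----  -------  -------------  -----------
vmrdma0  nmlx5_rdma  Active  1024  25 Gbps  vmnic4         Example RDMA-capable adapter
"""


def rdma_paired_uplinks(listing: str) -> Dict[str, str]:
    """Map each paired uplink (vmnicN) to its RDMA device (vmrdmaN)."""
    pairs: Dict[str, str] = {}
    for line in listing.splitlines():
        fields = line.split()
        if not fields or not fields[0].startswith("vmrdma"):
            continue  # skip the header, separator, and blank lines
        match = re.search(r"\bvmnic\d+\b", line)
        if match:
            pairs[match.group(0)] = fields[0]
    return pairs


def uplinks_without_rdma(vsan_uplinks: List[str], listing: str) -> List[str]:
    """Return the vSAN uplinks that have no paired RDMA device."""
    paired = rdma_paired_uplinks(listing)
    return [uplink for uplink in vsan_uplinks if uplink not in paired]


if __name__ == "__main__":
    # vmnic4 and vmnic5 are hypothetical vSAN uplinks.
    print(uplinks_without_rdma(["vmnic4", "vmnic5"], SAMPLE_OUTPUT))
```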

Resolution

To resolve the network partition, RDMA must be disabled to force the vSAN transport stack to fall back to the native TCP protocol.

  1. Log in to the vSphere Client.

  2. Select the affected vSAN Cluster in the inventory.

  3. Navigate to Configure > vSAN > Services.

  4. In the Network section, click Edit.

  5. Disable the toggle for RDMA support and click Apply.

  6. One host at a time, place each affected host into Maintenance Mode (Ensure Accessibility) and reboot it to clear any hung network states or driver inconsistencies.
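
The one-host-at-a-time ordering in the final step can be sketched as a simple loop. The Python example below is an outline under stated assumptions, not a VMware tool: the four callables (enter_maintenance, reboot, exit_maintenance, has_rejoined) are hypothetical hooks that would be wired to whatever automation is in use, and the host names are placeholders. The loop stops as soon as a rebooted host fails to rejoin, so that a second host is never taken down while the first is still out of the cluster.

```python
from typing import Callable, Iterable, List

HostAction = Callable[[str], None]


def rolling_reboot(
    hosts: Iterable[str],
    enter_maintenance: HostAction,  # hypothetical: enter Maintenance Mode (Ensure Accessibility)
    reboot: HostAction,             # hypothetical: reboot and wait for the host to return
    exit_maintenance: HostAction,   # hypothetical: exit Maintenance Mode
    has_rejoined: Callable[[str], bool],  # hypothetical: True once the host is back in the cluster
) -> List[str]:
    """Process hosts strictly one at a time and return those completed."""
    completed: List[str] = []
    for host in hosts:
        enter_maintenance(host)
        reboot(host)
        exit_maintenance(host)
        if not has_rejoined(host):
            # Do not touch the next host while this one is still missing.
            raise RuntimeError(f"{host} did not rejoin the cluster; stopping")
        completed.append(host)
    return completed


if __name__ == "__main__":
    # Stub hooks that only print, to show the order of operations.
    done = rolling_reboot(
        ["esx01", "esx02"],  # placeholder host names
        enter_maintenance=lambda h: print(f"{h}: enter maintenance mode"),
        reboot=lambda h: print(f"{h}: reboot"),
        exit_maintenance=lambda h: print(f"{h}: exit maintenance mode"),
        has_rejoined=lambda h: True,
    )
    print("completed:", done)
```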

Additional Information

Disabling RDMA forces vSAN to use the standard TCP stack, which is natively supported by the existing physical NICs. The brief protocol cutover may cause a momentary pause in packet processing, so this change should ideally be performed during a maintenance window.

For further details on RDMA usage, prerequisites, and verifying hardware compatibility, please refer to the official documentation: