Enabling RDMA on a vSAN cluster results in a complete cluster network partition when the hosts meet all of the following conditions (the commands after this list can be used to verify each condition):
ESX hosts have more than 1TB of memory
Broadcom NICs BCM57414, BCM57416, BCM57508, BCM57454, BCM57504, BCM57412, BCM57502, BCM57417
bnxtnet driver version equal to or below 234.0.147.0
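
To verify whether a host meets these conditions, the installed memory and the NIC driver version can be checked from the ESXi shell. This is a sketch; vmnic4 is a placeholder for the actual vSAN uplink on your host.

# Installed physical memory (reported in bytes)
esxcli hardware memory get

# Driver name and version for the vSAN uplink (vmnic4 is a placeholder)
esxcli network nic get -n vmnic4

# Installed bnxt driver VIBs and their versions
esxcli software vib list | grep -i bnxt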
Upgrading the memory to more than 1TB on a server in a vSAN cluster with RDMA already enabled results in that host partitioning from the cluster, and the host may even PSOD with the backtrace below:
2025-05-02T03:30:30.071Z cpu47:2098189)@BlueScreen: #PF Exception 14 in world 2098189:rdmaMADPortP IP 0x42001d69aa2b addr 0x50f
PTEs:0x0;
2025-05-02T03:30:30.072Z cpu47:2098189)Code start: 0x42001c400000 VMK uptime: 0:00:43:57.069
2025-05-02T03:30:30.072Z cpu47:2098189)0x453a5da1bde0:[0x42001d69aa2b]ib_mad_completion_handler@com.vmware.rdma#1+0x7f stack: 0x4315cda1d6f8
2025-05-02T03:30:30.072Z cpu47:2098189)0x453a5da1bf20:[0x42001d69da12]RDMATicketWorkerFunction@com.vmware.rdma#1+0x1f stack: 0x42001d69da0c
2025-05-02T03:30:30.073Z cpu47:2098189)0x453a5da1bf60:[0x42001c55b8bf]HelperQueueFunc@vmkernel#nover+0x300 stack: 0x431588802c88
2025-05-02T03:30:30.073Z cpu47:2098189)0x453a5da1bfe0:[0x42001cad67b2]CpuSched_StartWorld@vmkernel#nover+0xbf stack: 0x0
2025-05-02T03:30:30.073Z cpu47:2098189)0x453a5da1c000:[0x42001c544cef]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
2025-05-02T03:30:30.084Z cpu47:2098189)base fs=0x0 gs=0x42004bc00000 Kgs=0x0
2025-05-02T03:30:30.084Z cpu47:2098189)CPU model name: Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz, FMS: 06/55/7, uCodeRev: 5003801
2025-05-02T03:30:30.084Z cpu47:2098189)PRODUCTNAME:ProLiant DL380 Gen10, VENDORNAME:HPE, SERIAL_NUMBER:##########, SERVER_UUID:37393150-3931-4d32-3232-############, VERSION:, SKU:P19719-B21, FAMILY:ProLiant
2025-05-02T03:30:30.084Z cpu47:2098189)BIOS_VENDOR:HPE, BIOS_VERSION:U30, BIOS_RELEASE_DATE:02/21/2025
2025-05-02T03:30:30.084Z cpu47:2098189)vmkernel 0x0 .data 0x0 .bss 0x0
2025-05-02T03:30:30.084Z cpu47:2098189)procfs 0x42001d00f000 .data 0x41ffc0000000 .bss 0x41ffc0000380
2025-05-02T03:30:30.084Z cpu47:2098189)vmkapi_v2_12_0_0_vmkernel_shim 0x42001d012000 .data 0x41ffc0400000 .bss 0x41ffc041a108
2025-05-02T03:30:30.084Z cpu47:2098189)vmkapi_v2_12_0_0_only_shim 0x42001d01a000 .data 0x41ffc0800000 .bss 0x41ffc080026f
2025-05-02T03:30:30.084Z cpu47:2098189)vmkapi_v2_11_0_0_vmkernel_shim 0x42001d01b000 .data 0x41ffc0c00000 .bss 0x41ffc0c19dc8
2025-05-02T03:30:30.084Z cpu47:2098189)vmkapi_v2_11_0_0_only_shim 0x42001d022000 .data 0x41ffc1000000 .bss 0x41ffc10000e2
2025-05-02T03:30:30.084Z cpu47:2098189)vmkapi_v2_10_0_0_vmkernel_shim 0x42001d023000 .data 0x41ffc1400000 .bss 0x41ffc1417e40
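
To check whether a host has already hit this signature, the vmkernel logs can be searched for the RDMA MAD completion handler frame shown above. This is a sketch; log paths and rotation naming can vary by release.

# Current vmkernel log
grep ib_mad_completion_handler /var/run/log/vmkernel.log

# Rotated, compressed vmkernel logs
zcat /var/run/log/vmkernel.*.gz | grep ib_mad_completion_handler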
VMware vSAN
vSAN RDMA
bnxtnet driver version equal to or below 234.0.147.0
Server memory greater than 1TB
Broadcom NICs BCM57414, BCM57416, BCM57508, BCM57454, BCM57504, BCM57412, BCM57502, BCM57417
bnxtnet driver versions equal to or below 234.0.147.0 do not support RDMA when server memory is above 1TB. This causes the vSAN cluster to become partitioned, or hosts to PSOD, when the servers in the cluster have mixed memory sizes.
For example, if the servers currently have 750GB of memory and the memory in one of the hosts is doubled to 1.5TB, that host will partition from the cluster and then PSOD.
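
A partitioned host can be confirmed from the host itself: its sub-cluster member count drops to 1 while the remaining hosts still list each other. This is a sketch; output field names vary slightly between releases.

# On the affected host, check vSAN cluster membership
esxcli vsan cluster get
# A partitioned host typically reports:
#   Sub-Cluster Member Count: 1
# while the remaining hosts report the full member count.

# Confirm which RDMA devices and drivers are in use on the host
esxcli rdma device list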
This has been addressed in bnxtroce driver 235.1.140.0. The driver is still being vetted by the OEM vendors, so engage your respective hardware vendor if you don't see this driver listed.
There are two options at this time:
1) Disable RDMA if the server memory is greater than 1TB. Once the updated driver with the fix is available, re-enable RDMA (a sample install flow is sketched after this list).
2) Keep the server memory at or below 1TB when using RDMA.
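
Once the fixed driver is published by your hardware vendor, a typical per-host install flow looks like the sketch below. The bundle path and vmnic name are placeholders; follow the vendor's installation instructions for the exact package and procedure.

# Enter maintenance mode before changing drivers (choose the vSAN
# data evacuation mode appropriate for your cluster)
esxcli system maintenanceMode set --enable true --vsanmode ensureObjectAccessibility

# Install the updated driver bundle (path is a placeholder)
esxcli software component apply -d /vmfs/volumes/datastore1/Broadcom-bnxt-driver-bundle.zip

# Reboot so the new driver loads
reboot

# After the reboot, confirm the driver version on the uplink (vmnic4 is a placeholder)
esxcli network nic get -n vmnic4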