ESXi host experiences a PSOD after a vmnic failover when vSAN is used over RDMA

Article ID: 433073

Products

VMware vSphere ESX 8.x

Issue/Introduction

Symptoms:

  • vSAN is used over RDMA.
  • An actively used vmnic configured for vSAN failed over, and the ESXi host experienced a PSOD shortly afterward.
  • The /var/log/vmkernel.log file contains entries similar to the following:

YYYY-MM-DDTHH:MM:SS cpu80:xxxxxx)WARNING: rdmaDriver: RDMAIsTeamUplinkChanged:3505: oldUplink = vmnicX newUplink = vmnicY
YYYY-MM-DDTHH:MM:SS cpu65:xxxxxx)RDT: RDTRDMAServerCMEventCB:2558: VMK_RDMA_CM_EVENT_ADDR_CHANGE event occured, cmID xxxxxx, eventType 14 cluster Protocol 2
YYYY-MM-DDTHH:MM:SS cpu65:xxxxxx)RDT: RDTRDMAServerCMEventCB:2561: RDMA properties changed
YYYY-MM-DDTHH:MM:SS cpu80:xxxxxx)WARNING: rdmaDriver: RDMAIsTeamUplinkChanged:3505: oldUplink = vmnicX newUplink = vmnicY
YYYY-MM-DDTHH:MM:SS cpu80:xxxxxx)WARNING: rdmaDriver: RDMAIsTeamUplinkChanged:3505: oldUplink = vmnicX newUplink = vmnicY
YYYY-MM-DDTHH:MM:SS cpu80:xxxxxx)WARNING: rdmaDriver: RDMAIsTeamUplinkChanged:3505: oldUplink = vmnicX newUplink = vmnicY
YYYY-MM-DDTHH:MM:SS cpu80:xxxxxx)WARNING: rdmaDriver: RDMAIsTeamUplinkChanged:3505: oldUplink = vmnicX newUplink = vmnicY
YYYY-MM-DDTHH:MM:SS cpu80:xxxxxx)WARNING: rdmaDriver: RDMAIsTeamUplinkChanged:3505: oldUplink = vmnicX newUplink = vmnicY
YYYY-MM-DDTHH:MM:SS cpu65:xxxxxx)RDT: RDTRdmaClientCMEventCB:3380: Dropped client connect event 14, new event 14 rdmaConn xxxxxx
YYYY-MM-DDTHH:MM:SS cpu65:xxxxxx)RDT: RDTRdmaClientCMEventCB:3380: Dropped client connect event 14, new event 14 rdmaConn xxxxxx
YYYY-MM-DDTHH:MM:SS cpu65:xxxxxx)RDT: RDTRdmaClientCMEventCB:3380: Dropped client connect event 14, new event 14 rdmaConn xxxxxx

YYYY-MM-DDTHH:MM:SS cpu80:xxxxxx) opID=xxxxxx)RDT:RDTRDMAStopConnectionsForServer:995: waiting for 63 active connections to end
YYYY-MM-DDTHH:MM:SS cpu80:xxxxxx) opID=xxxxxx)RDT:RDTRDMAStopConnectionsForServer:998: Waiting for the connections to get terminated 63
YYYY-MM-DDTHH:MM:SS cpu80:xxxxxx) opID=xxxxxx)RDT: RDTRDMAStopConnectionsForServer:995: waiting for 7 active connections to end
YYYY-MM-DDTHH:MM:SS cpu80:xxxxxx) opID=xxxxxx)RDT: RDTRDMAStopConnectionsForServer:998: Waiting for the connections to get terminated 7
YYYY-MM-DDTHH:MM:SS cpu65:xxxxxx) opID=xxxxxx)RDT: RDTRDMAStopConnectionsForServer:995: waiting for 0 active connections to end
YYYY-MM-DDTHH:MM:SS cpu65:xxxxxx) opID=xxxxxx)RDT: RDTDestroyRDMAServer:2892: Calling server cmid destroy
YYYY-MM-DDTHH:MM:SS cpu65:xxxxxx) opID=xxxxxx)RDT: RDTCreateRDMAServer:2779: RDTCreateRDMAServer() exiting

YYYY-MM-DDTHH:MM:SS cpu80:xxxxxx cpu89:xxxxxx)Backtrace for current CPU #89, worldID=xxxxxx, fp=xxxxxx
YYYY-MM-DDTHH:MM:SS cpu80:xxxxxx cpu89:xxxxxx)0x453af0e1bea0:[0x42002bce596e][email protected]#0.0.0.1+0x142 stack: 0x100000000000750, 0x0, 0x0, 0x420054000000, 0x430384801630
YYYY-MM-DDTHH:MM:SS cpu80:xxxxxx cpu89:xxxxxx)0x453af0e1bf70:[0x42002bcd422f][email protected]#0.0.0.1+0x58 stack: 0x72, 0x4336bee11ba0, 0x0, 0x420029b9f7f9, 0x72
YYYY-MM-DDTHH:MM:SS cpu80:xxxxxx cpu89:xxxxxx)0x453af0e1bfa0:[0x420029b9f7f8]vmkWorldFunc@vmkernel#nover+0x31 stack: 0x420029b9f7f4, 0x0, 0x453af0e1f000, 0x453aeea9f100, 0x453af0e1f100
YYYY-MM-DDTHH:MM:SS cpu65:xxxxxx cpu89:xxxxxx)0x453af0e1bfe0:[0x42002a0d67b2]CpuSched_StartWorld@vmkernel#nover+0xbf stack: 0x0, 0x420029b44cf0, 0x0, 0x0, 0x0
YYYY-MM-DDTHH:MM:SS cpu65:xxxxxx cpu89:xxxxxx)0x453af0e1c000:[0x420029b44cef]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0, 0x0, 0x0, 0x0, 0x0
YYYY-MM-DDTHH:MM:SS cpu65:xxxxxx cpu89:xxxxxx)VMware ESXi 8.0.3 [Releasebuild-24280767 x86_64]
#PF Exception 14 in world xxxxxx:rdtNetworkWo IP xxxxxx addr 0x8
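
The log signature above can be checked for in a saved copy of vmkernel.log with a short script. The following is a minimal sketch, not a supported diagnostic tool: the function names and the matching rules are assumptions made for illustration, and the patterns are taken only from the excerpts in this article. It reports teaming uplink changes, VMK_RDMA_CM_EVENT_ADDR_CHANGE events, and the page fault in the rdtNetworkWo world. It does not verify the order or timing of the entries.

```python
import re

# Patterns derived from the vmkernel.log excerpts in this article.
UPLINK_CHANGE = re.compile(
    r"RDMAIsTeamUplinkChanged:\d+: oldUplink = (\S+) newUplink = (\S+)"
)
ADDR_CHANGE = re.compile(r"VMK_RDMA_CM_EVENT_ADDR_CHANGE")
PAGE_FAULT = re.compile(r"#PF Exception 14 in world \S*rdtNetworkWo")


def scan_vmkernel_log(lines):
    """Summarize signature matches found in an iterable of log lines."""
    summary = {
        "uplink_changes": [],      # list of (old uplink, new uplink) pairs
        "addr_change_events": 0,   # count of RDMA CM address-change events
        "rdt_page_fault": False,   # True if the rdtNetworkWo #PF line is seen
    }
    for line in lines:
        match = UPLINK_CHANGE.search(line)
        if match:
            summary["uplink_changes"].append((match.group(1), match.group(2)))
        if ADDR_CHANGE.search(line):
            summary["addr_change_events"] += 1
        if PAGE_FAULT.search(line):
            summary["rdt_page_fault"] = True
    return summary


def matches_signature(summary):
    """Heuristic (an assumption): an uplink change plus the rdtNetworkWo page fault."""
    return bool(summary["uplink_changes"]) and summary["rdt_page_fault"]
```

To use it, open the log file, pass the file object to scan_vmkernel_log(), and pass the result to matches_signature(). A positive result only indicates that the entries resemble this article's symptoms; it does not confirm the root cause.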

Environment

VMware vSphere ESXi 8

Cause

Due to a rare race condition, an ESXi host might fail with a purple diagnostic screen after a failover of the vmnic when vSAN is used over RDMA.

Resolution

This issue is resolved in ESXi 8.0 Update 3i (8.0u3i).
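
To judge whether a host that logged this PSOD predates the fix, the version and build number can be read from the banner line in the backtrace, such as the "VMware ESXi 8.0.3 [Releasebuild-24280767 x86_64]" entry shown in the symptoms. The sketch below is illustrative only. The function names are hypothetical, and the build number of the fixed release is not stated in this article, so it is a parameter that must be taken from the release notes. Comparing build numbers is assumed to be meaningful only within the same release line.

```python
import re

# Matches the banner printed in the PSOD backtrace, for example:
# "VMware ESXi 8.0.3 [Releasebuild-24280767 x86_64]"
BANNER = re.compile(r"VMware ESXi (\d+)\.(\d+)\.(\d+) \[Releasebuild-(\d+)")


def parse_esxi_banner(line):
    """Return ((major, minor, update), build) from a banner line, or None."""
    match = BANNER.search(line)
    if match is None:
        return None
    major, minor, update, build = (int(group) for group in match.groups())
    return (major, minor, update), build


def is_older_than_fix(host_build, fixed_build):
    """True if the host build is lower than the caller-supplied fixed build.

    fixed_build is an input taken from the release notes; this article
    does not state it.
    """
    return host_build < fixed_build
```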

Additional Information

Release notes for the release that resolves this issue:

https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/vsphere/8-0/release-notes/esxi-update-and-patch-release-notes/vsphere-esxi-80u3i-release-notes.html