ESXi host goes into PSOD when the nmlx5-rdma driver VIB is missing; the PSOD can occur following an upgrade attempt


Article ID: 371966



Products

VMware vSphere ESXi 7.0

Issue/Introduction

The ESXi host fails with a PSOD caused by an invalid memory access (#PF Exception 14) in nmlx5_rdma_Attach().

 

Environment

VMware ESXi 7.0.3 [Release Build-23794027]
VMware ESXi 7.0.3 [Release Build-23307199] 

Cause

1. The PSOD occurred after ESXi and its device drivers were updated. ESXi was then rolled back so that the host could boot and recover from the PSOD. The nmlx5_rdma module is not installed in the current (rolled-back) state, but /altbootbank/boot.cfg lists nmlx5_rd. This shows that the module was installed by the update and that it prevented the system from booting.
2. The invalid memory access appears to have happened in nmlx5_rdma_Attach(). On checking, the nmlx5-core and nmlx5-rdma VIBs are the same version, and that version is compatible with the vmnic. However, compared with a working host with the same hardware and ESXi build, the problematic host was missing the nmlx5-rdma driver.
3. The result is a mismatch between the nmlx5_core and nmlx5_rdma drivers. Both must be at the same version to be compatible; otherwise, unexpected behavior such as a PSOD can occur.
4. A missing nmlx5-rdma driver will also cause the PSOD.
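
The boot.cfg check described in point 1 can be sketched as follows. This is a minimal illustration, not a supported tool: it assumes the usual boot.cfg layout in which a modules= line lists module files separated by " --- ". The sample file contents, the module file names other than the nmlx5_rd prefix, and the function names are hypothetical.

```python
def boot_cfg_modules(boot_cfg_text):
    """Return the module file names listed on the modules= line of a boot.cfg."""
    for line in boot_cfg_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "modules":
            # Entries on the modules= line are separated by " --- ".
            return [name.strip() for name in value.split("---") if name.strip()]
    return []


def has_module_prefix(boot_cfg_text, prefix):
    """True if any listed module file name starts with the given prefix."""
    return any(name.startswith(prefix) for name in boot_cfg_modules(boot_cfg_text))


# Hypothetical, heavily shortened boot.cfg contents for illustration only.
BOOTBANK_CFG = """bootstate=0
kernel=b.b00
modules=jumpstrt.gz --- useropts.gz --- nmlx5_co.v00
"""
ALTBOOTBANK_CFG = """bootstate=0
kernel=b.b00
modules=jumpstrt.gz --- useropts.gz --- nmlx5_co.v00 --- nmlx5_rd.v00
"""

if __name__ == "__main__":
    for label, text in (("bootbank", BOOTBANK_CFG), ("altbootbank", ALTBOOTBANK_CFG)):
        print(label, "lists nmlx5_rd:", has_module_prefix(text, "nmlx5_rd"))
```

In practice the two strings would be replaced by the contents of /bootbank/boot.cfg and /altbootbank/boot.cfg copied from the host. A module that appears only in the alternate boot bank matches the situation described above, where the update installed the module and the rollback removed it.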

Resolution

1. The NIC drivers and firmware were updated to the latest version for ESXi 7.0 U3 EP8 (build 21313628): nmlx5_core version 4.22.73.1006-1OEM.

Previously installed drivers on the problematic ESXi host: the required nmlx5-rdma VIB was missing.

Working host: the required nmlx5-rdma VIB was present.

2. After the required nmlx5-rdma VIB was updated on the affected host, all VIBs showed as correctly installed.

3. The ESXi host was updated to the latest patch build, ESXi 7.0 U3 build 23794027, and no further PSOD was observed.
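
One way to confirm that the driver pair is consistent after such an update is to check the saved output of `localcli software vib list` for both VIBs and compare their versions. The sketch below is a hedged example under stated assumptions: it assumes the first two whitespace-separated columns are the VIB name and version, and the function names are hypothetical. The sample rows are taken from the host listings in this article.

```python
def vib_versions(vib_list_text):
    """Map VIB name to version from 'localcli software vib list' output."""
    versions = {}
    for line in vib_list_text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and any separator row of dashes.
        if len(fields) >= 2 and fields[0] != "Name" and not fields[0].startswith("-"):
            versions[fields[0]] = fields[1]
    return versions


def check_nmlx5_pair(vib_list_text):
    """Report whether nmlx5-core and nmlx5-rdma are both present at one version."""
    versions = vib_versions(vib_list_text)
    core = versions.get("nmlx5-core")
    rdma = versions.get("nmlx5-rdma")
    missing = [name for name, ver in (("nmlx5-core", core), ("nmlx5-rdma", rdma)) if ver is None]
    if missing:
        return "missing: " + ", ".join(missing)
    if core != rdma:
        return f"version mismatch: nmlx5-core {core}, nmlx5-rdma {rdma}"
    return "ok"


# Rows copied from the problematic and working host listings in this article.
PROBLEMATIC = """Name Version Vendor Acceptance Level Install Date
nmlx5-core 4.22.73.1004-1OEM.703.0.0.18644231 MEL VMwareCertified 2024-06-26
nmlx4-core 3.19.16.8-2vmw.703.0.20.19193900 VMW VMwareCertified 2024-06-26
"""
WORKING = """Name Version Vendor Acceptance Level Install Date
nmlx5-core 4.22.73.1004-1OEM.703.0.0.18644231 MEL VMwareCertified 2023-08-29
nmlx5-rdma 4.22.73.1004-1OEM.703.0.0.18644231 MEL VMwareCertified 2023-09-11
"""

if __name__ == "__main__":
    print("problematic host:", check_nmlx5_pair(PROBLEMATIC))
    print("working host:", check_nmlx5_pair(WORKING))
```

The check treats a missing VIB and a version mismatch as separate findings, since the Cause section describes both as conditions that can lead to the PSOD.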

Additional Information

PSOD Backtrace:

#PF Exception 14 in world 2099415:vmkdevmgr IP 0x######### addr 0x42ac PTES: 0x20041a08027:0x10036f75027:0x0;
cr0=0x80010031 cr2=0x42ac cr3=0x10037223000 cr4=0x140768
FMS=19/01/1 uCode=0xa0011d1
frame=0x453ab471b310 ip=######### err=0x0 rflags=0x10206 rax=0x4315adc031c0 rbx=0x4315adc02d00 rcx=0x0
rdx=0x3e rbp=0x42001ea5320c rsi=0x4315adc02d00
rdi=0x0 r8=0x100 r9=0x0
r10=0x250 r11=0x430381e00e58 r12=0x42001ea59ac0
r13=0x5562430d88016459 r14=0x42001ea59aa0 r15=0x0
*PCPU29:2099415/vmkdevmgr
PCPU  0: SSSSSUSSISSSSSSSSUSSUSUSSSSI IUSSSSUSSSISSSSSUSUSUSUSSSUSSSSSSSSS
PCPU 64: SSSUUSSUSSSSUSSSUSUSSUSUISSSSSSIISUSSUSSUSSSSSSUUSUSSUUSUSSSSISS
PCPU128: SSSSUISSSUUSUUSSUSSssssssssssssssUSSSUSSUUSSUSSUSIUUSIISSSSUSSS
PCPU192: SISUUSSUUSSSSIUSSUSSUSSSIUSSSUSIUSUSSSSSUSSSSUUSSUUUSSSSSSSSSSSS
Code start: 0x42001d600000 VMK uptime: 0:00:00:27.212
0x453ab471b3d0: [0x42001ea40ec8]nmlx5_rdma_Attach@(nmlx5_rdma)#<None>+0x5e4 stack: 0x3
0x453ab471b490: [0x42001d619ca1]Driver_AnnounceDevice@vmkernel#nover+0x1ce stack: 0x3eb471b5b8
0x453ab471b510: [0x42001d616dee]DeviceBind@vmkernel#nover+0x16b stack: 0xbad0007
0x453ab471b550: [0x42001d6182cb]DeviceVSIBind@vmkernel#nover+0xcc stack: 0x453ab471b5b0
0x453ab471b580: [0x42001d6182ed]Device_VSISetDriver@vmkernel#nover+0x1e stack: 0x453ab471b5b8
0x453ab471b5a0: [0x42001d6023dd]VSI_SetInfo@vmkernel#nover+0x2ca stack: 0x453ab471b6a0
0x453ab471b620: [0x42001db2bcca]UW64VMKSyscallUnpackVSI_Set@vmkernel#nover+0x1eb stack: 0x0
0x453ab471bee0: [0x42001dab542e]User_UWVMK64SyscallHandler@vmkernel#nover+0x183 stack: 0x6357baba1adac235
0x453ab471bf40: [0x42001d74b638]SyscallUWVMK64@vmkernel#nover+0x90 stack: 0x0
base fs=0x0 gs=0x420047400000 Kgs=0x0

The following listings compare the installed drivers on the PSOD-affected ESXi host with those on a working host:

Problematic host:

$ localcli software vib list

Name        Version                             Vendor  Acceptance Level  Install Date
nmlx5-core  4.22.73.1004-1OEM.703.0.0.18644231  MEL     VMwareCertified   2024-06-26
nmlx4-core  3.19.16.8-2vmw.703.0.20.19193900    VMW     VMwareCertified   2024-06-26
nmlx4-en    3.19.16.8-2vmw.703.0.20.19193900    VMW     VMwareCertified   2024-06-26
nmlx4-rdma  3.19.16.8-2vmw.703.0.20.19193900    VMW     VMwareCertified   2024-06-26

Working host:

$ localcli software vib list

Name        Version                             Vendor  Acceptance Level  Install Date
nmlx5-core  4.22.73.1004-1OEM.703.0.0.18644231  MEL     VMwareCertified   2023-08-29
nmlx5-rdma  4.22.73.1004-1OEM.703.0.0.18644231  MEL     VMwareCertified   2023-09-11
nmlx4-core  3.19.16.8-2vmw.703.0.20.19193900    VMW     VMwareCertified   2023-08-29
nmlx4-en    3.19.16.8-2vmw.703.0.20.19193900    VMW     VMwareCertified   2023-08-29
nmlx4-rdma  3.19.16.8-2vmw.703.0.20.19193900    VMW     VMwareCertified   2023-08-29
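
The comparison between the two hosts can also be done mechanically. The following is a rough sketch, assuming the two listings have been saved as text and that the first two whitespace-separated columns are the VIB name and version. The function names and result keys are hypothetical; the rows are the ones shown in the two listings.

```python
def parse_vibs(vib_list_text):
    """Map VIB name to version, ignoring blank lines and the header row."""
    vibs = {}
    for line in vib_list_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] != "Name":
            vibs[fields[0]] = fields[1]
    return vibs


def compare_hosts(working_text, suspect_text):
    """Summarise how the suspect host's VIB list differs from the working host's."""
    working, suspect = parse_vibs(working_text), parse_vibs(suspect_text)
    common = set(working) & set(suspect)
    return {
        "missing_on_suspect": sorted(set(working) - set(suspect)),
        "extra_on_suspect": sorted(set(suspect) - set(working)),
        "version_differs": sorted(n for n in common if working[n] != suspect[n]),
    }


PROBLEMATIC_HOST = """Name Version Vendor Acceptance Level Install Date
nmlx5-core 4.22.73.1004-1OEM.703.0.0.18644231 MEL VMwareCertified 2024-06-26
nmlx4-core 3.19.16.8-2vmw.703.0.20.19193900 VMW VMwareCertified 2024-06-26
nmlx4-en 3.19.16.8-2vmw.703.0.20.19193900 VMW VMwareCertified 2024-06-26
nmlx4-rdma 3.19.16.8-2vmw.703.0.20.19193900 VMW VMwareCertified 2024-06-26
"""
WORKING_HOST = """Name Version Vendor Acceptance Level Install Date
nmlx5-core 4.22.73.1004-1OEM.703.0.0.18644231 MEL VMwareCertified 2023-08-29
nmlx5-rdma 4.22.73.1004-1OEM.703.0.0.18644231 MEL VMwareCertified 2023-09-11
nmlx4-core 3.19.16.8-2vmw.703.0.20.19193900 VMW VMwareCertified 2023-08-29
nmlx4-en 3.19.16.8-2vmw.703.0.20.19193900 VMW VMwareCertified 2023-08-29
nmlx4-rdma 3.19.16.8-2vmw.703.0.20.19193900 VMW VMwareCertified 2023-08-29
"""

if __name__ == "__main__":
    for key, names in compare_hosts(WORKING_HOST, PROBLEMATIC_HOST).items():
        print(f"{key}: {names}")
```

For these two listings the only difference is that nmlx5-rdma is absent from the problematic host; install dates are ignored because they are expected to differ between hosts.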