ESXi host crashes with a PSOD after I/O stuck on the NVMe device when it tries to re-acquire lock
search cancel

ESXi host crashes with a PSOD after I/O stuck on the NVMe device when it tries to re-acquire lock

book

Article ID: 408769

calendar_today

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

  • The host crashes unexpectedly and display this PSOD on screen. 


 

  •  LogEFI.log will have the highlighted events. 

LogEFI.log 

YYYY-MM-DDTHH:MM:SS.<Time_Zone> In(14) LogEFI: cpu0:16405476)VMware ESXi 8.0.3 [Releasebuild-24585383 x86_64]
YYYY-MM-DDTHH:MM:SS.<Time_Zone> LogEFI[2099805] CPU 0 / World 16405476 tried to re-acquire lock
YYYY-MM-DDTHH:MM:SS.<Time_Zone> LogEFI[2099805]
YYYY-MM-DDTHH:MM:SS.<Time_Zone> LogEFI[2099805] Module(s) involved in panic: [vmkernel Version Releasebuild-24585383]
YYYY-MM-DDTHH:MM:SS.<Time_Zone> LogEFI: cpu0:16405476)cr0=0x8001003d cr2=0x430000876000 cr3=0x605000 cr4=0x14216c
YYYY-MM-DDTHH:MM:SS.<Time_Zone> LogEFI: cpu0:16405476)FMS=06/6a/6 uCode=0xd0003f5
YYYY-MM-DDTHH:MM:SS.<Time_Zone> LogEFI[2099805] *PCPU0:16405476/DeviceTaskmgmtHandler
YYYY-MM-DDTHH:MM:SS.<Time_Zone> LogEFI[2099805] PCPU  0: SSUUSUUUSSSUUUSISIIISSSSISISSSIIIISSUSSSSSSSSIUUUSSUUUSSSSSSIIUS
YYYY-MM-DDTHH:MM:SS.<Time_Zone> LogEFI[2099805] PCPU 64: SSUSUSIISISSUSIUSUUIIISUISIIISSS
YYYY-MM-DDTHH:MM:SS.<Time_Zone> LogEFI: cpu0:16405476)Code start: 0x42000aa00000 VMK uptime: 126:05:37:59.527
YYYY-MM-DDTHH:MM:SS.<Time_Zone> LogEFI: cpu0:16405476)0x453acd51bcc0:[0x42000ab7bbc0]PanicvPanicInt@vmkernel#nover+0x20c stack: 0x0
YYYY-MM-DDTHH:MM:SS.<Time_Zone> LogEFI: cpu0:16405476)0x453acd51bd70:[0x42000ab7c33c]Panic_NoSave@vmkernel#nover+0x4d stack: 0x453acd51bdd0
YYYY-MM-DDTHH:MM:SS.<Time_Zone> LogEFI: cpu0:16405476)0x453acd51bdd0:[0x42000aa28982]LockCheckSelfDeadlockInt@vmkernel#nover+0x9f stack: 0x200000000
YYYY-MM-DDTHH:MM:SS.<Time_Zone> LogEFI: cpu0:16405476)0x453acd51bde0:[0x42000ab852bd]SP_WaitLock@vmkernel#nover+0x16a stack: 0x1
YYYY-MM-DDTHH:MM:SS.<Time_Zone> LogEFI: cpu0:16405476)0x453acd51be20:[0x42000ab852f0]SPLockWork@vmkernel#nover+0x29 stack: 0x45e12e6a32c0
YYYY-MM-DDTHH:MM:SS.<Time_Zone> LogEFI: cpu0:16405476)0x453acd51be30:[0x42000b0c1df9]AsyncPopCallbackFrameInt@vmkernel#nover+0x1e stack: 0x45e12e6a32c0
YYYY-MM-DDTHH:MM:SS.<Time_Zone> LogEFI: cpu0:16405476)0x453acd51be60:[0x42000aebc699]PsaNVMe_AsyncTokenIODone@vmkernel#nover+0x76 stack: 0x430b1dd8e400
YYYY-MM-DDTHH:MM:SS.<Time_Zone> LogEFI: cpu0:16405476)0x453acd51bea0:[0x42000aeb84ad]PsaNvmeDeviceTimeoutHandlerFn@vmkernel#nover+0x3b2 stack: 0x100000004
YYYY-MM-DDTHH:MM:SS.<Time_Zone> LogEFI: cpu0:16405476)0x453acd51bf60:[0x42000af1d029]PsaStorDeviceTimeoutHandlerFn@vmkernel#nover+0x62 stack: 0x0
YYYY-MM-DDTHH:MM:SS.<Time_Zone> LogEFI: cpu0:16405476)0x453acd51bfa0:[0x42000afc2a9f]PsaStorTaskMgmtWorldFunc@vmkernel#nover+0x8c stack: 0x453a5469f100
YYYY-MM-DDTHH:MM:SS.<Time_Zone> LogEFI: cpu0:16405476)0x453acd51bfe0:[0x42000b0d67b2]CpuSched_StartWorld@vmkernel#nover+0xbf stack: 0x0
YYYY-MM-DDTHH:MM:SS.<Time_Zone> LogEFI: cpu0:16405476)0x453acd51c000:[0x42000ab44cef]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
YYYY-MM-DDTHH:MM:SS.<Time_Zone> LogEFI: cpu0:16405476)base fs=0x0 gs=0x420040000000 Kgs=0x0

  • The ESXi host is encountering problems communicating with its NVMe storage devices. The NVMe devices are either resetting or aborting commands prematurely, leading to I/O failures that the ESXi host then tries to retry.

    /var/run/log/vmkernel.log 

    YYYY-MM-DDTHH:MM:SS.<Time_Zone> cpu50:2098517)NvmeUtil: 502: Transient status for command 0x2 set to VMK_STORAGE_RETRY_OPERATION because of an abort/reset before the command timed out: cmdId.initiator=0x45e1307ffa70 cmdId.serialNumber=0x1)
    YYYY-MM-DDTHH:MM:SS.<Time_Zone> cpu50:2098517)NvmeUtil: 502: Transient status for command 0x2 set to VMK_STORAGE_RETRY_OPERATION because of an abort/reset before the command timed out: cmdId.initiator=0x453ac699a6b8 cmdId.serialNumber=0x659006)
    YYYY-MM-DDTHH:MM:SS.<Time_Zone> cpu50:2098517)NvmeUtil: 502: Transient status for command 0x2 set to VMK_STORAGE_RETRY_OPERATION because of an abort/reset before the command timed out: cmdId.initiator=0x45e1307ffa70 cmdId.serialNumber=0x1)
    YYYY-MM-DDTHH:MM:SS.<Time_Zone> cpu50:2098517)NvmeUtil: 502: Transient status for command 0x2 set to VMK_STORAGE_RETRY_OPERATION because of an abort/reset before the command timed out: cmdId.initiator=0x453ac699a6b8 cmdId.serialNumber=0x659006)
    YYYY-MM-DDTHH:MM:SS.<Time_Zone> cpu50:2098517)NvmeUtil: 502: Transient status for command 0x2 set to VMK_STORAGE_RETRY_OPERATION because of an abort/reset before the command timed out: cmdId.initiator=0x45e1307ffa70 cmdId.serialNumber=0x1)
    YYYY-MM-DDTHH:MM:SS.<Time_Zone> cpu50:2098517)NvmeUtil: 502: Transient status for command 0x2 set to VMK_STORAGE_RETRY_OPERATION because of an abort/reset before the command timed out: cmdId.initiator=0x453ac699a6b8 cmdId.serialNumber=0x659006)

Environment

VMware ESXi 8.0 Update 3d 

VMware ESXi 7.0 Update 3s

Cause

  • The PSOD happened because the thread found that the lock was already held by PCPU 0 when it tried to acquire the lock caused due to the I/O stuck on the NVME device. 

  • As per LogEFI.log events, CPU 0 was tried to re-acquire lock however the module undergone deadlock within it's storage subsystem due to the failure in communication with the PSA storage device issued timed out. 

  • This situation may trigger or exacerbate by an issue with the NVMe device itself or its due to the communication issue(e.g., an I/O operation to the NVMe device becoming "stuck" or timing out).

Resolution

      The fix is available in: 

  • ESXi 8.0 Update 3e(ESXi 8.0 P05) or above. 
  • ESXi 7.0 Update 3v(ESXi 7.0.3 P10) or above.