vSAN node PSODs with stuck IO after a disk failure

Article ID: 385443

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

Log Message Indicating Stuck IO:

cpu107:2098596)NVMEDEV:9363 recover controller 256
cpu107:2098596)NvmeDiscover: 6804: Scan operation 2 received on adapter vmhba1
cpu107:2098596)NvmeDiscover: 4724: Controller nqn.2021-08.com.intel:############## fuseOp 0 oncs 4e cmic 0 nscnt 80
cpu92:2097959)NvmeDeviceIO: 1973: cleanup from TM handler world
cpu92:2097959)NvmeDeviceIO: 150: StuckIoCounter for t10.NVMe____####_Ent_NVMe_#####_MU_U.2_1.6TB________################## : 0. Clearing PSA_STOR_DEVICE_FLAG_STUCK_IO_COND
cpu92:2097959)NvmeUtil: 428: Transient status for command 0x1 set to VMK_TIMEOUT because the timeout has expired: cmdId.initiator=0x430b28edf500 cmdId.serialNumber=0xb0828277)

Backtrace:
Wa(180) vmkwarning: cpu69:2097851)WARNING: Lock: 1660: (held by 0: Spin count exceeded 1 time(s) - possible deadlock.
In(182) vmkernel: cpu69:2097851)0x453a15d9bd90:[0x42000aa23e47]Lock_CheckSpinCount@vmkernel#nover+0x157 stack: 0xffffffffffffffef
In(182) vmkernel: cpu69:2097851)0x453a15d9bde0:[0x42000ab2453c]SP_WaitLock@vmkernel#nover+0xdd stack: 0x1
In(182) vmkernel: cpu69:2097851)0x453a15d9be20:[0x42000ab245fc]SPLockWork@vmkernel#nover+0x29 stack: 0x45e##04d57c0
In(182) vmkernel: cpu69:2097851)0x453a15d9be30:[0x42000aad551d]AsyncPopCallbackFrameInt@vmkernel#nover+0x1e stack: 0x45e##04d57c0
In(182) vmkernel: cpu69:2097851)0x453a15d9be60:[0x42000aef5f8d]PsaNVMe_AsyncTokenIODone@vmkernel#nover+0x76 stack: 0x430b1e6dac00
In(182) vmkernel: cpu69:2097851)0x453a15d9bea0:[0x42000af06095]PsaNvmeDeviceTimeoutHandlerFn@vmkernel#nover+0x3b2 stack: 0x99500000004
In(182) vmkernel: cpu69:2097851)0x453a15d9bf60:[0x42000aefe94d]PsaStorDeviceTimeoutHandlerFn@vmkernel#nover+0x62 stack: 0x0
In(182) vmkernel: cpu69:2097851)0x453a15d9bfa0:[0x42000af7d1eb]PsaStorTaskMgmtWorldFunc@vmkernel#nover+0x8c stack: 0x453a12c9f100
In(182) vmkernel: cpu69:2097851)0x453a15d9bfe0:[0x42000ae2c##9]CpuSched_StartWorld@vmkernel#nover+0xe2 stack: 0x0
In(182) vmkernel: cpu69:2097851)0x453a15d9c000:[0x42000aadbe7f]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
Al(177) vmkalert: cpu0:2098139)ALERT: NMI: 743: NMI IPI: RIPOFF(base):RBP:CS [0x9##98(0x42000aa00000):0x1:0x748] (Src 0x4, CPU0)
In(182) vmkernel: cpu0:2098139)0x453a1c79baf8:[0x42000aa9##97]Power_ArchPerformWait@vmkernel#nover+0xd4 stack: 0x420040001880
In(182) vmkernel: cpu0:2098139)0x453a1c79bb00:[0x42000aa938e9]Power_ArchSetCState@vmkernel#nover+0xba stack: 0x0
In(182) vmkernel: cpu0:2098139)0x453a1c79bb50:[0x42000ae263b1]CpuSchedIdleLoopInt@vmkernel#nover+0x292 stack: 0x0
In(182) vmkernel: cpu0:2098139)0x453a1c79bbc0:[0x42000ae2aa1c]CpuSchedDispatch@vmkernel#nover+0x1f21 stack: 0x453a00000001
In(182) vmkernel: cpu0:2098139)0x453a1c79be00:[0x42000ae2b441]CpuSchedWait@vmkernel#nover+0x362 stack: 0x800000000000006f
In(182) vmkernel: cpu0:2098139)0x453a1c79bf70:[0x42000acdd9ad]NetPollWorldCallback@vmkernel#nover+0x36 stack: 0x453a1bd10##5
In(182) vmkernel: cpu0:2098139)0x453a1c79bfe0:[0x42000ae2c##9]CpuSched_StartWorld@vmkernel#nover+0xe2 stack: 0x0
In(182) vmkernel: cpu0:2098139)0x453a1c79c000:[0x42000aadbe7f]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
In(182) vmkernel: cpu106:2101626)PLOG: PLOG_CleanupDefence:8383: Waiting for issueBarrier for device 52767924-0ecd-4541-1612-2c25d309716c
In(182) vmkernel: cpu106:2101626)PLOG: PLOG_CleanupDefence:8383: Waiting for issueBarrier for device 52767924-0ecd-4541-1612-2c25d309716c
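
As a rough aid for matching a host against the excerpts above, the following sketch counts lines that contain the same signatures. The patterns are copied from the log text in this article; the function name `find_signatures` and the signature labels are hypothetical, and a match only indicates similar symptoms, not a confirmed diagnosis.

```python
import re

# Signatures copied from the log excerpts in this article.
# The dictionary keys are illustrative labels, not VMware terminology.
SIGNATURES = {
    "nvme_timeout": re.compile(
        r"NvmeUtil: \d+: Transient status for command .* set to VMK_TIMEOUT"
    ),
    "stuck_io_flag": re.compile(r"PSA_STOR_DEVICE_FLAG_STUCK_IO_COND"),
    "spin_deadlock": re.compile(
        r"Spin count exceeded \d+ time\(s\) - possible deadlock"
    ),
    "timeout_handler": re.compile(r"PsaNvmeDeviceTimeoutHandlerFn@vmkernel"),
    "plog_barrier": re.compile(
        r"PLOG_CleanupDefence:\d+: Waiting for issueBarrier"
    ),
}


def find_signatures(lines):
    """Return a dict mapping each signature label to its matching line count."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

The function accepts any iterable of strings, so it can be fed an open log file (for example a vmkernel log copied off the host) or lines extracted from a support bundle.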

Cause

A command was flagged as "Stuck IO" but later completed. By the time it completed, the objects related to that command had already been freed (at the point the RED event was notified), and the late completion therefore caused the PSOD.
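
The ordering problem described above can be illustrated with a toy model. This is not vmkernel code: the `Command` class and the three functions are hypothetical stand-ins, and the guarded variant shows a generic defensive pattern rather than how the fixed build actually addresses the issue.

```python
class Command:
    """Toy stand-in for an in-flight IO command and its per-command objects."""

    def __init__(self, cmd_id):
        self.cmd_id = cmd_id
        self.token = {"callbacks": ["io_done"]}


def release_on_timeout(cmd):
    """Model the cleanup that runs once the command is treated as stuck."""
    cmd.token = None  # the per-command objects are freed here


def complete_unsafe(cmd):
    """Late completion that assumes the per-command objects still exist."""
    return cmd.token["callbacks"].pop()  # raises TypeError after release


def complete_guarded(cmd):
    """Late completion that first checks whether cleanup already ran."""
    if cmd.token is None:
        return None  # nothing left to complete
    return cmd.token["callbacks"].pop()
```

Calling `release_on_timeout(cmd)` and then `complete_unsafe(cmd)` fails because the completion touches state that no longer exists. In Python this surfaces as an exception; in a kernel the equivalent access to freed objects can bring down the host.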

Resolution

This issue is resolved in ESXi 8.0 Update 3e (build number 24674464).
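
A minimal sketch for checking a host's version string against that build number follows. It assumes the string has the form printed by `vmware -v`, for example `VMware ESXi 8.0.3 build-24674464`, and it assumes that build numbers increase within the ESXi 8.0 line. The function names are hypothetical, and hosts on other release lines are reported as unknown rather than guessed.

```python
import re

FIXED_BUILD = 24674464  # ESXi 8.0 Update 3e, per this article


def parse_build(version_line):
    """Extract the integer build number from a 'build-NNNN' token, or None."""
    match = re.search(r"build-(\d+)", version_line)
    return int(match.group(1)) if match else None


def has_fix(version_line):
    """Return True/False for ESXi 8.0 hosts, or None when it cannot be determined."""
    build = parse_build(version_line)
    if build is None or "ESXi 8.0" not in version_line:
        return None  # unparseable, or not on the 8.0 line this check assumes
    return build >= FIXED_BUILD
```

Returning `None` for anything outside the 8.0 line is deliberate: build numbers from different release lines are not directly comparable, so the sketch avoids giving a misleading answer.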