vSAN node PSODs with stuck IO after a disk failure



Article ID: 385443



Products

VMware vSphere ESXi

Issue/Introduction

The purple diagnostic screen (PSOD) looks similar to the following:

Version Details: VMware ESXi 8.0.3 build-24585383

Panic Message: @BlueScreen: NMI IPI: Panic requested by another PCPU. PC 0x420017690e24, SP 0x453bc321ba28 (Src 0x4, CPU0)
Backtrace:
  0x452a00002d30:[0x42001777bbc0]PanicvPanicInt@vmkernel#nover+0x20c stack: 0x30, 0x42001777bbc0, 0x0, 0x420000000001, 0x42001777bbc0
  0x452a00002de0:[0x42001777c396]Panic_WithBacktrace@vmkernel#nover+0x57 stack: 0x452a00002e50, 0x452a00002e00, 0x453bc321f000, 0x452a00002eaf, 0x420017690e24
  0x452a00002e50:[0x4200177782a1]NMI_Interrupt@vmkernel#nover+0x516 stack: 0x15ac, 0x11d8e, 0x29800000002, 0xfffffffffffffffc, 0x11d93
  0x452a00002f10:[0x420017ca0404]IDTNMIWork@vmkernel#nover+0x95 stack: 0x0, 0x420017ca186d, 0x0, 0x420017c9b0c7, 0x750
  0x452a00002f30:[0x420017ca186c]Int2_NMI@vmkernel#nover+0x9 stack: 0x750, 0x750, 0x0, 0x0, 0x0
  0x452a00002f40:[0x420017c9b0c6]gate_entry@vmkernel#nover+0xa7 stack: 0x0, 0x0, 0x0, 0x0, 0x1
  0x453bc321ba28:[0x420017690e24]Power_ArchPerformWait@vmkernel#nover+0xd4 stack: 0x420040001880, 0x0, 0x0, 0x420040000000, 0x420040000000
  0x453bc321ba30:[0x420017690f75]Power_ArchSetCState@vmkernel#nover+0xba stack: 0x0, 0x0, 0x420040000000, 0x420040000000, 0x0
  0x453bc321ba80:[0x420017cd4111]CpuSchedIdleLoopInt@vmkernel#nover+0x292 stack: 0x0, 0x7fffffffffffffff, 0x1, 0x7fffffffffffffff, 0x453be469f100
  0x453bc321baf0:[0x420017cd863c]CpuSchedDispatch@vmkernel#nover+0x1e31 stack: 0x452200000001, 0x420040001040, 0x420040001110, 0x420040001128, 0x420040001040
  0x453bc321bd30:[0x420017cd904e]CpuSchedWait@vmkernel#nover+0x35b stack: 0x8000000000000001, 0x0, 0x101000000000f10, 0x74, 0x190
  0x453bc321bea0:[0x420017cd93ae]CpuSchedTimedWait@vmkernel#nover+0xb7 stack: 0x0, 0x2e45f7a, 0x453b7809bcc0, 0x420040006c38, 0x420017cd4e50
  0x453bc321bf40:[0x420017a707ec]EventQ_TimedWait@vmkernel#nover+0x3d stack: 0x430dfb253550, 0x42001775b924, 0x430dfb2af290, 0x0, 0x430dfb081308
  0x453bc321bf60:[0x42001775b923]HelperQueueFunc@vmkernel#nover+0x364 stack: 0x430dfb081308, 0x453bc321f000, 0x0, 0x0, 0x0
  0x453bc321bfe0:[0x420017cd67b2]CpuSched_StartWorld@vmkernel#nover+0xbf stack: 0x0, 0x420017744cf0, 0x0, 0x0, 0x0
  0x453bc321c000:[0x420017744cef]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0, 0x0, 0x0, 0x0, 0x0
Saved backtrace from: pcpu 0 SpinLock spin out NMI
  0x453bc321ba28:[0x420017690e23]Power_ArchPerformWait@vmkernel#nover+0xd4 stack: 0x420040001880
  0x453bc321ba30:[0x420017690f75]Power_ArchSetCState@vmkernel#nover+0xba stack: 0x0
  0x453bc321ba80:[0x420017cd4111]CpuSchedIdleLoopInt@vmkernel#nover+0x292 stack: 0x0
  0x453bc321baf0:[0x420017cd863c]CpuSchedDispatch@vmkernel#nover+0x1e31 stack: 0x452200000001
  0x453bc321bd30:[0x420017cd904e]CpuSchedWait@vmkernel#nover+0x35b stack: 0x8000000000000001
  0x453bc321bea0:[0x420017cd93ae]CpuSchedTimedWait@vmkernel#nover+0xb7 stack: 0x0
  0x453bc321bf40:[0x420017a707ec]EventQ_TimedWait@vmkernel#nover+0x3d stack: 0x430dfb253550
  0x453bc321bf60:[0x42001775b923]HelperQueueFunc@vmkernel#nover+0x364 stack: 0x430dfb081308
  0x453bc321bfe0:[0x420017cd67b2]CpuSched_StartWorld@vmkernel#nover+0xbf stack: 0x0
  0x453bc321c000:[0x420017744cef]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0

Log messages indicating stuck IO:

cpu107:2098596)NVMEDEV:9363 recover controller 256
cpu107:2098596)NvmeDiscover: 6804: Scan operation 2 received on adapter vmhba1
cpu107:2098596)NvmeDiscover: 4724: Controller nqn.2021-08.com.intel:############## fuseOp 0 oncs 4e cmic 0 nscnt 80
cpu92:2097959)NvmeDeviceIO: 1973: cleanup from TM handler world
cpu92:2097959)NvmeDeviceIO: 150: StuckIoCounter for t10.NVMe____####_Ent_NVMe_#####_MU_U.2_1.6TB________################## : 0. Clearing PSA_STOR_DEVICE_FLAG_STUCK_IO_COND
cpu92:2097959)NvmeUtil: 428: Transient status for command 0x1 set to VMK_TIMEOUT because the timeout has expired: cmdId.initiator=0x430b28edf500 cmdId.serialNumber=0xb0828277)


Backtrace:
Wa(180) vmkwarning: cpu69:2097851)WARNING: Lock: 1660: (held by 0: Spin count exceeded 1 time(s) - possible deadlock.
In(182) vmkernel: cpu69:2097851)0x453a15d9bd90:[0x42000aa23e47]Lock_CheckSpinCount@vmkernel#nover+0x157 stack: 0xffffffffffffffef
In(182) vmkernel: cpu69:2097851)0x453a15d9bde0:[0x42000ab2453c]SP_WaitLock@vmkernel#nover+0xdd stack: 0x1
In(182) vmkernel: cpu69:2097851)0x453a15d9be20:[0x42000ab245fc]SPLockWork@vmkernel#nover+0x29 stack: 0x45e##04d57c0
In(182) vmkernel: cpu69:2097851)0x453a15d9be30:[0x42000aad551d]AsyncPopCallbackFrameInt@vmkernel#nover+0x1e stack: 0x45e##04d57c0
In(182) vmkernel: cpu69:2097851)0x453a15d9be60:[0x42000aef5f8d]PsaNVMe_AsyncTokenIODone@vmkernel#nover+0x76 stack: 0x430b1e6dac00
In(182) vmkernel: cpu69:2097851)0x453a15d9bea0:[0x42000af06095]PsaNvmeDeviceTimeoutHandlerFn@vmkernel#nover+0x3b2 stack: 0x99500000004
In(182) vmkernel: cpu69:2097851)0x453a15d9bf60:[0x42000aefe94d]PsaStorDeviceTimeoutHandlerFn@vmkernel#nover+0x62 stack: 0x0
In(182) vmkernel: cpu69:2097851)0x453a15d9bfa0:[0x42000af7d1eb]PsaStorTaskMgmtWorldFunc@vmkernel#nover+0x8c stack: 0x453a12c9f100
In(182) vmkernel: cpu69:2097851)0x453a15d9bfe0:[0x42000ae2c##9]CpuSched_StartWorld@vmkernel#nover+0xe2 stack: 0x0
In(182) vmkernel: cpu69:2097851)0x453a15d9c000:[0x42000aadbe7f]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
Al(177) vmkalert: cpu0:2098139)ALERT: NMI: 743: NMI IPI: RIPOFF(base):RBP:CS [0x9##98(0x42000aa00000):0x1:0x748] (Src 0x4, CPU0)
In(182) vmkernel: cpu0:2098139)0x453a1c79baf8:[0x42000aa9##97]Power_ArchPerformWait@vmkernel#nover+0xd4 stack: 0x420040001880
In(182) vmkernel: cpu0:2098139)0x453a1c79bb00:[0x42000aa938e9]Power_ArchSetCState@vmkernel#nover+0xba stack: 0x0
In(182) vmkernel: cpu0:2098139)0x453a1c79bb50:[0x42000ae263b1]CpuSchedIdleLoopInt@vmkernel#nover+0x292 stack: 0x0
In(182) vmkernel: cpu0:2098139)0x453a1c79bbc0:[0x42000ae2aa1c]CpuSchedDispatch@vmkernel#nover+0x1f21 stack: 0x453a00000001
In(182) vmkernel: cpu0:2098139)0x453a1c79be00:[0x42000ae2b441]CpuSchedWait@vmkernel#nover+0x362 stack: 0x800000000000006f
In(182) vmkernel: cpu0:2098139)0x453a1c79bf70:[0x42000acdd9ad]NetPollWorldCallback@vmkernel#nover+0x36 stack: 0x453a1bd10##5
In(182) vmkernel: cpu0:2098139)0x453a1c79bfe0:[0x42000ae2c##9]CpuSched_StartWorld@vmkernel#nover+0xe2 stack: 0x0
In(182) vmkernel: cpu0:2098139)0x453a1c79c000:[0x42000aadbe7f]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
In(182) vmkernel: cpu106:2101626)PLOG: PLOG_CleanupDefence:8383: Waiting for issueBarrier for device ########-####-####-####-############
In(182) vmkernel: cpu106:2101626)PLOG: PLOG_CleanupDefence:8383: Waiting for issueBarrier for device ########-####-####-####-############
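The signatures in the excerpts above can be searched for in a collected vmkernel log. The following is a minimal sketch, not an official diagnostic tool: the regular expressions are derived only from the log lines shown in this article, and the label names and the `matches_symptom` heuristic are assumptions made for illustration.

```python
import re
from collections import Counter

# Patterns derived from the log excerpts in this article.
# The dictionary keys are illustrative labels, not VMware terminology.
SIGNATURES = {
    "stuck_io_cleared": re.compile(
        r"StuckIoCounter for .*Clearing PSA_STOR_DEVICE_FLAG_STUCK_IO_COND"
    ),
    "nvme_timeout": re.compile(
        r"Transient status for command .* set to VMK_TIMEOUT"
    ),
    "spin_count_exceeded": re.compile(
        r"Spin count exceeded \d+ time\(s\) - possible deadlock"
    ),
    "nmi_ipi": re.compile(r"NMI IPI:"),
    "plog_barrier_wait": re.compile(
        r"PLOG_CleanupDefence:\d+: Waiting for issueBarrier"
    ),
}


def scan_lines(lines):
    """Count how many lines match each signature."""
    hits = Counter()
    for line in lines:
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits[label] += 1
    return hits


def matches_symptom(hits):
    """Assumed heuristic: stuck-IO handling and a spin-count warning both appear."""
    return hits["stuck_io_cleared"] > 0 and hits["spin_count_exceeded"] > 0


# Example use against a log file (the path is a placeholder):
#   with open("vmkernel.log", errors="replace") as fh:
#       hits = scan_lines(fh)
#   print(dict(hits), matches_symptom(hits))
```

A match only suggests that the log resembles the excerpts above; the PSOD backtrace and the build number should still be compared against this article.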

Environment

8.0 Update 3
8.0 Update 3a
8.0 Update 3b
8.0 Update 3c
8.0 Update 3d


Cause

A command was flagged as "Stuck IO" and later completed. By the time it completed, the objects associated with that command had already been freed (when the RED event was notified), which caused the PSOD.
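The class of race described here, a completion arriving after the command's tracking state has been released, can be illustrated with a toy model. This is a hypothetical sketch for illustration only: `CommandTable` and its methods are invented names, it is not ESXi code, and the article does not describe how the fix in the resolved build is implemented.

```python
class CommandTable:
    """Toy stand-in for per-command tracking state (hypothetical, not ESXi code)."""

    def __init__(self):
        self._pending = {}

    def issue(self, cmd_id, on_done):
        # Record the completion callback for an outstanding command.
        self._pending[cmd_id] = on_done

    def abort_stuck(self, cmd_id):
        # Error handling declares the command stuck and releases its state.
        self._pending.pop(cmd_id, None)

    def complete(self, cmd_id, status):
        # A late completion is looked up by ID. If the state is already gone,
        # the completion is dropped instead of touching released objects.
        on_done = self._pending.pop(cmd_id, None)
        if on_done is None:
            return False
        on_done(status)
        return True


results = []
table = CommandTable()
table.issue(1, results.append)
table.abort_stuck(1)                  # command flagged as stuck, state released
late_ok = table.complete(1, "OK")     # device completes the command afterwards
```

In this model the late completion is detected and ignored. Without such a check, the completion path would dereference state that no longer exists, which is the general hazard the Cause paragraph describes.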


Resolution

This issue is resolved in ESXi 8.0 Update 3e (build number 24674464).
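To check whether a host is below the fixed build, the version string from the PSOD screen (or from the host's version output) can be compared against build 24674464. The sketch below assumes that the version string has the form shown under "Version Details" in this article, that the affected releases all report version 8.0.3, and that build numbers increase with each release in that line. The function names are illustrative.

```python
import re

FIXED_BUILD = 24674464  # ESXi 8.0 Update 3e, per the Resolution section


def parse_version(version_line):
    """Extract (version, build) from e.g. 'VMware ESXi 8.0.3 build-24585383'."""
    match = re.search(r"ESXi (\d+\.\d+\.\d+) build-(\d+)", version_line)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_line!r}")
    return match.group(1), int(match.group(2))


def is_potentially_affected(version_line):
    """Assumption: 8.0.3 builds lower than the fixed build are in scope."""
    version, build = parse_version(version_line)
    return version == "8.0.3" and build < FIXED_BUILD


# The build shown in this article's PSOD screen is below the fixed build.
affected = is_potentially_affected("VMware ESXi 8.0.3 build-24585383")
```

A result of `True` only indicates that the build falls in the range listed under Environment; confirming the issue still requires matching the backtrace and log messages.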

