ESXi 7.0 or 8.0 experienced a PSOD.
In the vmkernel log the backtrace is similar to :
[YYYY-MM-DDTHH:MM:SS] cpu70:2097222)WARNING: Heartbeat: 827: PCPU 120 didn't have a heartbeat for 7 seconds, timeout is 14, 1 IPIs sent; *may* be locked up.
[YYYY-MM-DDTHH:MM:SS] cpu120:2099575)ALERT: NMI: 710: NMI IPI: RIPOFF(base):RBP:CS [0x105483(0x420002200000):0x45d9c81ba8c0:0xf48] (Src 0x1, CPU120)
[YYYY-MM-DDTHH:MM:SS] cpu120:2099575)0x45398819bd48:[0x420002305482]MCSUnlockWork@vmkernel#nover+0x2b stack: 0x45d9c81ba750
[YYYY-MM-DDTHH:MM:SS] cpu120:2099575)0x45398819bd50:[0x420002670693]SCSIHandleRmwReadComplete@vmkernel#nover+0x4e0 stack: 0x1dc31a
[YYYY-MM-DDTHH:MM:SS] cpu120:2099575)0x45398819bdf0:[0x42000266f396]SCSICompletePathCommand@vmkernel#nover+0x287 stack: 0x0
[YYYY-MM-DDTHH:MM:SS] cpu120:2099575)0x45398819beb0:[0x42000265b0fe]SCSICompleteAdapterCommand@vmkernel#nover+0x46b stack: 0x0
[YYYY-MM-DDTHH:MM:SS] cpu120:2099575)0x45398819bf60:[0x4200030972bb]ioComp_wld@(lsi_mr3)#<None>+0x5c stack: 0x431162c03900
[YYYY-MM-DDTHH:MM:SS] cpu120:2099575)0x45398819bf90:[0x42000231dfd0]vmkWorldFunc@vmkernel#nover+0x49 stack: 0x42000231dfcc
[YYYY-MM-DDTHH:MM:SS] cpu120:2099575)0x45398819bfe0:[0x4200025b33e1]CpuSched_StartWorld@vmkernel#nover+0x86 stack: 0x0
[YYYY-MM-DDTHH:MM:SS] cpu120:2099575)0x45398819c000:[0x4200022c4b4f]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
[YYYY-MM-DDTHH:MM:SS] cpu187:2105893)WARNING: Lock: 1658: (held by 120: Spin count exceeded 1 time(s) - possible deadlock.
[YYYY-MM-DDTHH:MM:SS] cpu187:2105893)0x45399339ad60:[0x42000221b945]Lock_CheckSpinCount@vmkernel#nover+0x23d stack: 0x0
[YYYY-MM-DDTHH:MM:SS] cpu187:2105893)0x45399339adb0:[0x420002304838]MCSLockSpin@vmkernel#nover+0x69 stack: 0x43097c2314c8
[YYYY-MM-DDTHH:MM:SS] cpu187:2105893)0x45399339ade0:[0x420002304e6e]MCSLockWait@vmkernel#nover+0x153 stack: 0x43097c176440
[YYYY-MM-DDTHH:MM:SS] cpu187:2105893)0x45399339ae00:[0x4200023053bd]MCSLockWork@vmkernel#nover+0x2a stack: 0x431162e8c000
[YYYY-MM-DDTHH:MM:SS] cpu187:2105893)0x45399339ae10:[0x42000266dbc4]SCSIStartPathCommands@vmkernel#nover+0x4b9 stack: 0x43097c231ca0
[YYYY-MM-DDTHH:MM:SS] cpu187:2105893)0x45399339af40:[0x42000267378a]SCSIIssueAsyncPathCommandDirect@vmkernel#nover+0x2ab stack: 0x43097c2bcd00
[YYYY-MM-DDTHH:MM:SS] cpu187:2105893)0x45399339b040:[0x4200026756ff]vmk_ScsiIssueAsyncPathCommandDirect@vmkernel#nover+0x1c stack: 0x45d9c80ac3c0
[YYYY-MM-DDTHH:MM:SS] cpu187:2105893)0x45399339b060:[0x420002fedb43]HppIssueCommand@(hpp)#<None>+0x108 stack: 0x45396f81f000
[YYYY-MM-DDTHH:MM:SS] cpu187:2105893)0x45399339b0f0:[0x420002fee0cd]HppDeviceStartCommandProcessing@(hpp)#<None>+0x18a stack: 0x42005f800080
[YYYY-MM-DDTHH:MM:SS] cpu187:2105893)0x45399339b560:[0x420002608d3e]ScsiPluginDeviceStartCommand@vmkernel#nover+0x4f stack: 0x28
[YYYY-MM-DDTHH:MM:SS] cpu187:2105893)0x45399339b5a0:[0x42000262d922]SCSIStartDeviceInt@vmkernel#nover+0x37 stack: 0x45d9c80abf88
[YYYY-MM-DDTHH:MM:SS] cpu187:2105893)0x45399339b5c0:[0x420002634b4a]SCSIDirectSubmitOrQueueCommand@vmkernel#nover+0x19f stack: 0xaaaaaaaaaaab
[YYYY-MM-DDTHH:MM:SS] cpu187:2105893)0x45399339b630:[0x4200026353b8]SCSIAsyncDeviceCommand@vmkernel#nover+0x295 stack: 0x109
[YYYY-MM-DDTHH:MM:SS] cpu187:2105893)0x45399339b700:[0x42000261ea7a]PsaScsi_AsyncIO@vmkernel#nover+0x43 stack: 0x43097c32ad10
[YYYY-MM-DDTHH:MM:SS] cpu187:2105893)0x45399339b740:[0x42000224c3b6]FDS_AsyncIO@vmkernel#nover+0x6a3 stack: 0x45d9919e5a58
[YYYY-MM-DDTHH:MM:SS] cpu187:2105893)0x45399339b7e0:[0x4200033eae0c]Fil6_HandleReadIO@esx#nover+0x135 stack: 0x100000
[YYYY-MM-DDTHH:MM:SS] cpu187:2105893)0x45399339b840:[0x4200033edb7b]Fil6_FileIOInt@esx#nover+0xe9c stack: 0x43139e1401f0
[YYYY-MM-DDTHH:MM:SS] cpu187:2105893)0x45399339b9b0:[0x4200033eef50]Fil6_FileIOIntWithRetry@esx#nover+0x55 stack: 0x1
[YYYY-MM-DDTHH:MM:SS] cpu187:2105893)0x45399339ba50:[0x4200033ef2ab]Fil6_FileIO@esx#nover+0x138 stack: 0x453994c1f000
[YYYY-MM-DDTHH:MM:SS] cpu187:2105893)0x45399339bb50:[0x42000223b5a7]FSSVec_FileIO@vmkernel#nover+0x20 stack: 0x430bdf360550
[YYYY-MM-DDTHH:MM:SS] cpu120:2099575)ALERT: NMI: 710: NMI IPI: RIPOFF(base):RBP:CS [0x105483(0x420002200000):0x45d9c81ba8c0:0xf48] (Src 0x4, CPU120)
[YYYY-MM-DDTHH:MM:SS] cpu120:2099575)0x45398819bd48:[0x420002305482]MCSUnlockWork@vmkernel#nover+0x2b stack: 0x45d9c81ba750
[YYYY-MM-DDTHH:MM:SS] cpu120:2099575)0x45398819bd50:[0x420002670693]SCSIHandleRmwReadComplete@vmkernel#nover+0x4e0 stack: 0x1dc31a
[YYYY-MM-DDTHH:MM:SS] cpu120:2099575)0x45398819bdf0:[0x42000266f396]SCSICompletePathCommand@vmkernel#nover+0x287 stack: 0x0
[YYYY-MM-DDTHH:MM:SS] cpu120:2099575)0x45398819beb0:[0x42000265b0fe]SCSICompleteAdapterCommand@vmkernel#nover+0x46b stack: 0x0
[YYYY-MM-DDTHH:MM:SS] cpu120:2099575)0x45398819bf60:[0x4200030972bb]ioComp_wld@(lsi_mr3)#<None>+0x5c stack: 0x431162c03900
[YYYY-MM-DDTHH:MM:SS] cpu120:2099575)0x45398819bf90:[0x42000231dfd0]vmkWorldFunc@vmkernel#nover+0x49 stack: 0x42000231dfcc
[YYYY-MM-DDTHH:MM:SS] cpu120:2099575)0x45398819bfe0:[0x4200025b33e1]CpuSched_StartWorld@vmkernel#nover+0x86 stack: 0x0
[YYYY-MM-DDTHH:MM:SS] cpu120:2099575)0x45398819c000:[0x4200022c4b4f]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
In PSA RMW code, the write operation is submitted as part of read completion.
When submitting the write request, if we have other requests that we service in issue path or there is a delay due to lock contention, the write operation can be completed before the controller returns to the issuer and here when we access the frame that could have got freed or reused, we can hit into PSOD.
The fix is to not reuse the IO frame as the IO can complete before the control returns back to the user.
There is no workaround.
This issue is fixed in the following version of ESXi:
Version | Release Name | Release Date | Build Number |
ESXi 7.0 P09 | ESXi 7.0 Update 3q | 2024/05/21 | 23794027 |
ESXi 8.0 U2 P03 | ESXi 8.0 Update 2b | 2024/02/29 | 23305546 |