The PSOD screen reports information similar to the following:
PCPU X locked up. Failed to ack TLB invalidate (at least 1 locked up, PCPU(s): X).
PCPU(s) did not respond to NMI. Possible hardware problem; contact hardware vendor.
Before the PSOD, vmkernel.log records warnings that the NVIDIA device is unresponsive after a reset:
YYYY-MM-DDTHH:MM:SS.536Z cpu4:2097455)WARNING: PCI: 740: Dev ####:##:##.1 is unresponsive after reset
YYYY-MM-DDTHH:MM:SS.154Z cpu8:2097387)WARNING: PCI: 740: Dev ####:##:##.2 is unresponsive after reset
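To check a saved copy of vmkernel.log for this signature, a short script along the following lines may help. This is a minimal sketch, not a VMware tool: the function name `find_unresponsive_devices` is hypothetical, and the regular expression assumes the masked device field (`####:##:##.N`) is a hexadecimal segment:bus:device.function address. Adjust the pattern if your log differs.

```python
import re
from typing import List, Tuple

# Assumption: the device field is a hexadecimal PCI address in
# segment:bus:device.function form (the source masks it as ####:##:##.N).
UNRESPONSIVE_RE = re.compile(
    r"^(?P<ts>\S+)\s+cpu(?P<cpu>\d+):\d+\)WARNING: PCI: \d+: "
    r"Dev (?P<dev>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]) "
    r"is unresponsive after reset"
)


def find_unresponsive_devices(log_text: str) -> List[Tuple[str, str]]:
    """Return (timestamp, PCI address) for each matching warning line."""
    hits = []
    for line in log_text.splitlines():
        match = UNRESPONSIVE_RE.match(line.strip())
        if match:
            hits.append((match.group("ts"), match.group("dev")))
    return hits
```

Pass the function the text of the log file (for example, the result of `open(path).read()` on a copy of vmkernel.log). An empty list means no matching warning was found.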
In logEFI.log, located at /var/run/log, you see a backtrace similar to the following:
VMware ESXi 8.0.3 [Releasebuild-24859861 x86_64]
PCPU 48 locked up. Failed to ack TLB invalidate (at least 1 locked up, PCPU(s): 48).
Module(s) involved in panic: [vmkernel Version Releasebuild-24859861]
*PCPU76:2097285/tlbflushcount
PCPU 0: VSVSVVUSVVVVVVVVVSVVVVVSSVSVVVSVSVSSSSVUVSVIVSSSVUIVSVIUSVVIVSVV
PCPU 64: VSVVVVVVSVVSSSSVSVSVVVVVVVVSSVSVSVVSVSVSSSVSSVVSVSIVVISSSVSVUSUS
cpu76:)Code start: VMK uptime: 90:09:45:22.635
cpu76:):[]PanicvPanicInt@vmkernel#nover+0x20c stack:
cpu76:):[]Panic_NoSave@vmkernel#nover+0x4d stack:
cpu76:):[]TLBGetLockedCPUBacktraces@vmkernel#nover+0x265 stack: 0x0
cpu76:):[]TLBDoInvalidate@vmkernel#nover+0x240 stack:
cpu76:):[]TLBFlushCountForceFlush@vmkernel#nover+0xc4 stack: 0x0
cpu76:):[]CpuSched_StartWorld@vmkernel#nover+0xbf stack: 0x0
cpu76:):[]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
cpu76:)base fs=0x0 gs= Kgs=0x0
cpu48:)NMI: 738: NMI IPI: PC , SP (Src 0x1, CPU48)
cpu76:)Possible hardware problem: 2 PCPU(s) [48,49] did not respond to NMI
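The affected PCPUs can be pulled out of a backtrace like the one above with a small parser. The sketch below is illustrative only: the function name `summarize_lockup` is hypothetical, and the patterns are based solely on the two message formats shown in this article ("PCPU N locked up" and "N PCPU(s) [a,b] did not respond to NMI").

```python
import re
from typing import Dict, List

# Patterns derived from the message formats shown in this article.
LOCKED_UP_RE = re.compile(r"PCPU (\d+) locked up")
NMI_RE = re.compile(r"\d+ PCPU\(s\) \[([\d,\s]+)\] did not respond to NMI")


def summarize_lockup(log_text: str) -> Dict[str, List[int]]:
    """Collect the PCPUs reported as locked up and as not responding to NMI."""
    locked = sorted({int(n) for n in LOCKED_UP_RE.findall(log_text)})
    no_nmi = set()
    for group in NMI_RE.findall(log_text):
        # Each group is a comma-separated list such as "48,49".
        no_nmi.update(int(p) for p in group.split(",") if p.strip())
    return {"locked_up": locked, "no_nmi_response": sorted(no_nmi)}
```

For the backtrace shown here, the result would list PCPU 48 as locked up and PCPUs 48 and 49 as not responding to NMI.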
ESXi 8.0
This issue is caused by an NVIDIA device becoming unresponsive.
A PCPU became stuck, or was significantly delayed, while accessing the PCI configuration space of the NVIDIA device.
This resulted in a PSOD when the PCPU on the same physical core failed to acknowledge the TLB invalidate request.
Contact NVIDIA to determine why the NVIDIA device became unresponsive.