PSOD occurs on NVIDIA GPU-based ESXi host with "PCPU X locked up. Failed to ack TLB invalidate"

Article ID: 417686

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

The PSOD screen reports information similar to the following:

PCPU X locked up. Failed to ack TLB invalidate (at least 1 locked up, PCPU(s): X).
PCPU(s) did not respond to NMI. Possible hardware problem; contact hardware vendor.


Before the PSOD, vmkernel.log records that the NVIDIA device was reset and remained unresponsive afterwards:

YYYY-MM-DDTHH:MM:SS.536Z cpu4:2097455)WARNING: PCI: 740: Dev ####:##:##.1 is unresponsive after reset
YYYY-MM-DDTHH:MM:SS.154Z cpu8:2097387)WARNING: PCI: 740: Dev ####:##:##.2 is unresponsive after reset
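
The check for these warnings can be scripted. The following is a minimal Python sketch, not a VMware or NVIDIA tool: it assumes you are working on a copy of vmkernel.log, that the warning text matches the sample lines above, and that timestamps are ISO-style and end in "Z". The function name find_unresponsive_devices is hypothetical.

```python
import re

# Pattern modeled on the sample warnings in this article (assumption:
# real log lines keep the same wording and an ISO-style UTC timestamp).
UNRESPONSIVE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\S+Z)\s+cpu\d+:\d+\)"
    r"WARNING: PCI: \d+: Dev (?P<dev>\S+) is unresponsive after reset"
)


def find_unresponsive_devices(lines):
    """Return (timestamp, PCI address) pairs for every matching warning."""
    hits = []
    for line in lines:
        match = UNRESPONSIVE.search(line)
        if match:
            hits.append((match.group("ts"), match.group("dev")))
    return hits
```

To use it, pass an open file handle for the copied log, for example find_unresponsive_devices(open("vmkernel.log", errors="replace")). Compare the reported PCI addresses with the addresses of the NVIDIA device functions on the host.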

In logEFI.log, located in /var/run/log, you see a backtrace similar to the following:

VMware ESXi 8.0.3 [Releasebuild-24859861 x86_64]
PCPU 48 locked up. Failed to ack TLB invalidate (at least 1 locked up, PCPU(s): 48).
Module(s) involved in panic: [vmkernel Version Releasebuild-24859861]
*PCPU76:2097285/tlbflushcount
PCPU  0: VSVSVVUSVVVVVVVVVSVVVVVSSVSVVVSVSVSSSSVUVSVIVSSSVUIVSVIUSVVIVSVV
PCPU 64: VSVVVVVVSVVSSSSVSVSVVVVVVVVSSVSVSVVSVSVSSSVSSVVSVSIVVISSSVSVUSUS
cpu76:)Code start: VMK uptime: 90:09:45:22.635
cpu76:):[]PanicvPanicInt@vmkernel#nover+0x20c stack: 
cpu76:):[]Panic_NoSave@vmkernel#nover+0x4d stack: 
cpu76:):[]TLBGetLockedCPUBacktraces@vmkernel#nover+0x265 stack: 0x0
cpu76:):[]TLBDoInvalidate@vmkernel#nover+0x240 stack: 
cpu76:):[]TLBFlushCountForceFlush@vmkernel#nover+0xc4 stack: 0x0
cpu76:):[]CpuSched_StartWorld@vmkernel#nover+0xbf stack: 0x0
cpu76:):[]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
cpu76:)base fs=0x0 gs= Kgs=0x0
cpu48:)NMI: 738: NMI IPI: PC , SP  (Src 0x1, CPU48)
cpu76:)Possible hardware problem: 2 PCPU(s) [48,49] did not respond to NMI
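
When comparing several PSODs, it can help to extract the affected PCPU numbers from the panic text. The following is a minimal Python sketch under the assumption that the messages are worded exactly as in the excerpt above. The function name summarize_panic and the dictionary keys are hypothetical.

```python
import re

# Both patterns are modeled on the excerpt in this article (assumption:
# other builds use the same wording).
LOCKED = re.compile(
    r"Failed to ack TLB invalidate \(at least \d+ locked up, "
    r"PCPU\(s\): ([\d, ]+)\)"
)
NO_NMI = re.compile(r"\d+ PCPU\(s\) \[([\d, ]+)\] did not respond to NMI")


def _to_ints(csv):
    """Convert a comma-separated string such as '48,49' to a list of ints."""
    return [int(item) for item in csv.split(",") if item.strip()]


def summarize_panic(text):
    """Return the PCPUs reported as locked up and as not answering the NMI."""
    locked = LOCKED.search(text)
    no_nmi = NO_NMI.search(text)
    return {
        "locked_pcpus": _to_ints(locked.group(1)) if locked else [],
        "nmi_unresponsive_pcpus": _to_ints(no_nmi.group(1)) if no_nmi else [],
    }
```

For the excerpt above, this returns PCPU 48 as locked up and PCPUs 48 and 49 as not responding to the NMI.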

Environment

ESXi 8.0

Cause

This issue is caused by an NVIDIA device becoming unresponsive.

A PCPU was stuck, or was taking a long time, while accessing the PCI configuration space of the NVIDIA device.
The PSOD occurred when the PCPU on the same physical core failed to handle the TLB invalidate request.

Resolution

Contact NVIDIA to determine why the NVIDIA device became unresponsive.

Additional Information

Understanding a "Failed to ack TLB invalidate" purple diagnostic screen