An ESXi host running virtual desktops using multiple NVIDIA L40 or L4 GPUs experiences a PSOD (purple screen of death).
The following entries may exist in /var/run/log/LogEFI.log on the ESXi host:
2024-05-07T06:06:39.441Z In(14) LogEFI[2101148]: 0000:01:00.0: IOMMU Fault detected for (vmgfx3/nvidia) IOaddr: 0xf0ecee080 Mask: 0x2 Domain: 0x43102925d3e0.
2024-05-07T05:53:06.914Z In(14) LogEFI: cpu2:2098332)0x453ba4e1bda0:[0x420023119b5a]PanicvPanicInt@vmkernel#nover+0x202 stack: 0x453ba4e1bdfc
2024-05-07T05:53:06.914Z In(14) LogEFI: cpu2:2098332)0x453ba4e1be50:[0x42002311a1f8]Panic_NoSave@vmkernel#nover+0x4d stack: 0x453ba4e1beb0
2024-05-07T05:53:06.914Z In(14) LogEFI: cpu2:2098332)0x453ba4e1beb0:[0x4200230fc022]IOMMUProcessFaults@vmkernel#nover+0x313 stack: 0x2
2024-05-07T05:53:06.914Z In(14) LogEFI: cpu2:2098332)0x453ba4e1bf60:[0x4200230f2fbc]HelperQueueFunc@vmkernel#nover+0x19d stack: 0x430e40e01238
2024-05-07T05:53:06.914Z In(14) LogEFI: cpu2:2098332)0x453ba4e1bfe0:[0x42002342c015]CpuSched_StartWorld@vmkernel#nover+0xe2 stack: 0x0
2024-05-07T05:53:06.914Z In(14) LogEFI: cpu2:2098332)0x453ba4e1c000:[0x4200230dbdff]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
2024-05-07T05:53:06.919Z In(14) LogEFI: cpu2:2098332)base fs=0x0 gs=0x420040800000 Kgs=0x0
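To check a copy of LogEFI.log for the fault signature shown above, a short script can help. The following is a minimal sketch, not a supported tool. The function name `find_iommu_faults` and the regular expression are illustrative assumptions derived only from the sample line in this article; other ESXi builds may format the entry differently.

```python
import re

# Pattern derived from the single sample line in this article (an assumption,
# not a documented log format): timestamp, PCI address, owner, I/O address.
IOMMU_FAULT = re.compile(
    r"^(?P<ts>\S+)\s.*?"
    r"(?P<bdf>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]): "
    r"IOMMU Fault detected for \((?P<owner>[^)]*)\) "
    r"IOaddr: (?P<ioaddr>0x[0-9a-fA-F]+)"
)


def find_iommu_faults(lines):
    """Return one dict per line that matches the IOMMU fault signature."""
    faults = []
    for line in lines:
        match = IOMMU_FAULT.search(line)
        if match:
            faults.append(match.groupdict())
    return faults


if __name__ == "__main__":
    import sys

    # Usage: python find_iommu_faults.py /path/to/copy/of/LogEFI.log
    with open(sys.argv[1], errors="replace") as log:
        for fault in find_iommu_faults(log):
            print(fault["ts"], fault["bdf"], fault["owner"], fault["ioaddr"])
```

Each match reports the timestamp, the PCI address of the faulting device, the device owner (for example `vmgfx3/nvidia`), and the I/O address, which can be useful details to include when contacting NVIDIA.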
Related entries may also exist in /var/run/log/vmkernel.log on the ESXi host:
2024-05-07T05:16:01.026Z In(182) vmkernel: cpu205:2104025)NVRM: Xid (PCI:0000:81:00): 119, pid=2104021, name=, Timeout after 4s of waiting for RPC response from GPU4 GSP! Expected function 76 (GSP_RM_CONTROL) (0x20804006 0x208).
2024-05-07T05:16:15.587Z In(182) vmkernel: cpu206:2104025)NVRM: Xid (PCI:0000:81:00): 119, pid=2104021, name=, Timeout after 4s of waiting for RPC response from GPU4 GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080016d 0xc).
2024-05-07T05:16:22.092Z In(182) vmkernel: cpu206:2104025)NVRM: Rate limiting GSP RPC error prints for GPU at PCI:0000:81:00 (printing 1 of every 30). The GPU likely needs to be reset.
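To summarize which GPUs report Xid errors in a copy of vmkernel.log, the entries can be counted per PCI address and Xid code. This is a minimal sketch under the assumption that every Xid entry follows the `NVRM: Xid (PCI:...): <code>,` layout of the sample lines above. The function name `count_xids` is hypothetical.

```python
import re
from collections import Counter

# Matches the "NVRM: Xid (PCI:0000:81:00): 119," portion of the sample lines.
# The layout is inferred from this article's excerpts only.
XID = re.compile(r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),")


def count_xids(lines):
    """Count Xid entries, keyed by (PCI address, Xid code)."""
    counts = Counter()
    for line in lines:
        match = XID.search(line)
        if match:
            counts[(match.group("pci"), int(match.group("xid")))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_xids.py /path/to/copy/of/vmkernel.log
    with open(sys.argv[1], errors="replace") as log:
        for (pci, xid), total in sorted(count_xids(log).items()):
            print(f"PCI:{pci} Xid {xid}: {total} entries")
```

Because the driver rate-limits these messages (the sample shows "printing 1 of every 30"), the counts are a lower bound on the number of errors that occurred.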
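The guest and the GPU can trigger a DMA to an address outside the GPU's IOMMU domain. ESXi detects this as an IOMMU fault and panics, as the IOMMUProcessFaults frame in the backtrace shows.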
This is likely a hardware issue and can be attributed to out-of-date firmware and/or drivers for the NVIDIA GPU cards. Contact NVIDIA for further assistance.
For more information on Xid errors and other details pertinent to this issue, see the following NVIDIA resources:
NVIDIA Native Drivers for VMware ESXi Inbox Drivers Release Notes
Timeout waiting for RPC from GSP!
NVIDIA driver on ESX 6.5 causing PSOD
XID Errors