The PSOD occurs with a backtrace similar to:
0xADDRESS1:[0xADDRESS2]PanicvPanicInt@vmkernel#nover+0x202 stack: 0xADDRESS3
0xADDRESS4:[0xADDRESS5]Panic_NoSave@vmkernel#nover+0x4d stack: 0xADDRESS6
0xADDRESS6:[0xADDRESS7]PCIEDPC_EDRHandler@vmkernel#nover+0x2cd stack: 0x0
0xADDRESS8:[0xADDRESS9]VMKAcpiPciNotifyHandler@vmkernel#nover+0xc8 stack: 0xADDRESS10
0xADDRESS11:[0xADDRESS12]AcpiEvNotifyDispatch@vmkernel#nover+0x3b stack: 0x0
0xADDRESS13:[0xADDRESS14]AcpiOsExecuteWrapper@vmkernel#nover+0x22 stack: 0xADDRESS15
0xADDRESS16:[0xADDRESS17]HelperQueueFunc@vmkernel#nover+0x19d stack: 0xADDRESS18
0xADDRESS19:[0xADDRESS20]CpuSched_StartWorld@vmkernel#nover+0xe2 stack: 0x0
0xADDRESS21:[0xADDRESS22]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
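To check whether a captured PSOD backtrace matches the signature above, the frame names can be compared programmatically. The following is a minimal sketch, not a supported tool: the helper names (frame_symbols, matches_edr_psod) are hypothetical, and it assumes frames are printed in the "address:[address]Symbol@module#version+offset" layout shown above, innermost frame first.

```python
import re
from typing import List

# Matches the symbol part of one frame, e.g.
# "0x...:[0x...]PCIEDPC_EDRHandler@vmkernel#nover+0x2cd stack: 0x0"
FRAME_RE = re.compile(r"\](?P<symbol>[A-Za-z_][A-Za-z0-9_]*)@(?P<module>[^#\s]+)#")


def frame_symbols(backtrace: str) -> List[str]:
    """Return the function names in the backtrace, in the order printed."""
    symbols = []
    for line in backtrace.splitlines():
        match = FRAME_RE.search(line)
        if match:
            symbols.append(match.group("symbol"))
    return symbols


def matches_edr_psod(backtrace: str) -> bool:
    """True if a panic frame is listed before (i.e. called from) PCIEDPC_EDRHandler."""
    symbols = frame_symbols(backtrace)
    if "PCIEDPC_EDRHandler" not in symbols:
        return False
    edr_index = symbols.index("PCIEDPC_EDRHandler")
    # Frames are assumed innermost first, so the panic frames precede the handler.
    return any(name.startswith("Panic") for name in symbols[:edr_index])


if __name__ == "__main__":
    # Addresses are placeholders, as in the backtrace above.
    sample = """\
0xADDRESS1:[0xADDRESS2]PanicvPanicInt@vmkernel#nover+0x202 stack: 0xADDRESS3
0xADDRESS4:[0xADDRESS5]Panic_NoSave@vmkernel#nover+0x4d stack: 0xADDRESS6
0xADDRESS6:[0xADDRESS7]PCIEDPC_EDRHandler@vmkernel#nover+0x2cd stack: 0x0
0xADDRESS8:[0xADDRESS9]VMKAcpiPciNotifyHandler@vmkernel#nover+0xc8 stack: 0xADDRESS10
"""
    print(matches_edr_psod(sample))
```

The check ignores addresses and offsets on purpose, since those differ between builds; only the order of the function names is compared.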
The PSOD occurs because the ESXi 8.0.2 build does not have the "EsxIoDualDPU" fault survivability service enabled. When a PCIe FATAL_ERROR is detected on the NVIDIA BlueField-2 DPU device, ESXi cannot handle this hardware error gracefully, and the host crashes. The error is triggered by an Error Disconnect Recover (EDR) event received from the PCIe device.
Contact your hardware vendor (NVIDIA and/or server manufacturer):
Share the SEL logs and PSOD details with them
Request assistance with diagnosing the underlying PCIe hardware error
Inquire about possible firmware updates or hardware replacements that might resolve the issue
Collect the System Event Log (SEL) from the server's iDRAC to identify the hardware events that precede the PCIe errors:
Log in to the iDRAC web interface
Navigate to Diagnostics
Click on "Export System Event Log (SEL)"
Save the file to your local system
Verify the firmware version of your NVIDIA BlueField-2 DPU and update if needed:
Run esxcli hardware pci list to locate the device
When using vSphere Lifecycle Manager with DPU-enabled hosts:
Be aware that vSphere Lifecycle Manager places the host in maintenance mode and reboots it during remediation
If vSphere Lifecycle Manager fails to place the host in maintenance mode, manually power off all VMs and retry the installation
Allow extra time for installation on vSphere Lifecycle Manager-enabled clusters with DPUs due to additional health checks
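For the firmware verification step above, the output of esxcli hardware pci list can be long on hosts with many devices. The following is a minimal sketch for narrowing it down, assuming the output has been saved to a text file and uses one unindented PCI address line per device followed by indented "Key: Value" lines, with a "Vendor ID" field. That layout and the helper names (parse_pci_list, find_mellanox_devices) are assumptions and should be checked against the actual output on your host. The vendor ID 0x15b3 is the PCI vendor ID of Mellanox (now NVIDIA), so the filter also returns ConnectX adapters, not only BlueField-2 DPUs, and it does not report the firmware version itself.

```python
from typing import Dict, List

MELLANOX_VENDOR_ID = 0x15B3  # PCI vendor ID for Mellanox (now NVIDIA)


def parse_pci_list(text: str) -> List[Dict[str, str]]:
    """Split saved 'esxcli hardware pci list' output into one dict per device.

    Assumed layout: an unindented address line starts a device record,
    followed by indented 'Key: Value' lines.
    """
    devices: List[Dict[str, str]] = []
    current = None
    for raw in text.splitlines():
        if not raw.strip():
            continue
        if not raw[0].isspace():
            current = {"Address": raw.strip()}
            devices.append(current)
        elif current is not None and ":" in raw:
            # Split on the first colon only, since values may contain colons.
            key, _, value = raw.strip().partition(":")
            current[key.strip()] = value.strip()
    return devices


def find_mellanox_devices(devices: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only devices whose 'Vendor ID' field equals 0x15b3."""
    hits = []
    for dev in devices:
        try:
            vendor_id = int(dev.get("Vendor ID", ""), 16)
        except ValueError:
            continue  # field missing or not a hex number
        if vendor_id == MELLANOX_VENDOR_ID:
            hits.append(dev)
    return hits


if __name__ == "__main__":
    # "pci.txt" is a hypothetical file holding the saved command output.
    with open("pci.txt", encoding="utf-8") as handle:
        for dev in find_mellanox_devices(parse_pci_list(handle.read())):
            print(dev["Address"], dev.get("Device Name", ""))
```

The addresses it prints identify which devices to check with the vendor's firmware tooling.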
Important Note: While ESXi 8.0 Update 3e does have the FSS 'EsxIoDualDPU' feature enabled, upgrading to this release alone will not resolve the underlying issue. The EDR (PCIe FATAL_ERROR) is a hardware error originating from the DPU device itself, and will continue to occur regardless of the ESXi version. The EsxIoDualDPU feature provides better error handling but does not address the root hardware problem.
When troubleshooting DPU-related issues, always ensure that you are collecting logs from the correct hosts. The Technical Support Report (TSR) logs from the crashed host specifically are required for effective diagnosis.