"PCIe FATAL_ERROR: PCI bus error, undiagnosed" - ESXi 8.0.2 Host Crashes with PSOD When Using NVIDIA BlueField-2 DPUs

Products

VMware vSphere ESXi

Issue/Introduction

ESXi 8.0.2 hosts with NVIDIA BlueField-2 DPU devices may experience intermittent Purple Screen of Death (PSOD) crashes when under load. The PSOD displays the error message: "0000:0c:01.0: PCI bus error, undiagnosed. This may be a hardware problem; please contact your hardware vendor."
Affected hosts function normally when in maintenance mode or without workload, suggesting the issue is triggered by activity on the DPU devices.
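
The device address at the start of the PSOD message uses the usual segment:bus:device.function layout in hexadecimal. As a minimal sketch, the address can be extracted from the message so that it can be compared with the device addresses listed by esxcli hardware pci list. The helper name parse_sbdf is hypothetical and not part of any VMware tool. The reported address may belong to a bridge or port upstream of the DPU rather than to the DPU itself.

```python
import re

# PCI address in segment:bus:device.function (SBDF) form, all hexadecimal.
SBDF_RE = re.compile(
    r"\b([0-9a-fA-F]{4}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])\b"
)

def parse_sbdf(message):
    """Return (segment, bus, device, function) as ints, or None if absent."""
    match = SBDF_RE.search(message)
    if match is None:
        return None
    return tuple(int(group, 16) for group in match.groups())

# Message text taken from the PSOD described above.
psod_line = "0000:0c:01.0: PCI bus error, undiagnosed."
print(parse_sbdf(psod_line))  # -> (0, 12, 1, 0)
```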

The PSOD occurs with a backtrace similar to:

0xADDRESS1:[0xADDRESS2]PanicvPanicInt@vmkernel#nover+0x202 stack: 0xADDRESS3
0xADDRESS4:[0xADDRESS5]Panic_NoSave@vmkernel#nover+0x4d stack: 0xADDRESS6
0xADDRESS6:[0xADDRESS7]PCIEDPC_EDRHandler@vmkernel#nover+0x2cd stack: 0x0
0xADDRESS8:[0xADDRESS9]VMKAcpiPciNotifyHandler@vmkernel#nover+0xc8 stack: 0xADDRESS10
0xADDRESS11:[0xADDRESS12]AcpiEvNotifyDispatch@vmkernel#nover+0x3b stack: 0x0
0xADDRESS13:[0xADDRESS14]AcpiOsExecuteWrapper@vmkernel#nover+0x22 stack: 0xADDRESS15
0xADDRESS16:[0xADDRESS17]HelperQueueFunc@vmkernel#nover+0x19d stack: 0xADDRESS18
0xADDRESS19:[0xADDRESS20]CpuSched_StartWorld@vmkernel#nover+0xe2 stack: 0x0
0xADDRESS21:[0xADDRESS22]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
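
When triaging several crashes, it can help to check whether a captured backtrace follows the same path as the one above. The following is a rough sketch, assuming the backtrace has been saved as plain text with one frame per line in the format shown. The function names frame_symbols and matches_edr_psod are hypothetical, and the choice of the two signature symbols is an assumption based only on the frames in this article.

```python
import re

# Frames from the backtrace above that identify the EDR-triggered panic path.
SIGNATURE = ("PCIEDPC_EDRHandler", "VMKAcpiPciNotifyHandler")

# Captures the symbol between "]" and "@vmkernel" in each frame line.
FRAME_RE = re.compile(r"\]([A-Za-z_][A-Za-z0-9_]*)@vmkernel")

def frame_symbols(backtrace):
    """Return the function names found in the backtrace, in order."""
    return FRAME_RE.findall(backtrace)

def matches_edr_psod(backtrace):
    """True if every signature symbol appears in the backtrace."""
    symbols = frame_symbols(backtrace)
    return all(name in symbols for name in SIGNATURE)

sample = (
    "0xADDRESS6:[0xADDRESS7]PCIEDPC_EDRHandler@vmkernel#nover+0x2cd stack: 0x0\n"
    "0xADDRESS8:[0xADDRESS9]VMKAcpiPciNotifyHandler@vmkernel#nover+0xc8 stack: 0xADDRESS10\n"
)
print(matches_edr_psod(sample))  # -> True
```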

Environment

VMware ESXi 8.0.2 build 23305546
NVIDIA BlueField-2 DPU devices
The error occurs on servers with active workloads

Cause

The PSOD occurs because this ESXi 8.0.2 build does not have the "EsxIoDualDPU" fault survivability service enabled. Without it, ESXi cannot handle a PCIe FATAL_ERROR on the NVIDIA BlueField-2 DPU device gracefully, so the host crashes when such an error is detected. The error is signaled by an Error Disconnect Recover (EDR) notification for the PCIe device, which is consistent with the PCIEDPC_EDRHandler frame in the backtrace.

Resolution

Contact your hardware vendor (NVIDIA and/or server manufacturer):
1. Share the SEL logs and PSOD details with them
2. Request assistance with diagnosing the underlying PCIe hardware error
3. Inquire about possible firmware updates or hardware replacements that might resolve the issue

Collect the System Event Log (SEL) from the server's iDRAC to identify the hardware events that precede the PCIe errors:
1. Log in to the iDRAC web interface
2. Navigate to Diagnostics
3. Click on "Export System Event Log (SEL)"
4. Save the file to your local system

Verify the firmware version of your NVIDIA BlueField-2 DPU and update if needed:
1. Locate the DPU device with esxcli hardware pci list, then check its current firmware version
2. Compare against latest firmware from NVIDIA (version 24.36.7506 or newer recommended)
3. Apply firmware updates if available
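
Firmware versions should be compared numerically, field by field, because a plain string comparison can give the wrong answer (for example, "24.100.1" sorts before "24.36.7506" as text). The following is a minimal sketch, assuming the installed version has already been obtained and is a purely numeric dotted string. The names parse_version and needs_update are hypothetical, and the threshold is the version recommended above.

```python
def parse_version(text):
    """Split a dotted version such as '24.36.7506' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))

# Minimum recommended firmware version stated in this article.
RECOMMENDED = parse_version("24.36.7506")

def needs_update(installed):
    """True if the installed firmware is older than the recommended version."""
    return parse_version(installed) < RECOMMENDED

print(needs_update("24.35.2000"))  # -> True
print(needs_update("24.36.7506"))  # -> False
print(needs_update("24.100.1"))    # -> False
```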

Implement workarounds while waiting for a permanent fix:
1. If possible, reduce the workload on affected hosts to prevent triggering the issue
2. Consider upgrading to newer ESXi versions that support the EsxIoDualDPU fault survivability service
3. Consult with VMware support about the availability of patches for ESXi 8.0.2

When using vSphere Lifecycle Manager (LCM) with DPU-enabled hosts:
1. Be aware that LCM places the host in maintenance mode and reboots it during remediation
2. If LCM fails to place the host in maintenance mode, manually power off all VMs and retry the installation
3. Allow extra time for installation on vSphere Lifecycle Manager-enabled clusters with DPUs due to additional health checks

Important Note: While ESXi 8.0 Update 3e does have the FSS 'EsxIoDualDPU' feature enabled, upgrading to this release alone will not resolve the underlying issue. The EDR (PCIe FATAL_ERROR) is a hardware error originating from the DPU device itself, and will continue to occur regardless of the ESXi version. The EsxIoDualDPU feature provides better error handling but does not address the root hardware problem.

Additional Information

For DPU configuration with NSX and vSphere Lifecycle Manager, refer to the corresponding NSX and vSphere Lifecycle Manager documentation.
When troubleshooting DPU-related issues, always confirm that you are collecting logs from the correct host. Effective diagnosis requires the TSR (Technical Support Report) logs from the host that actually crashed.