Host shows error "DPU experience a failure or DPU removed"
Article ID: 404256
Products
VMware vSphere ESXi
VMware vSphere ESXi 8.0
Issue/Introduction
- On hosts configured with one or more NVIDIA BlueField-2 DPU cards, an alarm is raised for the ESXi host with the message:
DPU experience a failure or DPU removed
- When viewing DPUs, the vmnics are no longer shown under the affected DPU, and the following message is displayed:
vmdpu# has encountered an error.
- The DPU remains down until the host is rebooted.
- Entries similar to the following appear in the vobd log, showing the vmnic going down with Failed criteria: 128:
YYYY-MM-DDT01:23:24.574Z In(14) vobd[2098148]: [netCorrelator] 68932599us: [vob.net.dvport.uplink.transition.down] Uplink: vmnic# is down. Affected dvPort: ########/## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##. 2 uplinks up. Failed criteria: 128
- Entries similar to the following appear in the vmkernel log:
YYYY-MM-DDT00:51:58.011Z In(182) vmkernel: cpu70:2105887)<NMLX_ERR> nmlx5_core: 0000:2a:00.0: Health: PCI error detected
YYYY-MM-DDT00:51:58.011Z In(182) vmkernel: cpu70:2105887)<NMLX_INF> Device internal error state is set
YYYY-MM-DDT00:51:58.012Z In(182) vmkernel: cpu25:2106817)<NMLX_INF> nmlx5_core: 0000:2a:00.1: Health: thread is stopped 0x43232c04e448
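To check collected logs for the signatures shown above, a small script along these lines may help. This is a minimal sketch, not a supported tool: the function name `find_dpu_failure_signatures` and the signature labels are hypothetical, and the patterns are derived only from the sample log lines in this article, so they may need adjusting if the wording differs in your environment.

```python
import re
from typing import Iterable, List, Tuple

# Patterns derived from the sample log lines in this article.
# The dictionary keys are illustrative labels, not official event names.
SIGNATURES = {
    # vobd log: uplink transition down with Failed criteria: 128
    "uplink_down_128": re.compile(
        r"\[vob\.net\.dvport\.uplink\.transition\.down\].*Failed criteria: 128"
    ),
    # vmkernel log: nmlx5_core health check reports a PCI error
    "pci_error": re.compile(r"nmlx5_core: \S+ Health: PCI error detected"),
    # vmkernel log: device internal error state
    "internal_error": re.compile(r"Device internal error state is set"),
}


def find_dpu_failure_signatures(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (label, line) pairs for every line matching a known signature."""
    hits = []
    for line in lines:
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits.append((label, line.rstrip("\n")))
    return hits
```

The function accepts any iterable of strings, so an open file handle for a copy of the vobd or vmkernel log can be passed in directly. A match on all three labels around the same timestamp is consistent with the symptoms described in this article, but it does not by itself confirm the cause.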
Cause
- The eMMC card on the DPU generates a high number of hardware interrupts.
- ESXi cannot handle this volume of hardware interrupts gracefully, which causes the DPU card to crash and leaves the DPU in a failure state.
Resolution
This is a known issue that is resolved in ESXi 9.0 and ESXi 8.0.3 P06.