Host shows error "DPU experience a failure or DPU removed"

Article ID: 404256

Updated On:

Products

VMware vSphere ESXi
VMware vSphere ESXi 8.0

Issue/Introduction

  • On hosts configured with one or more NVIDIA BlueField-2 DPU cards, an alarm is raised for the ESXi host with the message:

DPU experience a failure or DPU removed

  • When viewing DPUs, vmnics are no longer shown under the affected DPU, and the following message is displayed:

vmdpu# has encountered an error.

  • The DPU remains down until the host is rebooted.
  • Entries similar to the following appear in the vobd logs, showing the vmnic going down with Failed criteria: 128:

YYYY-MM-DDT01:23:24.574Z In(14) vobd[2098148]:  [netCorrelator] 68932599us: [vob.net.dvport.uplink.transition.down] Uplink: vmnic# is down. Affected dvPort: ########/## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##. 2 uplinks up. Failed criteria: 128

  • Entries similar to the following appear in the vmkernel logs:

YYYY-MM-DDT00:51:58.011Z In(182) vmkernel: cpu70:2105887)<NMLX_ERR> nmlx5_core: 0000:2a:00.0: Health: PCI error detected
YYYY-MM-DDT00:51:58.011Z In(182) vmkernel: cpu70:2105887)<NMLX_INF> Device internal error state is set
YYYY-MM-DDT00:51:58.012Z In(182) vmkernel: cpu25:2106817)<NMLX_INF> nmlx5_core: 0000:2a:00.1: Health: thread is stopped 0x43232c04e448
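
The log signatures above can be checked for with a short script. The following is a minimal sketch, not a supported tool: it assumes the vobd and vmkernel logs have been copied off the host (for example from a support bundle) and are passed as file paths on the command line. The function name and signature labels are illustrative, and the patterns are derived only from the sample entries shown in this article.

```python
import re
import sys

# Patterns derived from the sample log entries in this article.
# The dictionary keys are illustrative labels, not VMware terminology.
SIGNATURES = {
    "uplink_down_criteria_128": re.compile(
        r"vob\.net\.dvport\.uplink\.transition\.down.*Failed criteria: 128"
    ),
    "nmlx5_pci_error": re.compile(r"nmlx5_core: .*Health: PCI error detected"),
    "nmlx5_internal_error": re.compile(r"Device internal error state is set"),
}


def find_dpu_failure_signatures(lines):
    """Return a dict mapping each signature label to the log lines that match it."""
    hits = {name: [] for name in SIGNATURES}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits[name].append(line.rstrip("\n"))
    return hits


if __name__ == "__main__":
    # Each argument is assumed to be a path to a copied vobd or vmkernel log file.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            results = find_dpu_failure_signatures(handle)
        for name, matched in results.items():
            print(f"{path}: {name}: {len(matched)} match(es)")
```

A match on the uplink-down pattern together with the nmlx5_core health errors is consistent with the symptoms described here, but it does not by itself confirm the cause.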

Environment

VMware vSphere ESXi 8.0

Cause

  • The eMMC card on the DPU generates a high number of hardware interrupts.
  • ESXi cannot handle this volume of hardware interrupts gracefully, so the DPU card crashes and enters a failure state.

Resolution

This is a known issue that is resolved in ESXi 9.0 and ESXi 8.0.3 P06.