On ESXi 8.0.2 and 8.0.3, the following alert is reported for the Mellanox driver:
"nmlx5_QueryNicVportContext:188 command failed: IO was aborted"
The vmkernel.log shows the following entries at random times, with error code extSynd 0x0000:
<NMLX_ERR> nmlx5_core: 0000:45:00.0: Health: Miss counters detected.
<NMLX_INF> synd 0x0: unrecognized error
<NMLX_INF> extSynd 0x0000
<NMLX_ERR> nmlx5_QueryNicVportContext:188 command failed: IO was aborted
<NMLX_ERR> nmlx5_QueryVportCounter:1851 command failed: IO was aborted
This is a known bug in the nmlx5 health mechanism logic, where the driver incorrectly detects that the NIC is in a faulty state.
The issue is fixed in ESXi 8.0 Patch 05 (nmlx5 version 4.23.6.5) and in the inbox driver for VCF 9.0 (nmlx5 version 4.24.0.7).
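To check whether a host already runs a driver that contains the fix, the installed nmlx5 version can be compared against 4.23.6.5. The following is a minimal sketch in Python, not a VMware or NVIDIA tool. The function names `parse_version` and `has_health_fix` are hypothetical, and the input is assumed to be a bare dotted version string with no build suffix.

```python
def parse_version(text):
    """Convert a dotted version string such as '4.23.6.5' into a tuple of ints.

    Assumption: the string contains only dot-separated integers.
    """
    return tuple(int(part) for part in text.strip().split("."))


# First nmlx5 version on ESXi 8.0 that contains the fix, per this article.
FIXED_VERSION = parse_version("4.23.6.5")


def has_health_fix(installed):
    """Return True if the installed nmlx5 version is at or above the fixed one."""
    # Tuples compare element by element, so 4.24.0.7 > 4.23.6.5.
    return parse_version(installed) >= FIXED_VERSION
```

Comparing integer tuples rather than raw strings avoids ordering mistakes when a component has a different number of digits (for example, 4.9 versus 4.23).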
If the error code in the vmkernel.log is extSynd 0x8a02, commands from the driver to the firmware are failing. In this case the issue is at the hardware/firmware layer and must be investigated further by the NIC vendor. The vmkernel.log then shows entries similar to the following:
<NMLX_ERR> nmlx5_core: 0000:c1:00.0: Health: Miss counters detected
<NMLX_INF> Device internal error state is set
<NMLX_INF> assertVar[0] 0x00000000
<NMLX_INF> assertVar[1] 0x00000000
<NMLX_INF> assertVar[2] 0x00000000
<NMLX_INF> assertVar[3] 0x00000000
<NMLX_INF> assertVar[4] 0x00000000
<NMLX_INF> assertExitPtr 0x20a37df8
<NMLX_INF> assertCallra 0x20a3ebcc
<NMLX_INF> firmwareVersion 0x1a2903e9
<NMLX_INF> hwId 0x00000216
<NMLX_INF> iriscIndex 6
<NMLX_INF> synd 0x1: firmware internal error
<NMLX_INF> extSynd 0x8a02
<NMLX_INF> driver 4.23.6.5
<NMLX_INF> nmlx5_core: 0000:c1:00.0: Health: thread is stopped 0x43199284db88
<NMLX_WRN> nmlx5_core: vmnic1: nmlx5_en_UpdateStatsWork - (nmlx5_core_en_main.c:1882) Device internal error state is set! Stop updating
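The two cases described in this article can be told apart by the extSynd value in the health dump. The following is a minimal sketch in Python that scans log lines and labels each extSynd value accordingly. The names `classify_ext_synd` and `scan_lines` are hypothetical, the line format is assumed to match the excerpts above, and any value other than 0x0000 or 0x8a02 is reported as not covered here rather than interpreted.

```python
import re

# Matches the "extSynd 0x...." field as it appears in the excerpts above.
EXT_SYND_RE = re.compile(r"extSynd\s+(0x[0-9a-fA-F]+)")


def classify_ext_synd(value):
    """Map an extSynd hex string to the case described in this article."""
    code = int(value, 16)
    if code == 0x0000:
        return "known nmlx5 health mechanism bug (update the driver)"
    if code == 0x8A02:
        return "driver-to-firmware command failure (contact the NIC vendor)"
    return "not covered by this article"


def scan_lines(lines):
    """Return a (extSynd value, classification) pair for each matching line."""
    results = []
    for line in lines:
        match = EXT_SYND_RE.search(line)
        if match:
            results.append((match.group(1), classify_ext_synd(match.group(1))))
    return results
```

For example, passing the lines of a saved copy of vmkernel.log to `scan_lines` yields one entry per health dump, which makes it easier to see whether a host is hitting the driver bug, the firmware failure, or both.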