"Health: Miss counters detected" alerts for Mellanox driver on ESXi 8.0.2 and 8.0.3

Article ID: 383273


Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

On ESXi 8.0.2 and 8.0.3, the following alert is reported for the Mellanox driver:

"nmlx5_QueryNicVportContext:188 command failed: IO was aborted"

The vmkernel.log file shows the following entries at random times, with error code extSynd 0x0000:

<NMLX_ERR> nmlx5_core: 0000:45:00.0: Health: Miss counters detected.
<NMLX_INF> synd 0x0: unrecognized error
<NMLX_INF> extSynd 0x0000
<NMLX_ERR> nmlx5_QueryNicVportContext:188 command failed: IO was aborted
<NMLX_ERR> nmlx5_QueryVportCounter:1851 command failed: IO was aborted

Environment

  • VMware vSphere ESXi 8.x

Cause

This is a known bug in the nmlx5 health mechanism logic, where the driver incorrectly detects that the NIC is in a faulty state.

Resolution

This issue is fixed in ESXi 8.0 patch 05 (nmlx5 version 4.23.6.5) and in the inbox driver for VCF 9.0 (nmlx5 version 4.24.0.7).
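
To check whether a host already runs a driver that contains the fix, the installed nmlx5 version can be compared against the fixed version listed above. The following is a minimal Python sketch, not an official tool. The function names are hypothetical, and it assumes the installed version string begins with a dotted-numeric version (possibly followed by a build suffix) and that later versions keep the fix.

```python
import re

# Fixed nmlx5 version for ESXi 8.0, taken from the Resolution text.
FIXED_ESXI_80 = (4, 23, 6, 5)


def parse_version(text):
    """Return the leading dotted-numeric part of a version string as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def has_fix(installed, fixed=FIXED_ESXI_80):
    """True if the installed version is at or above the fixed version.

    Assumption: versions order numerically, component by component.
    """
    return parse_version(installed) >= fixed
```

For example, has_fix("4.24.0.7") evaluates to True, while a version whose leading components are lower than 4.23.6.5 evaluates to False.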

Additional Information

If the error code in vmkernel.log is extSynd 0x8a02, commands from the driver to the firmware are failing. In that case the issue is at the hardware/firmware layer and must be investigated further by the NIC vendor.

<NMLX_ERR> nmlx5_core: 0000:c1:00.0: Health: Miss counters detected
<NMLX_INF> Device internal error state is set
<NMLX_INF> assertVar[0] 0x00000000
<NMLX_INF> assertVar[1] 0x00000000
<NMLX_INF> assertVar[2] 0x00000000
<NMLX_INF> assertVar[3] 0x00000000
<NMLX_INF> assertVar[4] 0x00000000
<NMLX_INF> assertExitPtr 0x20a37df8
<NMLX_INF> assertCallra 0x20a3ebcc
<NMLX_INF> firmwareVersion 0x1a2903e9
<NMLX_INF> hwId 0x00000216
<NMLX_INF> iriscIndex 6
<NMLX_INF> synd 0x1: firmware internal error
<NMLX_INF> extSynd 0x8a02
<NMLX_INF> driver 4.23.6.5
<NMLX_INF> nmlx5_core: 0000:c1:00.0: Health: thread is stopped 0x43199284db88
<NMLX_WRN> nmlx5_core: vmnic1: nmlx5_en_UpdateStatsWork - (nmlx5_core_en_main.c:1882) Device internal error state is set! Stop updating
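
The two extSynd cases described in this article can be told apart by scanning the log. The following is a minimal Python sketch under stated assumptions: the function name and verdict strings are hypothetical, and the patterns are derived only from the sample log lines shown in this article, so real vmkernel.log lines with extra prefixes or different wording may need adjusted patterns.

```python
import re

# Patterns based on the sample log lines in this article (an assumption).
HEALTH_RE = re.compile(r"nmlx5_core: (\S+): Health: Miss counters detected")
EXTSYND_RE = re.compile(r"extSynd (0x[0-9a-fA-F]+)")


def classify_health_events(lines):
    """Pair each 'Miss counters detected' line with the next extSynd value."""
    events = []
    device = None
    for line in lines:
        health = HEALTH_RE.search(line)
        if health:
            device = health.group(1)  # PCI address of the reporting NIC
            continue
        ext = EXTSYND_RE.search(line)
        if ext and device is not None:
            code = int(ext.group(1), 16)
            if code == 0x0000:
                verdict = "known nmlx5 health-check bug"
            elif code == 0x8A02:
                verdict = "driver-to-firmware command failure; engage NIC vendor"
            else:
                verdict = "not covered by this article"
            events.append((device, f"0x{code:04x}", verdict))
            device = None  # wait for the next health event
    return events
```

Passing the lines of a log file to classify_health_events returns one (device, extSynd, verdict) tuple per health event, which makes it easier to see whether a host is hitting the driver bug or a firmware-level failure.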