"Health: Miss counters detected" alerts for Mellanox driver

Article ID: 383273

Products

VMware vSphere ESXi

Issue/Introduction

  • On ESXi 8.0.2 and 8.0.3, the following alert is reported for the Mellanox driver:
    nmlx5_QueryNicVportContext:188 command failed: IO was aborted
  • The vmkernel.log shows the following entries at random times, with error code extSynd 0x0000:

<NMLX_ERR> nmlx5_core: 0000:45:00.0: Health: Miss counters detected.
<NMLX_INF> synd 0x0: unrecognized error
<NMLX_INF> extSynd 0x0000
<NMLX_ERR> nmlx5_QueryNicVportContext:188 command failed: IO was aborted
<NMLX_ERR> nmlx5_QueryVportCounter:1851 command failed: IO was aborted

  • Once the error code "extSynd 0x0000" occurs, the uplink status on the host no longer updates to reflect the current state of the upstream link.
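
To check whether a host has hit this signature, the log pattern above can be searched for programmatically. The following is a minimal sketch, assuming the log lines contain the same message text as the excerpt above (real vmkernel.log lines carry a timestamp prefix, so the patterns search within each line). The names `HealthEvent` and `find_health_events` are hypothetical and are not part of any VMware or Mellanox tool.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional

# Patterns copied from the log excerpt in this article. They are searched
# anywhere in the line because real log lines start with a timestamp prefix.
HEALTH_RE = re.compile(r"nmlx5_core: (\S+): Health: Miss counters detected")
EXTSYND_RE = re.compile(r"extSynd (0x[0-9a-fA-F]+)")


class HealthEvent(NamedTuple):
    pci_address: str
    ext_synd: Optional[str]  # None if no extSynd line followed the alert


def find_health_events(lines: Iterable[str]) -> Iterator[HealthEvent]:
    """Pair each 'Miss counters detected' line with the next extSynd line."""
    pending = None  # PCI address of an alert still waiting for its extSynd
    for line in lines:
        health = HEALTH_RE.search(line)
        if health:
            if pending is not None:
                yield HealthEvent(pending, None)
            pending = health.group(1)
            continue
        ext = EXTSYND_RE.search(line)
        if ext and pending is not None:
            yield HealthEvent(pending, ext.group(1).lower())
            pending = None
    if pending is not None:
        yield HealthEvent(pending, None)


# Usage with the entries shown in this article.
SAMPLE = [
    "<NMLX_ERR> nmlx5_core: 0000:45:00.0: Health: Miss counters detected.",
    "<NMLX_INF> synd 0x0: unrecognized error",
    "<NMLX_INF> extSynd 0x0000",
    "<NMLX_ERR> nmlx5_QueryNicVportContext:188 command failed: IO was aborted",
]
for event in find_health_events(SAMPLE):
    print(f"{event.pci_address} extSynd={event.ext_synd}")
```

An event with extSynd 0x0000 matches the issue described in this article. Any other value should be read against the Additional Information section.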

Environment

VMware vSphere ESXi 8.0.2 and 8.0.3

Cause

  • This is a known issue in the nmlx5 health check logic, where the driver incorrectly detects that the NIC is in a faulty state even though the NIC firmware is healthy. The driver then suspends all I/O on the vmnic from the driver side.

Resolution

This issue is resolved in VMware ESXi 8.0U3e (nmlx5_core driver version: 4.23.6.5), and also in the inbox driver for VCF 9.0 (nmlx5_core version: 4.24.0.7).
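
As an illustration of checking a host against the fixed driver version, the sketch below compares an installed nmlx5_core version string with 4.23.6.5. It assumes the version string has already been collected (for example, from the driver information shown by `esxcli network nic get -n vmnicX`), that any build suffix follows the dotted numeric part, and that later driver versions retain the fix. The helper names and the suffixed example string are hypothetical.

```python
import re
from typing import Tuple

# Driver version named in the Resolution section for ESXi 8.0U3e.
FIXED_VERSION = (4, 23, 6, 5)


def parse_driver_version(text: str) -> Tuple[int, ...]:
    """Return the leading dotted-numeric part as a tuple of integers.

    Anything after the numeric part (such as a build suffix) is ignored.
    """
    match = re.match(r"\s*(\d+(?:\.\d+)*)", text)
    if not match:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def has_health_check_fix(installed: str,
                         fixed: Tuple[int, ...] = FIXED_VERSION) -> bool:
    """Assumption: versions at or above the fixed version include the fix."""
    return parse_driver_version(installed) >= fixed


# Usage: the two fixed versions named in this article, plus a hypothetical
# version string with a build suffix.
for version in ("4.23.6.5", "4.24.0.7", "4.23.6.5-1vmw"):
    print(version, has_health_check_fix(version))
```

The comparison is a plain tuple ordering on the numeric components, so it says nothing about vendor-specific builds that may number their releases differently.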

If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.

Workaround:

Currently, there is no workaround to avoid this condition. Once it occurs, a reboot of the ESXi host is required to recover the uplink.

Additional Information

If the error code in the vmkernel.log is extSynd 0x8a02, it indicates that commands from the driver to the firmware are failing. The issue is at the hardware/firmware layer and needs to be investigated further by the NIC vendor.

<NMLX_ERR> nmlx5_core: 0000:c1:00.0: Health: Miss counters detected
<NMLX_INF> Device internal error state is set
<NMLX_INF> assertVar[0] 0x00000000
<NMLX_INF> assertVar[1] 0x00000000
<NMLX_INF> assertVar[2] 0x00000000
<NMLX_INF> assertVar[3] 0x00000000
<NMLX_INF> assertVar[4] 0x00000000
<NMLX_INF> assertExitPtr 0x20a37df8
<NMLX_INF> assertCallra 0x20a3ebcc
<NMLX_INF> firmwareVersion 0x1a2903e9
<NMLX_INF> hwId 0x00000216
<NMLX_INF> iriscIndex 6
<NMLX_INF> synd 0x1: firmware internal error
<NMLX_INF> extSynd 0x8a02
<NMLX_INF> driver 4.23.6.5
<NMLX_INF> nmlx5_core: 0000:c1:00.0: Health: thread is stopped 0x43199284db88
<NMLX_WRN> nmlx5_core: vmnic1: nmlx5_en_UpdateStatsWork - (nmlx5_core_en_main.c:1882) Device internal error state is set! Stop updating
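
When opening a case with the NIC vendor, it can help to pull the identifying fields out of a health dump like the one above. The following is a minimal sketch, assuming the dump lines have the `<NMLX_INF> name value` layout shown in this article. The function names and the wording of the returned guidance are hypothetical, and the mapping covers only the two extSynd values this article discusses.

```python
import re
from typing import Dict, Iterable

# Field names as they appear in the health dump shown in this article.
FIELD_RE = re.compile(
    r"<NMLX_INF>\s+(firmwareVersion|hwId|iriscIndex|synd|extSynd|driver)"
    r"\s+(.+?)\s*$"
)


def parse_health_dump(lines: Iterable[str]) -> Dict[str, str]:
    """Collect the named fields from a health dump, keeping values as text."""
    fields: Dict[str, str] = {}
    for line in lines:
        match = FIELD_RE.search(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


def next_step(fields: Dict[str, str]) -> str:
    """Map the extSynd value to the guidance given in this article."""
    ext_synd = fields.get("extSynd", "").lower()
    if ext_synd == "0x8a02":
        return "hardware/firmware layer: have the NIC vendor investigate"
    if ext_synd == "0x0000":
        return "nmlx5 health check issue: see the Resolution section"
    return "extSynd value not covered by this article"


# Usage with lines taken from the dump above.
DUMP = [
    "<NMLX_INF> Device internal error state is set",
    "<NMLX_INF> firmwareVersion 0x1a2903e9",
    "<NMLX_INF> hwId 0x00000216",
    "<NMLX_INF> synd 0x1: firmware internal error",
    "<NMLX_INF> extSynd 0x8a02",
    "<NMLX_INF> driver 4.23.6.5",
]
summary = parse_health_dump(DUMP)
print(summary)
print(next_step(summary))
```

Values are kept as raw strings on purpose: the sketch does not attempt to decode fields such as firmwareVersion, since their encoding is not described in this article.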