"Health: Miss counters detected" alerts for Mellanox driver

Article ID: 383273

Products

VMware vSphere ESXi

Issue/Introduction

  • On ESXi 8.0.2 and 8.0.3, the following alert is reported for the Mellanox nmlx5 driver:

    nmlx5_QueryNicVportContext:188 command failed: IO was aborted

  • On the ESXi host, /var/log/vmkernel.log shows the following entries at random times, with error code extSynd 0x0000:

    <NMLX_ERR> nmlx5_core: 0000:45:00.0: Health: Miss counters detected
    <NMLX_INF> synd 0x0: unrecognized error
    <NMLX_INF> extSynd 0x0000
    <NMLX_ERR> nmlx5_QueryNicVportContext:188 command failed: IO was aborted
    <NMLX_ERR> nmlx5_QueryVportCounter:1851 command failed: IO was aborted

  • Once the error code "extSynd 0x0000" occurs, the uplink status no longer updates to reflect the current state of the uplink.

Environment

vSphere ESXi 8.0.x

Cause

This is a known issue in the nmlx5 health check logic: the driver incorrectly detects that the NIC is in a faulty state even though the NIC firmware is healthy. The driver then suspends all I/O on the vmnic.

Resolution

This issue is resolved in VMware ESXi 8.0U3e (nmlx5_core driver version: 4.23.6.5) and also in the inbox driver for VCF 9.0 (nmlx5_core version: 4.24.0.7).

Refer to the KB article "Download Broadcom products and software" for guidance on how to navigate and download from the Broadcom download portal.
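
As a rough aid for checking whether a host already runs a fixed driver, the sketch below compares an installed nmlx5_core version string against the two fixed versions named above. The function names are hypothetical, and two assumptions are made that this article does not confirm: that later builds within the same 4.23 or 4.24 branch also carry the fix, and that any build suffix in the version string follows a "-" character. Versions from other branches are reported as unknown rather than guessed.

```python
# Fixed nmlx5_core versions named in this article, keyed by (major, minor) branch.
FIXED_BY_BRANCH = {
    (4, 23): (4, 23, 6, 5),   # ESXi 8.0U3e
    (4, 24): (4, 24, 0, 7),   # VCF 9.0 inbox driver
}


def parse_version(text):
    """Turn a dotted version such as '4.23.6.5' into a tuple of ints.

    Assumption: anything after the first '-' is a build suffix and is ignored.
    """
    dotted = text.strip().split("-", 1)[0]
    return tuple(int(part) for part in dotted.split("."))


def includes_fix(installed):
    """Return True/False for branches named in the article, None otherwise."""
    version = parse_version(installed)
    fixed = FIXED_BY_BRANCH.get(version[:2])
    if fixed is None:
        return None  # branch not covered by this article
    # Assumption: later builds in the same branch keep the fix.
    return version >= fixed


if __name__ == "__main__":
    for candidate in ("4.23.0.66", "4.23.6.5", "4.24.0.7"):
        print(candidate, includes_fix(candidate))
```

The installed version string has to be obtained from the host separately; the values in the loop are sample inputs only.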

Workaround:

Currently, there is no workaround to avoid this condition. Once it occurs, the ESXi host must be rebooted to recover the uplink.

Additional Information

If the error code in /var/log/vmkernel.log on the ESXi host is extSynd 0x8a02 rather than extSynd 0x0000, commands from the driver to the firmware are failing. In that case the issue is at the hardware/firmware layer and must be investigated further by the NIC vendor.

<NMLX_ERR> nmlx5_core: 0000:c1:00.0: Health: Miss counters detected
<NMLX_INF> Device internal error state is set
<NMLX_INF> assertVar[0] 0x00000000
<NMLX_INF> assertVar[1] 0x00000000
<NMLX_INF> assertVar[2] 0x00000000
<NMLX_INF> assertVar[3] 0x00000000
<NMLX_INF> assertVar[4] 0x00000000
<NMLX_INF> assertExitPtr 0x20a37df8
<NMLX_INF> assertCallra 0x20a3ebcc
<NMLX_INF> firmwareVersion 0x1a2903e9
<NMLX_INF> hwId 0x00000216
<NMLX_INF> iriscIndex 6
<NMLX_INF> synd 0x1: firmware internal error
<NMLX_INF> extSynd 0x8a02
<NMLX_INF> driver 4.23.6.5
<NMLX_INF> nmlx5_core: 0000:c1:00.0: Health: thread is stopped 0x43199284db88
<NMLX_WRN> nmlx5_core: vmnic1: nmlx5_en_UpdateStatsWork - (nmlx5_core_en_main.c:1882) Device internal error state is set! Stop updating
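
To tell the two cases apart when reviewing a log, the following sketch scans vmkernel.log lines for "Health: Miss counters detected" events and reports the extSynd value that follows each one. It is a minimal illustration, not a supported tool: the function name and the regular expressions are assumptions based only on the log excerpts in this article, and other log layouts may need different patterns.

```python
import re

# Health alert line; captures the PCI address of the affected NIC.
HEALTH_RE = re.compile(r"nmlx5_core: (\S+): Health: Miss counters detected")
# Extended syndrome line that follows a health alert.
EXTSYND_RE = re.compile(r"extSynd (0x[0-9a-fA-F]+)")


def classify_health_events(lines):
    """Return (pci_address, ext_synd, note) for each health alert found."""
    events = []
    current_pci = None
    for line in lines:
        health = HEALTH_RE.search(line)
        if health:
            current_pci = health.group(1)
            continue
        synd = EXTSYND_RE.search(line)
        if synd and current_pci is not None:
            code = int(synd.group(1), 16)
            if code == 0x0000:
                note = "known driver health check issue (see Resolution)"
            elif code == 0x8A02:
                note = "driver-to-firmware commands failing (contact NIC vendor)"
            else:
                note = "not covered by this article"
            events.append((current_pci, f"0x{code:04x}", note))
            current_pci = None  # wait for the next health alert
    return events


if __name__ == "__main__":
    # Path taken from this article; run on a copy of the log if preferred.
    with open("/var/log/vmkernel.log", errors="replace") as log:
        for pci, ext_synd, note in classify_health_events(log):
            print(pci, ext_synd, note)
```

Each extSynd line is paired with the most recent health alert, so a syndrome line that appears without a preceding alert is ignored.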