Mellanox driver "Health: Miss counters detected" alerts on VMware vSphere ESXi

Article ID: 383273

Products

VMware vSphere ESXi

Issue/Introduction

On ESXi 8.0.2 and 8.0.3, the following symptoms occur due to a Mellanox driver timing issue:

  • The vSphere Client reports the alert: nmlx5_QueryNicVportContext:188 command failed: IO was aborted.

  • Uplinks (vmnics) may disappear from the ESXi host or fail to update their status to reflect the current upstream state.

  • The ESXi host may experience a Purple Screen of Death (PSOD).

  • The /var/run/log/vmkernel.log file contains the following entries at random times:

    <NMLX_ERR> nmlx5_core: ####:##:##.#: Health: Miss counters detected
    <NMLX_INF> synd 0x0: unrecognized error
    <NMLX_INF> extSynd 0x0000
    <NMLX_ERR> nmlx5_QueryNicVportContext:188 command failed: IO was aborted
    <NMLX_ERR> nmlx5_QueryVportCounter:1851 command failed: IO was aborted
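
The log signature above can be searched for programmatically. The following is a minimal sketch, not a supported tool: it assumes the log lines have already been read into a list of strings, and the function name `scan_vmkernel_log` and the PCI address in the sample are hypothetical. The patterns are taken directly from the excerpt above.

```python
import re

# The article masks the PCI address as ####:##:##.#, so it is matched loosely.
MISS_COUNTER = re.compile(r"nmlx5_core: (\S+): Health: Miss counters detected")
# Matches e.g. "nmlx5_QueryNicVportContext:188 command failed: IO was aborted".
IO_ABORTED = re.compile(r"(nmlx5_\w+):(\d+) command failed: IO was aborted")


def scan_vmkernel_log(lines):
    """Return (devices, aborted_commands) found in the given log lines.

    devices: PCI addresses that logged "Health: Miss counters detected".
    aborted_commands: driver function names whose command was aborted.
    """
    devices = []
    aborted_commands = []
    for line in lines:
        match = MISS_COUNTER.search(line)
        if match:
            devices.append(match.group(1))
        match = IO_ABORTED.search(line)
        if match:
            aborted_commands.append(match.group(1))
    return devices, aborted_commands


# Sample lines modeled on the excerpt; the PCI address is made up.
sample = [
    "<NMLX_ERR> nmlx5_core: 0000:3b:00.0: Health: Miss counters detected",
    "<NMLX_INF> synd 0x0: unrecognized error",
    "<NMLX_INF> extSynd 0x0000",
    "<NMLX_ERR> nmlx5_QueryNicVportContext:188 command failed: IO was aborted",
    "<NMLX_ERR> nmlx5_QueryVportCounter:1851 command failed: IO was aborted",
]

devices, aborted_commands = scan_vmkernel_log(sample)
print(devices)
print(aborted_commands)
```

On a real host the input would come from /var/run/log/vmkernel.log; a match only shows that the signature is present and does not by itself confirm the root cause.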

Environment

  • vSphere ESXi 8.0.2

  • vSphere ESXi 8.0.3

Cause

This is a known issue caused by a timing-related bug in the nmlx5 health mechanism. The driver health thread does not wait long enough between checks of the device health state, which is updated by the firmware. If consecutive checks occur too rapidly while a firmware update is delayed, the driver incorrectly concludes that the NIC is in a faulty state (a false positive) and suspends all I/O on the affected vmnic.
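
The false positive described above can be illustrated with a toy model. This is only a sketch of the general idea and not the nmlx5 implementation: it assumes the firmware advances a health counter over time and that the driver declares a fault after a fixed number of consecutive polls that see no change. The function names, the miss threshold, and all timings are hypothetical.

```python
def sample_counter(update_times, poll_interval, polls):
    """Counter values seen by a driver that polls every poll_interval time units.

    update_times lists the moments at which the (healthy) firmware
    increments its health counter.
    """
    return [
        sum(1 for t in update_times if t <= i * poll_interval)
        for i in range(polls)
    ]


def detect_fault(samples, max_misses):
    """Return True if the counter is unchanged for max_misses consecutive polls."""
    misses = 0
    previous = samples[0]
    for current in samples[1:]:
        if current == previous:
            misses += 1
            if misses >= max_misses:
                return True
        else:
            misses = 0
        previous = current
    return False


# Healthy firmware whose fourth update is delayed (t=7 instead of t=4).
updates = [1, 2, 3, 7, 8, 9]

# Polling every 1 unit sees three unchanged readings in a row during the delay.
fast = detect_fault(sample_counter(updates, poll_interval=1, polls=10), max_misses=3)
# Polling every 2 units over the same period sees only one unchanged reading.
slow = detect_fault(sample_counter(updates, poll_interval=2, polls=5), max_misses=3)

print(fast, slow)
```

In this model the device is healthy in both runs; only the shorter wait between checks turns a delayed update into a reported fault.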

Resolution

This issue is fixed in ESXi 8.0 Update 3e (build 24674464) and later. That release includes nmlx5_core driver version 4.23.6.5. The fix is also included in the inbox driver for VCF 9.0 (nmlx5_core version 4.24.0.7).

See Download Broadcom products and software for steps to download these releases.

Note: There is no driver-level configuration workaround to avoid this condition. Once the error occurs, reboot the ESXi host to recover the uplink.
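
The driver versions named in this section can be compared against the version installed on a host. The sketch below assumes the installed nmlx5_core version is already available as a plain dotted string and that later inbox driver versions retain the fix; the function names and the older version used in the example are hypothetical.

```python
# First inbox nmlx5_core version that contains the fix, per this article.
FIXED_DRIVER = (4, 23, 6, 5)


def parse_version(text):
    """Turn a dotted version string such as "4.23.6.5" into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def driver_has_fix(driver_version):
    """Return True if the version is at or above the first fixed version.

    Assumption: versions are ordered numerically, field by field.
    """
    return parse_version(driver_version) >= FIXED_DRIVER


print(driver_has_fix("4.23.6.5"))   # version shipped with ESXi 8.0 Update 3e
print(driver_has_fix("4.24.0.7"))   # inbox driver version for VCF 9.0
print(driver_has_fix("4.23.0.66"))  # hypothetical older version
```

The comparison applies to the dotted version number only; any suffix that a package listing appends to the version would need to be stripped first.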

Additional Information

If /var/run/log/vmkernel.log on the ESXi host shows the error code extSynd 0x8a02, as in the following example, commands from the driver to the firmware are failing. In that case the issue is at the hardware/firmware layer and must be investigated further by the NIC vendor.

<NMLX_ERR> nmlx5_core: ####:##:##.#: Health: Miss counters detected
<NMLX_INF> Device internal error state is set
<NMLX_INF> assertVar[0] 0x00000000
<NMLX_INF> assertVar[1] 0x00000000
<NMLX_INF> assertVar[2] 0x00000000
<NMLX_INF> assertVar[3] 0x00000000
<NMLX_INF> assertVar[4] 0x00000000
<NMLX_INF> assertExitPtr 0x20a37df8
<NMLX_INF> assertCallra 0x20a3ebcc
<NMLX_INF> firmwareVersion 0x1a2903e9
<NMLX_INF> hwId 0x00000216
<NMLX_INF> iriscIndex 6
<NMLX_INF> synd 0x1: firmware internal error
<NMLX_INF> extSynd 0x8a02
<NMLX_INF> driver 4.23.6.5
<NMLX_INF> nmlx5_core: ####:##:##.#: Health: thread is stopped 0x43199284db88
<NMLX_WRN> nmlx5_core: vmnic1: nmlx5_en_UpdateStatsWork - (nmlx5_core_en_main.c:1882) Device internal error state is set! Stop updating
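
The two log signatures in this article differ in their synd and extSynd values, so a captured health report can be sorted into one case or the other. This is a minimal sketch: the function name `classify_health_report` and the returned labels are hypothetical, and the mapping of synd 0x0 with extSynd 0x0000 to the driver timing issue is inferred from the excerpt in the Issue/Introduction section. Any other combination is reported as unknown.

```python
import re

# "synd" is matched case-sensitively so that it does not match inside "extSynd".
SYND = re.compile(r"\bsynd (0x[0-9a-fA-F]+)")
EXT_SYND = re.compile(r"\bextSynd (0x[0-9a-fA-F]+)")


def classify_health_report(lines):
    """Classify one health report by its synd and extSynd values."""
    synd = None
    ext_synd = None
    for line in lines:
        match = SYND.search(line)
        if match:
            synd = int(match.group(1), 16)
        match = EXT_SYND.search(line)
        if match:
            ext_synd = int(match.group(1), 16)
    if ext_synd == 0x8A02:
        return "hardware/firmware layer: contact NIC vendor"
    if synd == 0x0 and ext_synd == 0x0:
        return "matches driver timing issue signature"
    return "unknown"


timing_report = [
    "<NMLX_INF> synd 0x0: unrecognized error",
    "<NMLX_INF> extSynd 0x0000",
]
firmware_report = [
    "<NMLX_INF> synd 0x1: firmware internal error",
    "<NMLX_INF> extSynd 0x8a02",
]

print(classify_health_report(timing_report))
print(classify_health_report(firmware_report))
```

The function expects the lines of a single report; if a log contains several reports, they would need to be split before classification.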

Japanese KB: 「Health: Miss counters detected」Mellanoxドライバーのアラートについて (About the Mellanox driver "Health: Miss counters detected" alert)