ESXi hosts "not responding" due to bnxtnet driver "hot reset"

Article ID: 435236

Products

VMware vSphere ESXi

Issue/Introduction

  • VMware ESXi hosts enter a "Not Responding" state in vCenter Server.
  • An examination of vobd.log and vmkernel.log indicates a concurrent link-down failure across all vmnics. This triggers a bnxtnet driver "hot reset", causing an outage for both Management and Workload network traffic.
  • Key log identifiers reported on ESXi host:
    • /var/run/log/vobd.log: [vob.net.vmnic.linkstate.down] vmnic vmnicX linkstate down and Failed criteria: 128.

    • /var/run/log/vmkernel.log: bnxtnet: bnxtnet_handle_hot_reset: <Task_ID> Hot reset async evt: evt_data1 = 0x301.

    • /var/run/log/vmkernel.log: bnxtnet_hot_reset_task: <Task_ID> Waiting for 1300 msecs for firmware reset completion.
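The three identifiers above can be checked together when reviewing copies of vobd.log and vmkernel.log. The following is a minimal sketch, assuming the log lines contain the strings exactly as quoted in this article. The function names and the `SIGNATURES` table are illustrative and are not part of any VMware or Broadcom tool.

```python
import re

# Patterns built only from the log identifiers quoted in this article.
SIGNATURES = {
    "link_down": re.compile(r"vob\.net\.vmnic\.linkstate\.down"),
    "hot_reset_event": re.compile(r"bnxtnet_handle_hot_reset.*Hot reset async evt"),
    "hot_reset_wait": re.compile(r"bnxtnet_hot_reset_task.*Waiting for \d+ msecs"),
}


def count_signatures(lines):
    """Count how many lines match each signature."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def looks_like_hot_reset_outage(counts):
    """True only when all three signatures were seen at least once."""
    return all(counts[name] > 0 for name in SIGNATURES)


if __name__ == "__main__":
    # Sample lines modeled on the identifiers above (task ID is a placeholder).
    sample = [
        "[vob.net.vmnic.linkstate.down] vmnic vmnic0 linkstate down",
        "bnxtnet: bnxtnet_handle_hot_reset: 1234 Hot reset async evt: evt_data1 = 0x301.",
        "bnxtnet_hot_reset_task: 1234 Waiting for 1300 msecs for firmware reset completion.",
    ]
    counts = count_signatures(sample)
    print(counts, looks_like_hot_reset_outage(counts))
```

To use it on real data, pass the lines of both log files (for example, from an extracted support bundle) to `count_signatures`. A match on all three signatures only suggests this issue; it does not confirm the FEC root cause.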

Environment

VMware vSphere ESXi

Cause

  • The bnxtnet driver initiates a hot reset after receiving an asynchronous reset event from the NIC firmware. This is evidenced in the vmkernel.log by the following sequence:
    • bnxtnet_handle_hot_reset:3929: [vmnicX : <Task_ID>] Hot reset async evt: evt_data1 = 0x301, evt_data2 = 0xda21.
  • This event triggers the driver to enter a recovery state:
    • bnxtnet_hot_reset_task:3776: [vmnicX : <Task_ID> ] Waiting for 1300 msecs for firmware reset completion.
  • The root cause is a mismatch or improper configuration of the Forward Error Correction (FEC) mode on the physical switch ports.
  • Note on FEC Mode: Forward Error Correction (FEC) is a digital signal processing technique used to enhance data reliability. It adds redundant data (an error-correcting code) to a transmission, allowing the receiver to detect and correct errors without a retransmission. In high-speed networking (25G/100G), if the FEC settings (such as Clause 74 BASE-R FEC, or Clause 91/Clause 108 RS-FEC) do not match between the NIC and the switch, the physical link may flap or the firmware may trigger a reset to attempt re-synchronization.
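To illustrate the FEC principle described in the note, the sketch below uses a Hamming(7,4) code. This is a teaching example only: Ethernet FEC modes use much stronger codes, and nothing here reflects the NIC or switch implementation. It shows the receiver correcting a single flipped bit from the redundancy alone, with no retransmission.

```python
def encode(data_bits):
    """Encode 4 data bits into a 7-bit Hamming codeword.

    Codeword layout (positions 1..7): p1 p2 d1 p3 d2 d3 d4
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(codeword):
    """Correct up to one flipped bit and return (data_bits, error_position).

    error_position is 0 when no error was detected, otherwise 1..7.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s3
    if error_position:
        c[error_position - 1] ^= 1  # flip the corrupted bit back
    return [c[2], c[4], c[5], c[6]], error_position


if __name__ == "__main__":
    sent = encode([1, 0, 1, 1])
    received = list(sent)
    received[4] ^= 1  # simulate a single bit error on the wire
    data, position = decode(received)
    print(data, position)
```

The decoder only works because both ends agree on the same code. If the two ends use different FEC schemes, or only one end applies FEC, the receiver cannot interpret the redundancy, which is why mismatched settings destabilize the link.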

 

Resolution

Engage the hardware vendor to investigate the NIC firmware behavior and validate the following points:

    1. Switch Configuration: Review the Forward Error Correction (FEC) settings on the physical switch ports connected to the impacted ESXi hosts.

    2. FEC Alignment: Ensure the FEC clause/mode on the physical switch matches the requirements of the SFP+ modules and NIC firmware. Refer to the Broadcom KB article "Confirming FEC CL74 settings for Broadcom BCM57414 adapters in ESXi 8.0" for more information.

    3. Correct or Disable FEC: If the hardware vendor identifies a mismatch, setting the FEC mode correctly on the physical switch side, or disabling it, has been shown to stabilize the link and prevent firmware-initiated hot resets.

    4. Hardware Vendor Engagement: Provide the bnxtnet_handle_hot_reset log snippets to your hardware vendor to determine why the NIC firmware issued the reset command (evt_data1 = 0x301).

    5. Temporary Recovery: A reboot of the ESXi host may temporarily restore connectivity if the driver fails to recover from the hot reset automatically.
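When many hosts are affected, the FEC review in steps 1 to 3 can be tracked as a simple comparison of the FEC mode recorded for each NIC port against its connected switch port. The following is a minimal sketch, assuming the modes have already been collected by hand or from vendor tooling. The `LinkEnd` type, the port names, and the mode labels ("none", "cl74", "rs-fec") are hypothetical and do not correspond to any specific CLI output. Auto-negotiated FEC settings are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkEnd:
    """One end of a physical link (hypothetical inventory record)."""
    name: str  # e.g. "esx01:vmnic0" or "sw1:Eth1/1" (illustrative names)
    fec: str   # hypothetical labels such as "none", "cl74", "rs-fec"


def normalize(mode):
    """Compare modes case-insensitively and ignore surrounding whitespace."""
    return mode.strip().lower()


def find_fec_mismatches(links):
    """Return (nic, switch_port, nic_fec, switch_fec) for each mismatched link.

    links is an iterable of (nic_end, switch_end) pairs.
    """
    return [
        (nic.name, sw.name, nic.fec, sw.fec)
        for nic, sw in links
        if normalize(nic.fec) != normalize(sw.fec)
    ]


if __name__ == "__main__":
    links = [
        (LinkEnd("esx01:vmnic0", "cl74"), LinkEnd("sw1:Eth1/1", "CL74")),
        (LinkEnd("esx01:vmnic1", "cl74"), LinkEnd("sw2:Eth1/1", "rs-fec")),
    ]
    for mismatch in find_fec_mismatches(links):
        print("FEC mismatch:", mismatch)
```

Any link reported here is a candidate for the switch-side correction described in step 3, subject to confirmation by the hardware vendor.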

 

Additional Information

Refer also to the following Broadcom KB articles for known symptoms: