Network connectivity loss with Mellanox/NVIDIA adapters due to nmlx5 PCI communication errors

Article ID: 428766


Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

  • Network connectivity may be lost on ESXi hosts using Mellanox/NVIDIA ConnectX network adapters.
  • One or more vmnics may become unavailable, resulting in virtual machines losing network access and the host reporting uplink redundancy alerts.
  • The issue occurs when the nmlx5 driver encounters PCI communication errors while accessing the network adapter.
  • The adapter typically reports command timeouts first, then enters a non-responsive hardware state and stops processing traffic.

The following symptoms may be observed:

  • Virtual machines lose network connectivity or experience high latency
  • ESXi reports Network Uplink Redundancy Lost
  • The affected vmnic appears down or unresponsive
  • Repeated nmlx5 timeout and PCI error messages are logged
  • Network connectivity is not restored without a host power cycle
  • Log entries in /var/run/log/vmkernel* including but not limited to:
    • nmlx5_CoreAccessReg: command failed: Timeout
    • Failed accessing MTMP register
    • Device's health is compromised: PCI COMM error
    • Device internal error state is set
    • IO was aborted
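
To confirm these messages quickly, the logs can be searched from an SSH session on the host. This is a minimal sketch; exact rotated log file names may vary by build:

  # Search the live vmkernel log for nmlx5 timeouts and PCI errors
  grep -iE "nmlx5|PCI COMM|internal error state" /var/run/log/vmkernel.log

  # Include rotated (compressed) vmkernel logs in the search
  zcat /var/run/log/vmkernel.*.gz | grep -iE "nmlx5|PCI COMM|internal error state"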

Environment

VMware vSphere ESXi

Cause

This issue is caused by a hardware-level failure affecting communication between the ESXi host and the network adapter over the PCIe bus.

A PCI COMM error indicates that the adapter has entered an internal error state and is no longer able to respond to PCIe transactions.
Once this condition occurs, the nmlx5 driver cannot successfully access device registers or recover normal adapter operation.

Common causes include:

  • Defective network adapter hardware
  • Faulty PCIe slot, riser, or system board
  • Firmware deadlock on the network adapter
  • PCIe signal or power instability
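
When engaging the hardware vendor, it helps to correlate the affected vmnic with its PCIe address and device IDs. A minimal sketch, where vmnicX is a placeholder for the affected uplink:

  # Map the vmnic to its PCIe address and vendor/device IDs (15b3:xxxx indicates Mellanox/NVIDIA)
  vmkchdev -l | grep vmnicX

  # List PCIe device details for the host, including the network adapter
  esxcli hardware pci list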

Resolution

  1. Power off the ESXi host and perform a full cold power cycle.
  2. Power on the host and verify the status of the affected vmnic with "esxcli network nic get -n vmnicX" (see the example after this list).
  3. Verify that the installed NIC firmware and nmlx5 driver versions are supported according to the VMware Compatibility Guide (VCG).
  4. Engage the system vendor to run their hardware diagnostics and validate adapter and PCIe health.
  5. If the issue persists after a cold boot, move the adapter to a different PCIe slot to rule out a slot or riser fault, or replace the network adapter.
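
As an illustration of steps 2 and 3, the following commands can be run once the host is back online; vmnicX is a placeholder for the affected uplink:

  # Check the link status of all uplinks
  esxcli network nic list

  # Show detailed status for the affected vmnic, including driver and firmware versions
  esxcli network nic get -n vmnicX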

Additional Information

To validate the driver and firmware versions of the network adapters, refer to the article: Determining Network/Storage firmware and driver version in ESXi
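
For example, the versions reported by the host can be collected as follows for comparison against the VCG; vmnicX is a placeholder:

  # Driver name, driver version, and firmware version for the adapter
  esxcli network nic get -n vmnicX | grep -iE "driver|firmware|version"

  # Installed nmlx5 driver VIB versions
  esxcli software vib list | grep -i nmlx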