VM network connectivity loss on ESXi hosts using bnxtnet drivers with FIFO network scheduler
Article ID: 373273
Updated On:
Products
VMware vSphere ESXi
Issue/Introduction
VMs lose network connectivity until one of the workarounds below is performed.
The TX ring statistics for the relevant NIC will show a high "max tx ring full" drop count.
NIC Private statistics:
  port tx_pause_frames: 0
  port rx_pause_frames: 0
  tx timeout count: 0
  helper drop count: 0
  max tx ring full: 637  <------
This can be obtained on a live ESXi host by running the following command: /usr/lib/vmware/vm-support/bin/nicinfo.sh | grep 'NIC statistics\|ring full'
Example output (exact format may differ depending on vendor):
/usr/lib/vmware/vm-support/bin/nicinfo.sh | grep 'NIC statistics\|ring full'
NIC statistics for vmnic0:
  [tq0] ring full: XXX
NIC statistics for vmnic1:
  [tq0] ring full: XXX
This information can also be found under /commands/nicinfo.sh within a generated ESXi log bundle.
Hang logs from the vmxnet3 backend appear in the vmkernel.log of the ESXi host running the impacted VM.
The uplink NIC used by the impacted VM is on the default FIFO network scheduler. This can be checked from the ESXi host of the impacted VM using the vsish command (vsish -e cat /net/pNics/<vmnic>/sched/info), with the vmnic path amended to the relevant vmnic ID. The below is an example of the command for a NIC that is set to the FIFO scheduler and could be impacted:
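For example, for a NIC named vmnic0 the check would be run as shown below; a NIC still using the default FIFO scheduler reports fifo as the scheduler kind in the output (the exact output layout can vary between ESXi builds):

vsish -e cat /net/pNics/vmnic0/sched/info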
A bnxtnet async driver version earlier than 227.0.134.0 is in use within the environment.
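The installed driver and its version can be confirmed from the ESXi shell; a minimal sketch, using vmnic0 as an example NIC name:

esxcli network nic get -n vmnic0
esxcli software vib list | grep bnxtnet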
Cause
There is an issue in the bnxtnet async driver in which it fails to set the TX queue status to "stopped" to inform the upper layers when the TX ring is full. Under certain circumstances this leads to a race between the vNIC queue and the FIFO scheduler. The race can cause a vmxnet3 backend TX hang, which results in the VM losing network connectivity.
Resolution
Upgrade the bnxtnet async driver to version 227.0.134.0 or later for a long-term fix that allows the FIFO scheduler to continue to be used.
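A minimal sketch of applying an updated driver offline bundle from the ESXi shell, assuming the updated bnxtnet offline bundle has already been copied to a datastore (the bundle file name below is a placeholder; place the host in maintenance mode and reboot after installation):

esxcli software vib update -d /vmfs/volumes/datastore1/<bnxtnet-offline-bundle>.zip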
Workarounds
The following actions can temporarily clear the hang state for an individual VM; however, if the original trigger scenario is still present, the issue may reoccur:
Suspend and resume the virtual machine.
Reboot the virtual machine.
Disconnect and reconnect the vNIC of the VM.
As a short-term workaround while on the impacted driver version, the NIC scheduler can be changed to HClk; this prevents the issue from recurring, although one of the temporary actions above may still be needed to clear an existing hang state on a VM. The scheduler can be changed per host via the CLI by following the steps in the vSphere Networking Best Practices guide (under the section Network I/O Control Advanced Performance Options). Alternatively, enabling Network I/O Control (NIOC) on the DVS consuming the impacted pNICs also changes the scheduler to HClk and prevents the issue from occurring.
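After changing the scheduler or enabling NIOC, the vsish check shown earlier can be re-run to confirm the uplink no longer reports the FIFO scheduler (vmnic0 is an example NIC name):

vsish -e cat /net/pNics/vmnic0/sched/info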