VMs lose network connectivity on ESXi hosts with ntg3 driver (4.1.9.0) due to a TX hang between Ntg3XmitPktList and Ntg3TxCompletion

Article ID: 370372

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

1. Virtual machines (VMs) suddenly lose connectivity to some or all network destinations. Pings to those addresses fail.
2. While connectivity is lost, the vmxnet3 vNIC logs a "Hang detected" message in the ESXi kernel log, similar to the following:
"Vmxnet3: 21228: vmname,00:50:56:11:00:00, portID(1341010101): Hang detected,numHangQ: 4, enableGen: 9218"
3. Additionally, the following warning is logged in vmkernel.log:
"WARNING: Uplink: 21014: Queue 0 of device vmnicX stuck, resetting the device"
4. Connectivity is restored by moving the networking of the impacted VMs to another vmnic, on the same host or on a different host.
5. Flapping the vmnic link up/down does not restore connectivity.
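
The two log signatures above can be searched for in a saved copy of vmkernel.log. The following is a minimal sketch, not a supported tool: the regular expressions are modeled only on the two sample lines quoted in this article, and the function name `find_tx_hang_signatures` is hypothetical. It assumes the VM name contains no comma.

```python
import re

# Patterns modeled on the two sample log lines quoted in this article.
# Assumption: the VM name itself contains no comma.
VMXNET3_HANG = re.compile(
    r"Vmxnet3: \d+: (?P<vm>[^,]+),(?P<mac>[0-9a-fA-F:]{17}), "
    r"portID\((?P<port>\d+)\): Hang detected"
)
UPLINK_STUCK = re.compile(
    r"Uplink: \d+: Queue (?P<queue>\d+) of device (?P<nic>\S+) stuck, "
    r"resetting the device"
)


def find_tx_hang_signatures(lines):
    """Return (vmxnet3_hits, uplink_hits) found in an iterable of log lines."""
    vmxnet3_hits, uplink_hits = [], []
    for line in lines:
        m = VMXNET3_HANG.search(line)
        if m:
            vmxnet3_hits.append((m["vm"], m["mac"], m["port"]))
        m = UPLINK_STUCK.search(line)
        if m:
            uplink_hits.append((m["nic"], int(m["queue"])))
    return vmxnet3_hits, uplink_hits
```

Passing an open file object for a copy of vmkernel.log as `lines` returns the affected VM/MAC/port tuples and the vmnic/queue pairs. Seeing both signatures for the same time window is consistent with the symptoms described here, but it does not by itself prove this specific race.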

Environment

VMware vSphere ESXi 7.0.x
ntg3 driver version 4.1.9.0

Cause

The TX hang appears to be caused by a rare data race in the ntg3 driver between Ntg3XmitPktList and Ntg3TxCompletion.
To trigger it, Ntg3TxCompletion must mark the entire TX queue (TXQ) as completed (for example, taking it from almost full to empty) within the very narrow window in which Ntg3XmitPktList has found the TXQ full. The TXQ can then be left empty but paused, with no further completion to restart it.
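
The interleaving described above can be illustrated with a small deterministic model. This is a sketch under stated assumptions, not the ntg3 source: the article does not show the driver code, so the `TxQueue` class, its fields, and the split of the transmit path into a "check" step and a "pause" step are all hypothetical and stand in for the general stop/wake pattern.

```python
class TxQueue:
    """Toy model of a TX ring; not the real ntg3 data structures."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_flight = 0    # descriptors handed to the NIC, not yet completed
        self.paused = False   # True once the transmit path has stopped the queue

    def is_full(self):
        return self.in_flight >= self.capacity


def xmit_check_full(q):
    """Transmit path, step 1: observe whether the ring is full."""
    return q.is_full()


def xmit_pause(q):
    """Transmit path, step 2: pause the queue based on the earlier check."""
    q.paused = True


def tx_completion(q):
    """Completion path: reclaim all descriptors and restart a paused queue."""
    q.in_flight = 0
    if q.paused:
        q.paused = False


def racy_interleaving():
    q = TxQueue(capacity=4)
    q.in_flight = 4                 # ring is full
    saw_full = xmit_check_full(q)   # transmit path sees "full"
    tx_completion(q)                # entire ring completes inside the window
    if saw_full:
        xmit_pause(q)               # stale decision: pauses an empty queue
    return q


def benign_interleaving():
    q = TxQueue(capacity=4)
    q.in_flight = 4
    if xmit_check_full(q):
        xmit_pause(q)               # pause happens before the completion runs
    tx_completion(q)                # completion sees the pause and restarts
    return q
```

In the racy ordering the model ends with `in_flight == 0` and `paused == True`: nothing is outstanding, so no later completion arrives to restart the queue. In the benign ordering the completion runs after the pause and clears it.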

Resolution

The fix will be included in the next release of the ntg3 driver.

Additional Information

Workaround:
The hardware vendor has provided a bootleg driver version that contains the fix.
The bootleg driver adds a line in Ntg3NetPollCB so that, if the TXQ is empty but paused, Ntg3NetPollCB also calls Ntg3TxCompletion and the TXQ can restart.
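
The effect of that extra check can be sketched with a toy model. Everything here is an assumption made for illustration: the real Ntg3NetPollCB is not shown in this article, the `with_fix` flag is hypothetical, and the model assumes the unpatched poll path only runs completion when descriptors are outstanding.

```python
class TxQueue:
    """Toy TX queue state; not the real ntg3 data structures."""

    def __init__(self, in_flight=0, paused=False):
        self.in_flight = in_flight
        self.paused = paused


def tx_completion(q):
    """Reclaim all descriptors and restart the queue."""
    q.in_flight = 0
    q.paused = False


def net_poll(q, with_fix):
    """Toy poll callback. Returns True if it ran the completion path."""
    # Assumed original behavior: completion runs only when work is outstanding.
    if q.in_flight > 0:
        tx_completion(q)
        return True
    # Modeled workaround: an empty but paused queue also gets a completion
    # call, so it can restart.
    if with_fix and q.paused:
        tx_completion(q)
        return True
    return False


# A queue left in the stuck state: nothing in flight, yet paused.
stuck_without_fix = TxQueue(in_flight=0, paused=True)
ran_without_fix = net_poll(stuck_without_fix, with_fix=False)

stuck_with_fix = TxQueue(in_flight=0, paused=True)
ran_with_fix = net_poll(stuck_with_fix, with_fix=True)
```

Without the extra check the stuck queue stays paused after every poll. With it, the next poll restarts the queue.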