Symptoms :
1. Virtual machines (VMs) suddenly lose connectivity to some or all network destinations. Pings to those addresses fail.
2. While this occurs, the vmxnet3 vNIC logs a "Hang detected" message in the ESXi kernel log, similar to the following:
"Vmxnet3: 21228: vmname,00:50:56:11:00:00, portID(1341010101): Hang detected,numHangQ: 4, enableGen: 9218"
3. Additionally, a warning similar to the following is logged in vmkernel.log:
"WARNING: Uplink: 21014: Queue 0 of device vmnicX stuck, resetting the device"
4. Connectivity is restored by migrating the network of the impacted VMs to another vmnic on the same host or on a different host.
5. Flapping the vmnic link up and down does not help.
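To check a log for symptoms 2 and 3, the two messages can be matched with regular expressions. The following is a minimal sketch, not a supported tool: the patterns are derived only from the two sample lines quoted above, so other ESXi builds may format these messages differently, and the names `VMXNET3_HANG`, `UPLINK_STUCK` and `scan` are hypothetical.

```python
import re

# Patterns inferred from the two sample messages quoted in the symptoms.
# Assumption: other builds may word or space these messages differently.
VMXNET3_HANG = re.compile(
    r"Vmxnet3: \d+: (?P<vm>[^,]+),\s*(?P<mac>[0-9a-fA-F:]{17}),\s*"
    r"portID\((?P<port>\d+)\): Hang detected"
)
UPLINK_STUCK = re.compile(
    r"Uplink: \d+: Queue (?P<queue>\d+) of device (?P<nic>\S+) stuck, "
    r"resetting the device"
)


def scan(lines):
    """Return (hangs, resets) found in an iterable of log lines.

    hangs  -> list of (vm_name, mac, port_id) tuples
    resets -> list of (vmnic, queue_number) tuples
    """
    hangs, resets = [], []
    for line in lines:
        m = VMXNET3_HANG.search(line)
        if m:
            hangs.append((m.group("vm"), m.group("mac"), m.group("port")))
            continue
        m = UPLINK_STUCK.search(line)
        if m:
            resets.append((m.group("nic"), int(m.group("queue"))))
    return hangs, resets


# The two sample lines from the symptoms, used here as test input.
SAMPLE = [
    "Vmxnet3: 21228: vmname,00:50:56:11:00:00, portID(1341010101): "
    "Hang detected,numHangQ: 4, enableGen: 9218",
    "WARNING: Uplink: 21014: Queue 0 of device vmnicX stuck, "
    "resetting the device",
]

hangs, resets = scan(SAMPLE)
print("vNIC hangs:", hangs)
print("uplink resets:", resets)
```

To scan a real log, pass an open file object for a copy of vmkernel.log to `scan` instead of `SAMPLE`.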
Environment :
VMware vSphere ESXi 7.0.x
ntg3 driver version 4.1.9.0
Cause :
The TX hang appears to be caused by a rare data race in the ntg3 driver between Ntg3XmitPktList and Ntg3TxCompletion.
To trigger it, Ntg3TxCompletion must mark the entire TX queue (TXQ) as completed (for example, taking it from almost full to empty) within the very narrow window in which Ntg3XmitPktList finds that the TXQ is full.
Resolution :
The fix will be included in the next release of the ntg3 driver.
Workaround :
The hardware vendor has provided a bootleg driver version that contains the fix.
The bootleg driver adds a line in Ntg3NetPollCB so that, if the TXQ is empty but paused, Ntg3NetPollCB also calls Ntg3TxCompletion and the TXQ can restart.
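The race and the workaround can be illustrated with a small deterministic model. This is only a sketch of a generic "lost wake-up" pattern that matches the description above, not the ntg3 source code: the class `TxQueue`, the functions `tx_completion`, `poll_cb` and `run_race`, the ring size, and the exact ordering of steps are all assumptions made for illustration.

```python
class TxQueue:
    """Toy model of a TX ring (hypothetical, not the driver's structure)."""

    def __init__(self, size):
        self.size = size
        self.in_flight = 0    # descriptors handed to hardware, not yet completed
        self.paused = False   # True once the stack has been told to stop sending

    def is_full(self):
        return self.in_flight >= self.size


def tx_completion(q):
    """Stands in for Ntg3TxCompletion: reclaim everything, restart if paused."""
    q.in_flight = 0
    if q.paused:
        q.paused = False


def poll_cb(q, with_fix):
    """Stands in for Ntg3NetPollCB.

    Assumption: without the fix, completion runs only when work is pending.
    """
    if q.in_flight > 0:
        tx_completion(q)
    elif with_fix and q.paused:
        # Workaround: the queue is empty but paused, so run completion anyway.
        tx_completion(q)


def run_race(with_fix):
    """Replay the unlucky interleaving; return True if the queue stays paused."""
    q = TxQueue(size=4)
    q.in_flight = 4            # ring is full

    saw_full = q.is_full()     # transmit path observes "full" ...
    tx_completion(q)           # ... completion drains the whole ring in the
                               # window; queue is not paused yet, so no restart
    if saw_full:
        q.paused = True        # ... transmit path pauses on stale information

    # Nothing is in flight any more, so no further completion will arrive.
    poll_cb(q, with_fix)
    return q.paused


print("stuck without fix:", run_race(with_fix=False))
print("stuck with fix:   ", run_race(with_fix=True))
```

In this model the queue stays paused indefinitely without the extra check, because the only event that would restart it has already happened. With the check, the next poll notices the empty-but-paused state and restarts the queue.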