Symptoms:
Traffic that traverses the redirect rules and passes through the service VM periodically stops flowing. The condition cannot be diagnosed from the logs. Below are some ways to confirm it:
1. Run net-stats -A -t WwQqihVv > /<path>/<filename.txt>
a. Search the output for the service VM name and the vNIC (often eth1, but not always) that is connected to the service overlay segment.
b. The relevant section will look similar to the following:
{"name": "PaloAltoNetworks_PA-VM-NST_DepSpec (180).eth1", "switch": "DvsPortset-1", "id": 67108901, "mac": "xx:xx:xx:xx:xx:xx", "rxmode": 0, "tunemode": 0, "uplink": "false", "ens": "false", "promisc": "false", "sink": "false" , "txpps": 131644, "txmbps": 1303.2, "txsize": 1237, "txeps": 0.00, "rxpps": 131681, "rxmbps": 1303.5, "rxsize": 1237, "rxeps": 0.00, "vnic": { "type": "vmxnet3", "ring1sz": 1024, "ring2sz": 1024, "tsopct": 0.0, "tsotputpct": 0.0, "txucastpct": 100.0, "txeps": 0.0, "lropct": 0.0, "lrotputpct": 0.0, "rxucastpct": 100.0, "rxeps": 0.0, "maxqueuelen": 0, "requeuecnt": 0.0, "agingdrpcnt": 0.0, "deliveredByBurstQ": 0.0, "dropsByBurstQ": 0.0, "droppedbyQueuing": 0.0 , "txdisc": 0.0, "qstop": 0.0, "txallocerr": 0.0, "txtsosplit": 0.0, "r1full": 0.0, "r2full": 0.0, "sgerr": 0.0}, "rxqueue": { "count": 2, "details": [ {"intridx": 0, "pps": 7, "mbps": 0.0, "errs": 0.0}, {"intridx": 0, "pps": 131674, "mbps": 1303.5, "errs": 0.0} ]}, "txqueue": { "count": 2, "details": [ {"intridx": 0, "pps": 0, "mbps": 0.0, "errs": 0.0}, {"intridx": 0, "pps": 131646, "mbps": 1303.2, "errs": 0.0} ]},
c. Note that of the two Tx queues shown above, one is passing 131646 pps of traffic while the other shows 0 (see the sketches after this list for a quick way to locate these counters in the saved file).
2. Run the vsish command below. A difference of 1 between next2Tx and next2Comp indicates the issue (a scripted check across all queues is also sketched after this list).
a. vsish -e get /net/portsets/DvsPortset-<X>/ports/<switchport number>/vmxnet3/txqueues/<queue number>/status
i. Example:
Tx hang issue is present:
vsish -e get /net/portsets/DvsPortset-1/ports/100663335/vmxnet3/txqueues/1/status
status of a vmxnet3 vNIC tx queue {
intr index:0
stopped:0
error code:0
next2Tx:787    <-- Note: next2Tx and next2Comp differ by 1, so the issue is present.
next2Comp:788
genCount:348131
next2Write:788
next2Tx from timeout:980
next2Comp from timeout:788
timestamp in milliseconds in check:384765941
}
Tx hang issue is not present:
vsish -e get /net/portsets/DvsPortset-1/ports/100663335/vmxnet3/txqueues/0/status
status of a vmxnet3 vNIC tx queue {
intr index:0
stopped:0
error code:0
next2Tx:663    <-- Note: next2Tx and next2Comp are equal; the Tx hang issue is not present.
next2Comp:663
genCount:780117
next2Write:663
next2Tx from timeout:598
next2Comp from timeout:597
timestamp in milliseconds in check:0
}
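For step 1 above, a quick way to locate the service VM's vNIC entries in the saved net-stats output is sketched below. This is a minimal sketch: the /tmp/netstats.txt path and the "PA-VM" name fragment are assumptions taken from the example above, so substitute your own file path and service VM name.
# Assumption: output was saved to /tmp/netstats.txt and the service VM name contains "PA-VM".
# List the vNIC entries that belong to the service VM:
grep -o '"name": "[^"]*PA-VM[^"]*"' /tmp/netstats.txt
# Then open the file and search for that name to inspect the "txqueue" details:
less /tmp/netstats.txt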
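For step 2 above, the index check can be scripted across all Tx queues of a port instead of querying each queue by hand. A minimal sketch, assuming the DvsPortset-1 portset and switchport 100663335 from the example; substitute the values for your own service VM vNIC:
# Assumption: portset and switchport number are taken from the example above.
P=/net/portsets/DvsPortset-1/ports/100663335/vmxnet3/txqueues
for q in $(vsish -e ls $P); do
  echo "txqueue ${q%/}:"
  # next2Tx and next2Comp should be equal; a difference of 1 means the queue is hung
  vsish -e get $P/${q}status | grep -E 'next2Tx:|next2Comp:'
done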
Cause:
In a working scenario, the SPF port code calls an ESXi function to forward packets from the Guest VM while the Guest VM port is active. In this case, because the Guest VM port underwent a reset triggered by a snapshot of the VM, the ESXi hypervisor was unable to process the packets being sent from the Guest VM vNIC port. As a result, the I/O completion for the packet is missed and the hypervisor frees the packet anyway, which causes the Tx queue to hang and stop processing traffic.
Resolution:
The issue is resolved in the following ESXi releases:
ESXi 7.0.3P08/7.0 U3o
ESXi 8.0.1P02/8.0 U1c
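To check whether a host already runs a build that contains the fix, display the installed version and build from the ESXi shell:
# Show the installed ESXi version, build number, and update level:
vmware -vl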
Workaround:
Once a Tx queue is hung, the datapath flowing through that queue remains broken until the vNIC on which the hung queue sits is reset (disconnect and reconnect it in the VM settings).
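After the reset, recovery can be confirmed by repeating the vsish check from step 2. Note that the switchport number can change when the vNIC reconnects, so re-run net-stats first to find the current port. A minimal sketch, reusing the example values from above:
# Assumption: portset/port/queue numbers are the example values above; re-run
# net-stats first, since the switchport number can change after the reconnect.
vsish -e get /net/portsets/DvsPortset-1/ports/100663335/vmxnet3/txqueues/1/status | grep -E 'next2Tx:|next2Comp:'
# In a recovered queue, next2Tx and next2Comp are equal again.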