Symptoms:
Traffic that traverses the redirect rules and passes through the service VM periodically stops flowing. The condition cannot be diagnosed from the logs. Below are some ways to confirm it:
1. Run net-stats -A -t WwQqihVv > /<path>/<filename.txt>
a. Search the output for the service VM name and the vNIC (often eth1, but not always) that is connected to the service overlay segment.
b. The relevant section will look similar to the following:
{"name": "PaloAltoNetworks_PA-VM-NST_DepSpec (180).eth1", "switch": "DvsPortset-1", "id": 67108901, "mac": "xx:xx:xx:xx:xx:xx", "rxmode": 0, "tunemode": 0, "uplink": "false", "ens": "false", "promisc": "false", "sink": "false" , "txpps": 131644, "txmbps": 1303.2, "txsize": 1237, "txeps": 0.00, "rxpps": 131681, "rxmbps": 1303.5, "rxsize": 1237, "rxeps": 0.00, "vnic": { "type": "vmxnet3", "ring1sz": 1024, "ring2sz": 1024, "tsopct": 0.0, "tsotputpct": 0.0, "txucastpct": 100.0, "txeps": 0.0, "lropct": 0.0, "lrotputpct": 0.0, "rxucastpct": 100.0, "rxeps": 0.0, "maxqueuelen": 0, "requeuecnt": 0.0, "agingdrpcnt": 0.0, "deliveredByBurstQ": 0.0, "dropsByBurstQ": 0.0, "droppedbyQueuing": 0.0 , "txdisc": 0.0, "qstop": 0.0, "txallocerr": 0.0, "txtsosplit": 0.0, "r1full": 0.0, "r2full": 0.0, "sgerr": 0.0}, "rxqueue": { "count": 2, "details": [ {"intridx": 0, "pps": 7, "mbps": 0.0, "errs": 0.0}, {"intridx": 0, "pps": 131674, "mbps": 1303.5, "errs": 0.0} ]}, "txqueue": { "count": 2, "details": [ {"intridx": 0, "pps": 0, "mbps": 0.0, "errs": 0.0}, {"intridx": 0, "pps": 131646, "mbps": 1303.2, "errs": 0.0} ]},
c. Note that of the two Tx queues shown above, one is passing 131646 pps of traffic while the other shows 0 (see the sketches after this list for a quick way to locate these counters in the saved file).
2. Run the vsish command below. A difference of 1 between next2Tx and next2Comp indicates the issue (a scripted check across all queues is also sketched after this list).
a. vsish -e get /net/portsets/DvsPortset-<X>/ports/<switchport number>/vmxnet3/txqueues/<queue number>/status
i. Example:
Tx hang issue is present:
vsish -e get /net/portsets/DvsPortset-1/ports/100663335/vmxnet3/txqueues/1/status
status of a vmxnet3 vNIC tx queue {
intr index:0
stopped:0
error code:0
next2Tx:787    <-- Note: next2Tx and next2Comp differ by 1, so the issue is present.
next2Comp:788
genCount:348131
next2Write:788
next2Tx from timeout:980
next2Comp from timeout:788
timestamp in milliseconds in check:384765941
}
Tx hang issue is not present:
vsish -e get /net/portsets/DvsPortset-1/ports/100663335/vmxnet3/txqueues/0/status
status of a vmxnet3 vNIC tx queue {
intr index:0
stopped:0
error code:0
next2Tx:663    <-- Note: next2Tx and next2Comp are equal; the Tx hang issue is not present.
next2Comp:663
genCount:780117
next2Write:663
next2Tx from timeout:598
next2Comp from timeout:597
timestamp in milliseconds in check:0
}
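For step 1 above, a quick way to locate the service VM's vNIC entries in the saved net-stats output is sketched below. This is a minimal sketch: the /tmp/netstats.txt path and the "PA-VM" name fragment are assumptions taken from the example above, so substitute your own file path and service VM name.
# Assumption: output was saved to /tmp/netstats.txt and the service VM name contains "PA-VM".
# List the vNIC entries that belong to the service VM:
grep -o '"name": "[^"]*PA-VM[^"]*"' /tmp/netstats.txt
# Then open the file and search for that name to inspect the "txqueue" details:
less /tmp/netstats.txt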
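For step 2 above, the index check can be scripted across all Tx queues of a port instead of querying each queue by hand. A minimal sketch, assuming the DvsPortset-1 portset and switchport 100663335 from the example; substitute the values for your own service VM vNIC:
# Assumption: portset and switchport number are taken from the example above.
P=/net/portsets/DvsPortset-1/ports/100663335/vmxnet3/txqueues
for q in $(vsish -e ls $P); do
  echo "txqueue ${q%/}:"
  # next2Tx and next2Comp should be equal; a difference of 1 means the queue is hung
  vsish -e get $P/${q}status | grep -E 'next2Tx:|next2Comp:'
done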
Cause:
In a working scenario, the SPF port code calls an ESXi function to forward packets from the Guest VM while the Guest VM port is active. In this case, because the Guest VM port underwent a reset triggered by a snapshot of the VM, the ESXi hypervisor was unable to process the packets being sent from the Guest VM vNIC port. As a result, the I/O completion for the packet is missed and the hypervisor frees the packet anyway, which causes the Tx queue to hang and stop processing traffic.
Resolution:
The issue is resolved in the following ESXi releases:
ESXi 7.0.3P08/7.0 U3o
ESXi 8.0.1P02/8.0 U1c
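To check whether a host already runs a build that contains the fix, display the installed version and build from the ESXi shell:
# Show the installed ESXi version, build number, and update level:
vmware -vl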
Workaround:
Once a Tx queue is hung, the datapath flowing through that queue remains broken until the vNIC on which the hung queue sits is reset (disconnect and reconnect it in the VM settings).
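After the reset, recovery can be confirmed by repeating the vsish check from step 2. Note that the switchport number can change when the vNIC reconnects, so re-run net-stats first to find the current port. A minimal sketch, reusing the example values from above:
# Assumption: portset/port/queue numbers are the example values above; re-run
# net-stats first, since the switchport number can change after the reconnect.
vsish -e get /net/portsets/DvsPortset-1/ports/100663335/vmxnet3/txqueues/1/status | grep -E 'next2Tx:|next2Comp:'
# In a recovered queue, next2Tx and next2Comp are equal again.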