Rx Errors observed on Edge nodes

Article ID: 381857

Updated On: 05-07-2025

Products

VMware NSX

Issue/Introduction

  • Rx error counters can be observed increasing on Edge nodes with the command: get physical-port fp-ethX stats
    "rx_bytes": 50938244949145,
    "rx_drop_no_match": 131405052,
    "rx_errors": 1518865,
    "rx_misses": 0,
    "rx_nombufs": 0,
    "rx_packets": 44566952565,
    "tx_bytes": 48976905544684,
    "tx_drops": 0,
    "tx_errors": 0,
    "tx_packets": 42340387881
  • When checked from the CLI of the ESXi host that is hosting the Edge node, the switchport stats show that the rx_errors counter is equal (or very close) to the counter "number of times packets are dropped by rx try lock queueing" (one way to locate these stats on the host is sketched after the output below):
    stats of a vmxnet3 vNIC rx queue {
     LRO pkts rx ok:0
     LRO bytes rx ok:0
     pkts rx ok:45234120478
     bytes rx ok:51673765088900
     unicast pkts rx ok:45100243811
     unicast bytes rx ok:51662023759860
     multicast pkts rx ok:106833514
     multicast bytes rx ok:9166380254
     broadcast pkts rx ok:27043153
     broadcast bytes rx ok:2574948786
     running out of buffers:0
     pkts receive error:0
     1st ring size:4096
     2nd ring size:4096
     # of times the 1st ring is full:0
     # of times the 2nd ring is full:0
     fail to map a rx buffer:0
     request to page in a buffer:0
     # of times rx queue is stopped:0
     failed when copying into the guest buffer:0
     # of pkts dropped due to large hdrs:0
     # of pkts dropped due to max number of SG limits:0
     pkts rx via data ring ok:0
     bytes rx via data ring ok:0
     Whether rx burst queuing is enabled:0
     current backend burst queue length:0
     maximum backend burst queue length so far:0
     aggregate number of times packets are requeued:0
     aggregate number of times packets are dropped by PktAgingList:0
     # of pkts dropped due to large inner (encap) hdrs:0
     number of times packets are dropped by burst queue:0
       number of times packets are dropped by rx try lock queueing:1519081
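
    One way to locate these switchport stats on the ESXi host is sketched below; the portset name and port number are placeholders to look up in your own environment, and the exact vsish paths can vary between ESXi versions, so treat this as an illustration rather than an exact procedure:
    net-stats -l
    (lists the port number, portset name and client/VM name of each switchport; identify the entry for the Edge vNIC)
    vsish -e get /net/portsets/<portset-name>/ports/<port-number>/vmxnet3/rxSummary
    (prints the "stats of a vmxnet3 vNIC rx queue" block shown above)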

Environment

VMware NSX 4.x
VMware NSX-T Data Center 3.x

Cause

  • Explanation of the "number of times packets are dropped by rx try lock queueing" counter in the switchport stats:

    When multiple vmkernel networking threads try to deliver packets to the same vNIC concurrently, some serialization is needed. To avoid having a thread spin waiting for a lock, which wastes CPU cycles, those packets are queued up instead so that the delivering thread can continue with other work. Any such queue, however, has a size limit, and when that limit is reached packets are dropped.
    A maximum queue size is needed to avoid queuing up too much packet buffer memory, which would deplete the available packet memory and could start affecting traffic for other vNICs or vmknics.

  • RX errors: These drops can happen when multiple pollWorlds deliver packets to the same vNIC queue. There is an upper bound (default 256) on how many packets can be queued before they are processed for Rx delivery. If the number of incoming packets exceeds this limit, the excess packets are dropped (see the sketch below for one way to confirm this is the drop path in use).
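
    One way to confirm that this is the drop path in play is to sample both counters a few minutes apart and check that they increase at roughly the same rate; a rough sketch, reusing the commands from the Issue section with placeholder interface, portset and port names:
    On the Edge node, note the current rx_errors value:
    get physical-port fp-eth0 stats
    On the ESXi host, note the current value of the try lock drop counter for the Edge vNIC's switchport:
    vsish -e get /net/portsets/<portset-name>/ports/<port-number>/vmxnet3/rxSummary | grep "rx try lock"
    If, after a few minutes, rx_errors on the Edge has grown by approximately the same amount as the try lock drop counter on the host, the drops described above are the cause.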

Resolution

Try increasing the queue size from the default of 256 to 512 or 1024 on the ESXi host the Edge is running on, using the commands below. There is a queue size and a processing batch size that work in cooperation; both will be modified, but the steps differ depending on the ESXi version.

On version 7.x, the value of the advanced configuration option Vmxnet3RxPollBound controls both the processing batch size (Poll) and the software queue size (Queue). To change both values, use this command:
esxcfg-advcfg --set 512 /Net/Vmxnet3RxPollBound

On versions 8.0 and later, these advanced configuration values are modified separately, and it is recommended to set the Queue size to double the Poll size, bearing in mind the maximum is 4096. To change the values, use these commands:
esxcfg-advcfg --set 1024 /Net/Vmxnet3RxQueueBound
esxcfg-advcfg --set 512 /Net/Vmxnet3RxPollBound

The current values can be verified with the commands below:
esxcfg-advcfg --get /Net/Vmxnet3RxPollBound
esxcfg-advcfg --get /Net/Vmxnet3RxQueueBound

Once the changes have been made, the vNIC needs to be reset, or the Edge VM has to be powered off and back on.
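
As a quick post-change check, assuming the same placeholder interface, portset and port names used earlier, the counters from the Issue section can be re-sampled after the reset or power cycle:
get physical-port fp-eth0 stats
(on the Edge node; rx_errors should stop increasing, or increase far more slowly)
vsish -e get /net/portsets/<portset-name>/ports/<port-number>/vmxnet3/rxSummary
(on the ESXi host; the "rx try lock queueing" drop counter should likewise stop increasing)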

Additional Information

Increasing the queue size beyond the default may result in higher latency.
A support case may be required if the issue is not fixed by increasing the queue size or if the change results in longer than expected latencies.

If you are contacting Broadcom support about this issue, please provide the following:

  • NSX Manager support bundles.
  • ESXi host support bundles for the hosts running the affected Edge nodes.
  • Text of any error messages seen in NSX GUI or command lines pertinent to the investigation.

Handling Log Bundles for offline review with Broadcom support.