Intermittent network issues during vSAN/vMotion traffic with qedentv driver

Article ID: 317657

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:
In rare cases, hosts using the qedentv driver experience intermittent network issues under heavy vSAN or vMotion traffic, causing vMotion operations to fail or vSAN transactions to report errors. For vSAN/vMotion traffic, vmkernel instantiates a netqueue or netqueue RSS, as appropriate, under certain conditions. Under high load, an intermittent timing issue in the qedentv driver prevents receive traffic from flowing correctly through the netqueue. As a result, the MAC filter associated with the vmknic used for vSAN/vMotion traffic can move rapidly back and forth between the default queue and the netqueue. vSAN may then report heartbeat failures, and vMotion may fail because of a timeout connecting to the destination host or an incomplete transfer of VM memory pages.
 
MAC filter movement between the default queue and a netqueue is normal in itself. However, when the filter moves rapidly and the movement is accompanied by traffic failures, it likely indicates this problem. In that case, messages similar to those shown below appear repeatedly in the vmkernel logs, possibly interspersed with vSAN/vMotion failure messages.

vmnic2)]Removing mac:00:50:56:62:c8:9a, vlan_id:0x0, from fp:0, op:MAC_DEL, hw_fn:0
vmnic2)]Applying 00:50:56:62:c8:9a filter, vlan_id:0xffff, fp_id:1, hw_fn:0.
vmnic3)]Feature RSS needed.
<snip>
WARNING: VMotionUtil: 4060: 1195397824256221929 S: Stream completion work failed: Timeout
WARNING: Migrate: 273: 1195397824256221929 S: Failed: Timeout (0xbad0021) @0x4180146cb675
WARNING: VMotionUtil: 850: 1195397824256221929 S: failed to read stream keepalive: Connection closed by remote host, possibly due to timeout
<snip>
vmnic2)]Removing mac:00:50:56:62:c8:9a, vlan_id:0x0, from fp:1, op:MAC_DEL, hw_fn:0
vmnic2)]Applying 00:50:56:62:c8:9a filter, vlan_id:0xffff, fp_id:0, hw_fn:0.
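
To judge whether the filter movement is unusually frequent, the "Applying ... filter" messages can be counted per NIC and MAC address. The following is a minimal sketch, not a supported diagnostic tool: the function name count_filter_moves is hypothetical, and it assumes the log lines match the format of the excerpt above.

```shell
# Hypothetical helper: count "Applying <mac> filter" events per NIC and MAC.
# Reads vmkernel log text on stdin. Assumes the message format shown in the
# excerpt above; other driver versions may log differently.
count_filter_moves() {
    awk '
        /Applying .* filter/ {
            nic = "unknown"
            mac = "unknown"
            # The NIC name appears just before ")]" in the driver message.
            if (match($0, /vmnic[0-9]+/)) nic = substr($0, RSTART, RLENGTH)
            # The MAC address is the field following the word "Applying".
            for (i = 1; i < NF; i++) if ($i ~ /Applying$/) mac = $(i + 1)
            count[nic " " mac]++
        }
        END { for (k in count) print count[k], k }
    ' | sort -rn
}
```

The function could be fed the vmkernel log, for example `count_filter_moves < /var/log/vmkernel.log` (the usual log location on ESXi, stated here as an assumption). A high count for a single vmknic MAC over a short period, together with vSAN or vMotion failures, is consistent with the symptoms described above. The count alone is not proof, because filter movement is normal.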


Cause

The problem is caused by a corner-case timing condition in the qedentv driver during the netqueue delete operation. This condition can lead to a mismatch in indices within the interrupt generation logic on the adapter, which impacts receive traffic.

Resolution

This issue is fixed in qedentv driver version 3.11.7.0 and later releases. Update the driver to a currently supported version available at the Broadcom Support Portal; see KB 366685 for details.
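
To check whether a host already runs a fixed driver, the installed version can be compared against 3.11.7.0. The sketch below is illustrative only: version_ge is a hypothetical helper, and it assumes the version string starts with dotted numeric components, ignoring any non-numeric suffix.

```shell
# Hypothetical helper: succeed (exit 0) when dotted version $1 >= $2.
# Components are compared numerically, left to right; missing components
# count as 0 and a non-numeric suffix on a component is ignored.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, ".")
        m = split(b, y, ".")
        max = (n > m) ? n : m
        for (i = 1; i <= max; i++) {
            xi = (i <= n) ? x[i] + 0 : 0
            yi = (i <= m) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}
```

The installed version can be read from the output of `esxcli software vib list | grep qedentv` and passed in by hand, for example `version_ge 3.11.16.0 3.11.7.0 && echo "fixed driver installed"`. The version 3.11.16.0 here is only a placeholder for whatever the host reports.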

Workaround:
As a workaround, disable netqueues on the qedentv interfaces by setting the driver module parameters as shown below. The example assumes there are four qedentv instances.

[root@host:~] esxcfg-module -g qedentv
qedentv enabled = 1 options = ''
[root@host:~] esxcfg-module -s "num_queues=0,0,0,0 RSS=0,0,0,0" qedentv
[root@host:~] esxcfg-module -g qedentv
qedentv enabled = 1 options = 'num_queues=0,0,0,0 RSS=0,0,0,0'


Reboot the system for the settings to take effect. The settings apply to all NICs managed by the qedentv driver.
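
On hosts with a different number of qedentv instances, the option string needs one comma-separated value per instance. The following sketch builds that string for a given count. The function name build_qedentv_opts is hypothetical, and the assumption that each instance needs exactly one value per parameter is taken from the four-instance example above.

```shell
# Hypothetical helper: print the module option string that disables
# netqueue and RSS for the given number of qedentv instances.
build_qedentv_opts() {
    count="$1"
    zeros=""
    i=0
    while [ "$i" -lt "$count" ]; do
        # Append "0", adding a comma separator after the first entry.
        zeros="${zeros:+$zeros,}0"
        i=$((i + 1))
    done
    printf 'num_queues=%s RSS=%s\n' "$zeros" "$zeros"
}
```

For a host with two instances, the result could be applied with `esxcfg-module -s "$(build_qedentv_opts 2)" qedentv`. The instance count can be determined from the Driver column of `esxcli network nic list`.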

Note that disabling netqueue has some performance impact. The magnitude depends on the individual workload and should be characterized before deploying the workaround in production. In most cases, however, the impact is not noticeable.