VMNIC link may drop on Cisco UCS Host running certain nenic driver versions

Article ID: 330229

Products

VMware NSX

Issue/Introduction

Symptoms:
- Cisco UCS hosts running ESXi 6.5 with a nenic driver version earlier than 1.0.11.0 may lose network connectivity.
- VMNICs/pNICs tied to VXLAN VTEPs may lose network connectivity.
- In UCS Manager, the affected VIFs are in an Error Disabled state.
- Bouncing the VIF from UCS or the vmnic with esxcli does not bring the link back up.
- The following sequence of messages appears in /var/log/vmkernel.log on an affected ESXi host:

cpu26:66568)WARNING: nenic: enic_queue_wq_cont:943: [000:010:00.0] Failed to get pkt sg elem
cpu26:66568)WARNING: nenic: enic_tq_xmit_pkt:1208: [000:010:00.0] Drop packet!
cpu26:66568)WARNING: nenic: enic_queue_wq_cont:943: [000:010:00.0] Failed to get pkt sg elem
cpu26:66568)WARNING: nenic: enic_tq_xmit_pkt:1208: [000:010:00.0] Drop packet!
cpu22:67077)WARNING: nenic: enic_log_q_error:137: [000:010:00.0] WQ[0] error_status 10
cpu22:67077)WARNING: nenic: enic_isr_msix_err:188: [000:010:00.0] Scheduled soft reset to recover from error
cpu49:66235)nenic: enic_soft_reset_helper:2395: [000:010:00.0] Resetting
cpu49:66235)nenic: enic_soft_reset_helper:2410: [000:010:00.0] Reset completed
cpu29:66322)nenic: enic_link_check:247: [000:010:00.0] Link DOWN
cpu29:66322)netschedHClk: NetSchedHClkNotify:2892: vmnic5: link down notification
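
The log sequence above can be searched for mechanically. The following is a minimal sketch, assuming the message formats are exactly those in the excerpt; the function names and the rule that a device must show all three messages are illustrative choices, not part of any VMware or Cisco tool.

```python
import re

# Patterns derived only from the vmkernel.log excerpt in this article.
# The line numbers inside the messages (e.g. ":943:") are matched loosely
# because they may differ between driver builds (assumption).
SIGNATURES = {
    "sg_elem_failure": re.compile(
        r"nenic: enic_queue_wq_cont:\d+: \[(?P<pci>[^\]]+)\] Failed to get pkt sg elem"
    ),
    "wq_error": re.compile(
        r"nenic: enic_log_q_error:\d+: \[(?P<pci>[^\]]+)\] WQ\[\d+\] error_status \d+"
    ),
    "link_down": re.compile(
        r"nenic: enic_link_check:\d+: \[(?P<pci>[^\]]+)\] Link DOWN"
    ),
}


def scan_vmkernel_log(lines):
    """Map each PCI address to the set of signature names seen for it."""
    hits = {}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            match = pattern.search(line)
            if match:
                hits.setdefault(match.group("pci"), set()).add(name)
    return hits


def affected_devices(lines):
    """Return PCI addresses that logged all three signatures (hypothetical rule)."""
    wanted = set(SIGNATURES)
    return sorted(pci for pci, names in scan_vmkernel_log(lines).items() if names == wanted)
```

One way to use it is to run it against a copy of the log, for example `affected_devices(open("vmkernel.log"))`, and then cross-check any reported device against the VIF state in UCS Manager.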


Cause

This issue is caused by Cisco bug CSCvf36545 in the nenic driver: https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvf36545/?rfs=iqvred

The vmkernel can pass the driver a packet whose frame length, as returned by vmk_PktFrameLenGet(), is smaller than the sum of its individual scatter-gather (sg) element lengths. Drivers are expected to handle such packets and program the accurate length into the hardware descriptors. Affected nenic versions fail to do so when the packet has exactly one sg element.
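
To make the length mismatch concrete, here is a small illustrative model of what "programming the accurate length" means. It is a sketch only: `descriptor_lengths` is a hypothetical helper, the byte counts are made up, and it does not represent the nenic driver's actual code or Cisco's fix.

```python
def descriptor_lengths(frame_len, sg_lengths):
    """Clamp per-element lengths so that they sum to the frame length.

    frame_len  -- packet length as the frame-length API would report it
    sg_lengths -- lengths of the individual scatter-gather elements
    Hypothetical model for illustration; not driver code.
    """
    remaining = frame_len
    programmed = []
    for sg_len in sg_lengths:
        if remaining <= 0:
            break  # frame is already fully described; ignore trailing elements
        use = min(sg_len, remaining)
        programmed.append(use)
        remaining -= use
    if remaining > 0:
        raise ValueError("frame length exceeds total scatter-gather length")
    return programmed


# Single-element case: the element is 64 bytes long but the frame is only 60.
# The descriptor should carry 60, not 64.
print(descriptor_lengths(60, [64]))
# Multi-element case: the last element is trimmed to fit the frame.
print(descriptor_lengths(100, [64, 64]))
```

The single-element call corresponds to the case the article identifies as the trigger: using the element's own length instead of the frame length would hand the hardware a descriptor that disagrees with the packet.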

Resolution

This issue is fixed in nenic driver version 1.0.11.0 and later, as noted in the Cisco VIC Release Notes: https://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/release/notes/VIC/3-2/b_CiscoVIC_Drivers-RN-3-2.pdf

CSCvf36545: This fix addresses the issue of a length mismatch between the VMWare packet frame length API function and the sum of the individual fragment length in cases of a single fragment. With this native ENIC driver, it now properly handles such packets.

Workaround:
Per Cisco bug CSCvf36545, rebooting the host recovers the uplinks from the Err-Disabled state. However, the issue may recur during the next upgrade. It is therefore recommended to upgrade the nenic driver on the ESXi host to version 1.0.11.0 or later before upgrading NSX.
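
Before an NSX upgrade, the installed driver version can be compared against the fixed release. The sketch below assumes you already have the version string (for example, from the nenic entry in the output of `esxcli software vib list`); the helper names are hypothetical, and the build-suffix format in the example comment is an assumption.

```python
import re

# First nenic release containing the fix, per this article.
FIXED_VERSION = (1, 0, 11, 0)


def parse_driver_version(version):
    """Extract the leading dotted-numeric part of a version string as a tuple.

    Anything after the numeric part (such as a hypothetical "-1OEM..." build
    suffix) is ignored. Short versions are zero-padded to four components.
    """
    match = re.match(r"\d+(?:\.\d+)*", version.strip())
    if not match:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = tuple(int(p) for p in match.group(0).split("."))
    return parts + (0,) * (len(FIXED_VERSION) - len(parts))


def needs_upgrade(version):
    """True if the given nenic version is earlier than 1.0.11.0."""
    return parse_driver_version(version) < FIXED_VERSION
```

A host for which `needs_upgrade(...)` returns `True` should have its nenic driver updated before the NSX upgrade starts.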

Additional Information

https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvf36545/?rfs=iqvred