NSX Bare Metal Edge NIC flapping

Article ID: 324169

Products

VMware NSX Networking

Issue/Introduction

Symptoms:
  • NSX 4.1.1 and above
  • Traffic on Bare Metal Edge experiences datapath disruption
  • NSX UI may report "Edge NIC Transmit Queue Overflow" alarms
  • Edge syslog shows frequent "TX hang detected" messages, each followed by a NIC reset (every 20 seconds in the example below):
2024-04-05T17:14:42.461Z Edge NSX 13757 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats43" level="WARN"] NIC fp-eth2 queue 3 TX hang detected
2024-04-05T17:14:42.461Z Edge NSX 13757 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats43" level="WARN"] NIC fp-eth2 reset successfully
2024-04-05T17:15:02.461Z Edge NSX 13757 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats43" level="WARN"] NIC fp-eth2 queue 1 TX hang detected
2024-04-05T17:15:02.461Z Edge NSX 13757 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats43" level="WARN"] NIC fp-eth2 reset successfully
2024-04-05T17:15:22.461Z Edge NSX 13757 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats43" level="WARN"] NIC fp-eth2 queue 7 TX hang detected
2024-04-05T17:15:22.461Z Edge NSX 13757 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats43" level="WARN"] NIC fp-eth2 reset successfully
2024-04-05T17:15:42.461Z Edge NSX 13757 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats43" level="WARN"] NIC fp-eth2 queue 2 TX hang detected
2024-04-05T17:15:42.461Z Edge NSX 13757 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats43" level="WARN"] NIC fp-eth2 reset successfully
2024-04-05T17:16:02.461Z Edge NSX 13757 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats43" level="WARN"] NIC fp-eth2 queue 1 TX hang detected
2024-04-05T17:16:02.461Z Edge NSX 13757 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats43" level="WARN"] NIC fp-eth2 reset successfully
2024-04-05T17:16:22.461Z Edge NSX 13757 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats43" level="WARN"] NIC fp-eth2 queue 4 TX hang detected
2024-04-05T17:16:22.461Z Edge NSX 13757 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats43" level="WARN"] NIC fp-eth2 reset successfully
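
To gauge how often the resets occur, the log excerpt above can be summarized per interface. The following is a minimal Python sketch, not part of the attached scripts. It assumes the message wording matches the excerpt exactly; the function name summarize_tx_hangs is illustrative.

```python
import re
from collections import Counter

# Patterns taken from the datapathd messages in the excerpt above.
HANG_RE = re.compile(r"NIC (fp-eth\d+) queue (\d+) TX hang detected")
RESET_RE = re.compile(r"NIC (fp-eth\d+) reset successfully")


def summarize_tx_hangs(lines):
    """Count TX hang detections and NIC resets per interface."""
    hangs, resets = Counter(), Counter()
    for line in lines:
        match = HANG_RE.search(line)
        if match:
            hangs[match.group(1)] += 1
            continue
        match = RESET_RE.search(line)
        if match:
            resets[match.group(1)] += 1
    return hangs, resets


# Shortened sample lines; real input would be the Edge syslog.
sample = [
    'level="WARN"] NIC fp-eth2 queue 3 TX hang detected',
    'level="WARN"] NIC fp-eth2 reset successfully',
]
hangs, resets = summarize_tx_hangs(sample)
for nic in sorted(hangs):
    print(f"{nic}: {hangs[nic]} TX hangs, {resets[nic]} resets")
```

A hang count that closely tracks the reset count on the same interface is consistent with the pattern described in this article.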


Environment

VMware NSX 4.1.1 and above

Cause

NSX 4.1.1 introduced a check that resets an Edge NIC when a TX hang condition is detected. This mechanism works as designed on Edge VMs. On a Bare Metal Edge, it may incorrectly diagnose a TX hang condition, resulting in frequent NIC resets.

Resolution

This is a known issue impacting NSX Bare Metal Edge.

Workaround:
To work around the issue immediately, disable the NIC reset feature.

On the Bare Metal Edge, run the following command as the root user:

# edge-appctl -t /var/run/vmware/edge/dpd.ctl stats/hung_nic_reset disable

Note: after applying the workaround, the Bare Metal Edge continues to log the following messages in syslog. Both can be safely ignored.
  • "edge_nic_transmit_queue_overflow" alarm with a processed packet count of 0:
2024-04-05T19:58:43.372Z Edge1 NSX 9458 - [nsx@6876 comp="nsx-edge" s2comp="nsx-monitoring" entId="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" tid="9909" level="FATAL" eventState="On" eventFeatureName="edge_health" eventSev="critical" eventType="edge_nic_transmit_queue_overflow"] Edge NIC fp-eth2 transmit queue 15 has overflowed by 100.000000% on Edge node xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx. The missed packet count is 15855 and processed packet count is 0.
  • "NIC fp-ethX queue X TX hang detected" messages:
2024-04-05T19:44:09.497Z Edge1 NSX 9458 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats43" level="WARN"] NIC fp-eth0 queue 1 TX hang detected


The NSX UI may also report "Edge NIC Transmit Queue Overflow" alarms. These can be safely ignored, or suppressed if required.
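
When reviewing syslog after the workaround, it can help to separate overflow alarms that report a processed packet count of 0 (the ignorable case described above) from any that do not. The following Python sketch is one possible way to do this. It assumes the alarm text matches the sample message in this article; the function names are illustrative and are not part of NSX.

```python
import re

# Pattern derived from the sample "edge_nic_transmit_queue_overflow" message.
ALARM_RE = re.compile(
    r"Edge NIC (?P<nic>\S+) transmit queue (?P<queue>\d+) has overflowed by "
    r"[\d.]+% .*?"
    r"missed packet count is (?P<missed>\d+) "
    r"and processed packet count is (?P<processed>\d+)"
)


def parse_overflow_alarm(line):
    """Return the alarm fields as a dict, or None if the line is not an overflow alarm."""
    match = ALARM_RE.search(line)
    if match is None:
        return None
    return {
        "nic": match["nic"],
        "queue": int(match["queue"]),
        "missed": int(match["missed"]),
        "processed": int(match["processed"]),
    }


def is_expected_after_workaround(line):
    """True only for overflow alarms whose processed packet count is 0."""
    alarm = parse_overflow_alarm(line)
    return alarm is not None and alarm["processed"] == 0
```

An alarm with a non-zero processed packet count falls outside the case this article describes and may deserve a closer look.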

This change does not persist across a reboot or a datapath restart.

For a persistent workaround, install the scripts attached to this KB.

Script Installation

1) Copy the two scripts to the same folder on the Edge, for example:

# mkdir /image/disable_nic_hung_check
# cd /image/disable_nic_hung_check
# ls -lt
-rw-r--r-- 1 root root  804 Apr  10 04:48 cron_helper.sh
-rwxr-xr-x 1 root root 4579 Apr  10 04:32 disable_nic_hung_check.py


2) Validate that the md5 checksums of both scripts match the following output:

# md5sum disable_nic_hung_check.py
20257cba75944db1a4424fd582a35f8e  disable_nic_hung_check.py
# md5sum cron_helper.sh
39793e102e35f2bb212f31e3ffa6096b  cron_helper.sh
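
The checksum comparison in step 2 can also be automated. The following Python sketch compares both files against the values published above. It assumes a Python 3 interpreter is available and that both files are in the given directory; the function names are illustrative.

```python
import hashlib
import os

# Checksums published in step 2 of this article.
EXPECTED_MD5 = {
    "disable_nic_hung_check.py": "20257cba75944db1a4424fd582a35f8e",
    "cron_helper.sh": "39793e102e35f2bb212f31e3ffa6096b",
}


def md5_of(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched(directory="."):
    """Return the script names that are missing or whose MD5 differs."""
    bad = []
    for name, expected in EXPECTED_MD5.items():
        try:
            actual = md5_of(os.path.join(directory, name))
        except OSError:
            actual = None  # a missing or unreadable file counts as a mismatch
        if actual != expected:
            bad.append(name)
    return bad
```

An empty list from mismatched("/image/disable_nic_hung_check") would indicate that both files match. Any other result suggests downloading the attachments again before continuing.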


3) Install the script
# cd /image/disable_nic_hung_check
# sh cron_helper.sh

no crontab for root
no crontab for root

This copies the Python script to a permanent location and creates two cron jobs. The "no crontab for root" messages are expected when root has no existing crontab.

4) Confirm installation
The file now exists in the permanent location:
# ls -lt /opt/vmware/nsx-edge/bin/disable_nic_hung_check.py
-rwxr-xr-x 1 root root 4579 Apr  10 04:49 /opt/vmware/nsx-edge/bin/disable_nic_hung_check.py

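Two cron jobs have been created. Because cron schedules at one-minute granularity, the second entry sleeps for 30 seconds first, so together they run the script every 30 seconds: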
# crontab -l
* * * * * /opt/vmware/nsx-edge/bin/disable_nic_hung_check.py
* * * * * sleep 30; /opt/vmware/nsx-edge/bin/disable_nic_hung_check.py


Operational Validation

Confirm that cron is running the script:
# grep "CRON.*disable" /var/log/syslog
2024-04-10T10:44:01.662Z edge01.corp.local CRON 3870920 - -  (root) CMD (/opt/vmware/nsx-edge/bin/disable_nic_hung_check.py)
2024-04-10T10:44:01.538Z edge01.corp.local CRON 3870919 - -  (root) CMD (sleep 30; /opt/vmware/nsx-edge/bin/disable_nic_hung_check.py)
2024-04-10T10:45:01.432Z edge01.corp.local CRON 3871473 - -  (root) CMD (sleep 30; /opt/vmware/nsx-edge/bin/disable_nic_hung_check.py)
2024-04-10T10:45:01.073Z edge01.corp.local CRON 3871483 - -  (root) CMD (/opt/vmware/nsx-edge/bin/disable_nic_hung_check.py)
2024-04-10T10:46:01.837Z edge01.corp.local CRON 3871979 - -  (root) CMD (/opt/vmware/nsx-edge/bin/disable_nic_hung_check.py)
2024-04-10T10:46:01.762Z edge01.corp.local CRON 3871980 - -  (root) CMD (sleep 30; /opt/vmware/nsx-edge/bin/disable_nic_hung_check.py)



If the script detects a reboot or a datapath service restart, it disables the feature again and logs the following to /var/log/syslog:

2024-04-10T10:44:01.803Z edge01 NSX 3870922 - [nsx@6876 comp="nsx-edge" subcomp="disable-nic-hung" username="root" level="INFO"] Datapathd bootup/restart detected. Disabled NIC TX hung reset feature...

The node will continue to log the "edge_nic_transmit_queue_overflow" and "TX hang detected" messages after the script is installed. The NSX UI may continue to report "Edge NIC Transmit Queue Overflow" alarms. These can be safely ignored.


Script uninstallation

1) Validate that the cron entries are present

# crontab -l
* * * * * /opt/vmware/nsx-edge/bin/disable_nic_hung_check.py
* * * * * sleep 30; /opt/vmware/nsx-edge/bin/disable_nic_hung_check.py

2) In the example output from step 1, the only crontab entries are for the disable_nic_hung_check.py workaround script, so all of them can be removed with one command:

# crontab -r
# crontab -l
no crontab for root


If other crontab entries are present, do not use crontab -r, because it deletes all of them.
Instead, use crontab -e and delete only the two entries relating to the disable_nic_hung_check.py script.
crontab -e opens a vi editor, where the "dd" command deletes a line and :wq saves and quits.
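
As an alternative to editing by hand, the two entries can be filtered out programmatically while other entries are kept. The following Python sketch is one possible approach, not part of the attached scripts. It assumes Python 3.7 or later is available and that the installed crontab command accepts "crontab -" to read a new crontab from standard input; the function names are illustrative.

```python
import subprocess

SCRIPT_NAME = "disable_nic_hung_check.py"


def strip_workaround_entries(crontab_text):
    """Return the crontab text without lines that reference the workaround script."""
    kept = [line for line in crontab_text.splitlines() if SCRIPT_NAME not in line]
    return "\n".join(kept) + ("\n" if kept else "")


def remove_workaround_from_crontab():
    """Rewrite the current user's crontab without the workaround entries."""
    current = subprocess.run(["crontab", "-l"], capture_output=True, text=True)
    if current.returncode != 0:
        return  # typically "no crontab for root": nothing to remove
    remaining = strip_workaround_entries(current.stdout)
    if remaining:
        # Install the filtered crontab from standard input.
        subprocess.run(["crontab", "-"], input=remaining, text=True, check=True)
    else:
        # Only the workaround entries were present, so remove the crontab.
        subprocess.run(["crontab", "-r"], check=True)
```

Running crontab -l afterwards, as in step 2, confirms the result.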

Attachments

cron_helper
disable_nic_hung_check