"Edge NIC Transmit Queue Overflow" alarms
This issue is resolved in VMware NSX 4.2.0.
Workaround:
To work around the issue immediately, disable the NIC reset feature.
On the Bare Metal Edge, as the root user, run:
# edge-appctl -t /var/run/vmware/edge/dpd.ctl stats/hung_nic_reset disable
Note: after applying the workaround, the Bare Metal Edge will continue to log the following messages in syslog, which can be safely ignored.
"edge_nic_transmit_queue_overflow" alarm with processed packet count as 0. This can be safely ignored.
2024-04-05T19:58:43.372Z Edge1 NSX 9458 - [nsx@6876 comp="nsx-edge" s2comp="nsx-monitoring" entId="########-####-####-####-##########" tid="9909" level="FATAL" eventState="On" eventFeatureName="edge_health" eventSev="critical" eventType="edge_nic_transmit_queue_overflow"] Edge NIC fp-eth2 transmit queue 15 has overflowed by 100.000000% on Edge node ########-####-####-####-##########. The missed packet count is 15855 and processed packet count is 0.
"NIC fp-ethX queue X TX hang detected" messages. This can be safely ignored.
var/log/syslog:2024-04-05T19:44:09.497Z Edge1 NSX 9458 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats43" level="WARN"] NIC fp-eth0 queue 1 TX hang detected
Also, the NSX UI may report "Edge NIC Transmit Queue Overflow" alarms. These can be safely ignored or can be suppressed if required.
This change does not persist across a reboot or a datapath restart.
For a persistent workaround, install the script attached to this KB.
Script Installation
# ls -lt /opt/vmware/nsx-edge/bin/disable_nic_hung_check.py
-rwxr-xr-x 1 root root 4579 Apr 10 04:49 /opt/vmware/nsx-edge/bin/disable_nic_hung_check.py
# crontab -l
* * * * * /opt/vmware/nsx-edge/bin/disable_nic_hung_check.py
* * * * * sleep 30; /opt/vmware/nsx-edge/bin/disable_nic_hung_check.py
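Cron cannot schedule more often than once per minute, so the two entries together (one immediate, one delayed by `sleep 30`) run the script roughly every 30 seconds. The listing above shows the resulting crontab but not how the entries were added. The following is a minimal sketch of one way to add them without overwriting an existing root crontab; the function name `add_hung_check_entries` is hypothetical and not part of the KB.

```shell
#!/bin/sh
SCRIPT=/opt/vmware/nsx-edge/bin/disable_nic_hung_check.py

# Hypothetical helper: read an existing crontab on stdin and print it with
# the two entries appended, skipping any entry that is already present.
add_hung_check_entries() {
    existing=$(cat)
    [ -n "$existing" ] && printf '%s\n' "$existing"
    for entry in "* * * * * $SCRIPT" "* * * * * sleep 30; $SCRIPT"; do
        # -x matches whole lines only, -F treats the entry as a literal string
        printf '%s\n' "$existing" | grep -qxF -- "$entry" || printf '%s\n' "$entry"
    done
}

# Example use as root, left commented out so that sourcing this file
# changes nothing:
# crontab -l 2>/dev/null | add_hung_check_entries | crontab -
```

Because entries that already exist are skipped, running the helper a second time leaves the crontab unchanged.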
Operational Validation
Confirm that cron is running the script:
# grep CRON.*disable /var/log/syslog
2024-04-10T10:44:01.662Z edge01.corp.local CRON 3870920 - - (root) CMD (/opt/vmware/nsx-edge/bin/disable_nic_hung_check.py)
2024-04-10T10:44:01.538Z edge01.corp.local CRON 3870919 - - (root) CMD (sleep 30; /opt/vmware/nsx-edge/bin/disable_nic_hung_check.py)
2024-04-10T10:45:01.432Z edge01.corp.local CRON 3871473 - - (root) CMD (sleep 30; /opt/vmware/nsx-edge/bin/disable_nic_hung_check.py)
2024-04-10T10:45:01.073Z edge01.corp.local CRON 3871483 - - (root) CMD (/opt/vmware/nsx-edge/bin/disable_nic_hung_check.py)
2024-04-10T10:46:01.837Z edge01.corp.local CRON 3871979 - - (root) CMD (/opt/vmware/nsx-edge/bin/disable_nic_hung_check.py)
2024-04-10T10:46:01.762Z edge01.corp.local CRON 3871980 - - (root) CMD (sleep 30; /opt/vmware/nsx-edge/bin/disable_nic_hung_check.py)
If the script detects a reboot or datapath service restart, it will disable the feature and log to /var/log/syslog:
2024-04-10T10:44:01.803Z edge01 NSX 3870922 - [nsx@6876 comp="nsx-edge" subcomp="disable-nic-hung" username="root" level="INFO"] Datapathd bootup/restart detected. Disabled NIC TX hung reset feature...
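The behavior described above (detect a restart, re-apply the workaround, log once) can be illustrated with a short shell sketch. This is not the script attached to the KB, and how that script detects a restart is not stated here. The sketch assumes that the datapath process is named `datapathd` and that its PID plus start time changes on every restart; the state file path, the function names, and the `logger` tag and message are all hypothetical.

```shell
#!/bin/sh
# Illustrative only: NOT the script attached to the KB.
# Hypothetical state file; /var/run is normally cleared on reboot.
STATE_FILE=${STATE_FILE:-/var/run/nic_hung_check.state}
# The disable command from the Workaround section.
DISABLE_CMD=${DISABLE_CMD:-edge-appctl -t /var/run/vmware/edge/dpd.ctl stats/hung_nic_reset disable}

# Print an identifier that changes whenever the datapath process restarts.
# Assumption: the process is named "datapathd".
datapath_instance_id() {
    pid=$(pgrep -o -x datapathd) || return 1
    printf '%s %s\n' "$pid" "$(ps -o lstart= -p "$pid")"
}

# Re-apply the workaround only when the identifier passed as $1 differs
# from the one recorded on the previous run.
reapply_if_restarted() {
    current=$1
    previous=$(cat "$STATE_FILE" 2>/dev/null)
    [ "$current" = "$previous" ] && return 0
    $DISABLE_CMD || return 1
    printf '%s\n' "$current" > "$STATE_FILE"
    logger -t disable-nic-hung "datapath restart detected, NIC TX hung reset disabled"
}

# Example use from cron, commented out:
# id=$(datapath_instance_id) && reapply_if_restarted "$id"
```

Recording the identifier means the disable command runs once per restart rather than on every cron invocation, which is consistent with the single log line shown above.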
After the script is applied, the node will continue to log the "edge_nic_transmit_queue_overflow" and "TX hang detected" messages, and the NSX UI may continue to report "Edge NIC Transmit Queue Overflow" alarms. These can be safely ignored.
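To check that the overflow alarms still being logged are the ignorable kind (processed packet count of 0), the syslog can be filtered. The following is a minimal sketch, assuming the alarm text matches the sample shown earlier in this article; the function name `count_zero_processed_overflows` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of the KB): count overflow alarm lines whose
# processed packet count is 0, assuming the message format of the sample alarm.
count_zero_processed_overflows() {
    # $1: path to a syslog file (defaults to /var/log/syslog)
    grep 'eventType="edge_nic_transmit_queue_overflow"' "${1:-/var/log/syslog}" \
        | grep -c 'processed packet count is 0\.'
}

# Example use, commented out:
# count_zero_processed_overflows /var/log/syslog
```

Overflow alarms with a non-zero processed packet count are not counted by this filter and are outside what this article describes as safe to ignore.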
Script Uninstallation
# crontab -l
* * * * * /opt/vmware/nsx-edge/bin/disable_nic_hung_check.py
* * * * * sleep 30; /opt/vmware/nsx-edge/bin/disable_nic_hung_check.py
# crontab -r
# crontab -l
no crontab for root
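Note that `crontab -r` removes the entire root crontab, including any unrelated entries. Where other root cron jobs must be preserved, only the two script entries can be filtered out instead. The following is a minimal sketch; the function name `remove_hung_check_entries` is hypothetical and not part of the KB.

```shell
#!/bin/sh
SCRIPT=/opt/vmware/nsx-edge/bin/disable_nic_hung_check.py

# Hypothetical helper: read a crontab on stdin and print it without any
# line that references the script. grep -v exits non-zero when nothing is
# left to print, so "|| true" keeps the function's status at 0.
remove_hung_check_entries() {
    grep -vF -- "$SCRIPT" || true
}

# Example use as root, commented out:
# crontab -l | remove_hung_check_entries | crontab -
```

If the two script entries were the only ones, this leaves an empty crontab for root rather than none, so `crontab -l` prints nothing instead of "no crontab for root".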