ESXi host configured as an NSX transport node experiences unexpected and intermittent outages, possibly at specific times of day.
Article ID: 427674
Products
VMware NSX
Issue/Introduction
NSX is Federated with at least two Local Manager sites.
An NSX-prepared ESXi host experiences a partial or complete outage on the uplink supporting its NSX TEP.
This outage is not observed on uplinks not used by NSX.
A packet capture taken with pktcap-uw on the uplink used by the NSX TEP during the outage, viewed in Wireshark, shows the host receiving a packet sourced from north of NSX and destined for a VM at a different NSX Federation site. Each of these packets is followed by a large number of packets that Wireshark flags as TCP Retransmissions.
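The capture described above could be taken with something like the following sketch. The uplink name (vmnic0) and the output file name are placeholders that do not come from this article; substitute the vmnic that carries the NSX TEP. The --ng option (pcapng output, which preserves the capture point in the packet comments) is assumed to be available on the ESXi build in use.

```shell
# Placeholder values: replace with the uplink that backs the NSX TEP
# and a datastore path with enough free space.
UPLINK="vmnic0"
OUTFILE="/path/to/filesystem/tep_uplink.pcapng"

# Capture both directions on the uplink:
#   UplinkRcvKernel = packets entering the host
#   UplinkSndKernel = packets leaving the host
# --ng writes pcapng so Wireshark can show the capture point per packet.
CAPTURE_CMD="pktcap-uw --uplink ${UPLINK} --capture UplinkRcvKernel,UplinkSndKernel --ng -o ${OUTFILE}"

# Print the command for review before running it on the host.
echo "${CAPTURE_CMD}"
```

Run the printed command on the affected host while the outage is occurring and stop it with Ctrl+C. Captures on a busy uplink grow quickly, so keep the capture window short.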
Wireshark example: Inbound packet seen at UplinkRcvKernel. The Src and Dst IPs under the "Internet Protocol Version 4" header correspond to the Edge TEP IP and the affected Host TEP IP, respectively.
Outbound packet seen at UplinkSndKernel. Here the "Internet Protocol Version 4" header shows the affected Host TEP IP as the Src and the TEP IPs of other hosts in the same subnet as the Dst.
Looking closer in Wireshark, the Packet comments show the first packet entering the host at UplinkRcvKernel, while each of the TCP Retransmissions is seen at UplinkSndKernel as the host sends it out.
Further, the Geneve outer header shows that each of the TCP Retransmissions is destined to the TEP IP of another NSX-prepared ESXi host on the same TEP subnet.
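To summarize which TEPs the retransmissions are being sent to, the capture can also be processed offline with tshark (the Wireshark command-line tool) on a workstation. This is a rough sketch only: it assumes tshark is installed, the capture file path is a placeholder, and count_by_dst is a helper name introduced here for illustration.

```shell
# Placeholder path to the capture copied off the ESXi host.
PCAP="/path/to/filesystem/tep_uplink.pcapng"

# Count how often each value appears on stdin, most frequent first.
count_by_dst() {
    sort | uniq -c | sort -rn
}

# Only runs when tshark and the capture file are both present.
if command -v tshark >/dev/null 2>&1 && [ -r "$PCAP" ]; then
    # -Y keeps Geneve packets that Wireshark flags as TCP retransmissions.
    # -e ip.dst with occurrence=f prints only the first destination IP,
    # which is the outer (TEP) header rather than the inner VM address.
    tshark -r "$PCAP" -Y 'geneve && tcp.analysis.retransmission' \
        -T fields -e ip.dst -E occurrence=f | count_by_dst
fi
```

If the behavior matches this article, the output should list the TEP IPs of the other hosts on the same TEP subnet, each with a count of retransmitted packets.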
Viewing net-stats output (example command below) for the affected vmnic during the time of the issue may show a very large number of 'txeps' (Transmit Errors per Second).
net-stats -i 30 -ticqQWS -A > /path/to/filesystem/netstats
At the time of this writing, it is unknown why the NSX Edge sent this traffic south as BUM traffic rather than across the RTEP to the other Federated site where the destination VM resides.
Resolution
Workaround
This behavior stopped occurring once the NSX Edges were upgraded to version 4.2.3.2.