Intermittent connectivity issues to VM due to ARP learning issues caused by stale IP/MAC mapping

Article ID: 399902

Products

VMware NSX

Issue/Introduction

  • Whenever an ARP entry times out, traffic destined for VM1 is forwarded to VM2, which resides on the same segment.
  • Packet captures on the pNIC of the source Edge node or Transport Node (TN) for packets sent to VM1 show that the inner destination MAC address does not match the expected value.
  • The inner destination MAC address is the MAC address of VM2 instead of the expected address of VM1.
  • VM1 and VM2 usually have something in common. For instance, VM1 and VM2 may be part of a third-party load balancer application that is configured in a high availability group.
  • While the issue is occurring, pinging VM1 resolves the connectivity issue.
  • Log lines similar to the following appear in /var/log/syslog on the source Edge, indicating that it is learning the incorrect MAC address from NSX 'ARP Suppression':

<EDGE FQDN> NSX 9061 SWITCHING [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="arp" level="INFO"] ARP reply received for <VM1 IP> from <VM2 MAC Address> on lrouter port <LR Port ID>
<EDGE FQDN> NSX 9061 SWITCHING [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="lswitch" level="INFO"] ARP reply sent to <VDR MAC Address> for <VM1 IP> from <VM2 MAC Address> by ARP Suppression

  • Log lines similar to the following appear in /var/log/syslog on the source Edge or Transport Node (TN), indicating that an ARP entry has been added that maps the IP address of VM1 to the MAC address of VM2:

    <EDGE FQDN> NSX 1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="lswitch" level="INFO"] add ip-mac to lswitch <Logical Switch UUID>: VM1 IP/VM2 MAC Address
    <EDGE FQDN> NSX 9061 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dpc-pb" tname="dp-ipc31" level="INFO"] Add ARP entry (VM1 IP, VM2 MAC Address) to lswitch <Logical Switch UUID>
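
To check quickly whether messages like those quoted above are present, the syslog can be filtered for the affected IP address. The following is a minimal sketch rather than an official procedure: the function name `find_arp_mapping_logs` and its arguments are illustrative, and the message patterns are taken only from the sample log lines in this article, so they may need adjusting for other NSX versions.

```shell
#!/bin/sh
# Sketch: list syslog lines where an ARP reply was received or sent for a given
# IP, or where an ARP entry was added for it.
# The function name and arguments are hypothetical, not part of NSX.
find_arp_mapping_logs() {
    ip="$1"                          # IP address of the affected VM (VM1)
    logfile="${2:-/var/log/syslog}"  # defaults to the path named in this article

    # First keep only lines that contain the IP as a whole token
    # (-F: literal string, -w: word match, so 10.0.0.5 does not match 10.0.0.50),
    # then keep only the message types shown in the sample log lines.
    grep -F -w -- "$ip" "$logfile" |
        grep -E 'ARP reply (received|sent)|add ip-mac to lswitch|Add ARP entry'
}
```

For example, `find_arp_mapping_logs <VM1 IP>` prints the matching lines so that the MAC address in each one can be compared with the MAC address expected for VM1.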

Environment

VMware NSX
VMware NSX-T Data Center

Cause

  • A stale ARP record remains in the NestDB of the Edge Node or Transport Node. Normally, the Central Control Plane (CCP) instructs the Transport Node or Edge Node to remove the old ARP entry from its NestDB.
  • Due to a known issue, the instruction to remove the stale ARP record can fail to be sent.
  • Each time a host's ARP cache entry for the IP times out and the host sends an ARP request, ARP suppression replies with the stale MAC address. This continues until the stale entry is removed from the NestDB.

Resolution

Workaround:

  • Restart the proxy service on the Edge Node or Transport Node that holds the stale ARP entry by running the following command as the root user:

/etc/init.d/nsx-proxy restart

  • A fix is available in versions 3.2.4, 4.1.1, and later.

Additional Information

You can validate if the stale ARP entry exists in the NestDB using the following steps:

  1. Log on to the Edge CLI as the root user, or log on as admin and switch to root using the 'st eng' command.
  2. Run the following command to enter the nestdbcli:
    - /opt/vmware/nsx-nestdb/bin/nestdb-cli --beautify --json
  3. Run the following command to dump data about the logical switch that the affected VMs are connected to:
    - get vmware.nsx.nestdb.LogSwitchFibMsg {"id":"Logical Switch UUID"} 
  4. Copy the command output into a text editor and search it for duplicates of the IP address of interest. Only a single entry should reference the IP address, and that entry should be mapped to the expected MAC address.
  5. Enter the following command to leave the nestdbcli:
    - quit
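
The manual search in step 4 can also be done with standard shell tools. The sketch below assumes that the output from step 3 has already been saved to a text file; the file argument and the function name `show_ip_entries` are hypothetical. It makes no assumption about the structure of the NestDB output. It only counts the lines that mention the IP address and prints them with surrounding context so that the associated MAC addresses can be compared by eye.

```shell
#!/bin/sh
# Sketch: show every place an IP address appears in a saved NestDB dump.
# The function name and the dump file are hypothetical; the dump is assumed to
# be the saved output of the 'get' command from step 3.
show_ip_entries() {
    ip="$1"     # IP address of interest
    dump="$2"   # text file holding the saved command output

    # Count lines that contain the IP as a whole token
    # (-F: literal string, -w: word match, so 10.0.0.5 does not match 10.0.0.50).
    count=$(grep -c -F -w -- "$ip" "$dump" || true)
    echo "Lines referencing $ip: $count"

    # Print each hit with its line number and three lines of context on each side.
    grep -n -F -w -C 3 -- "$ip" "$dump" || true

    # Return success only when the IP appears on at most one line.
    [ "$count" -le 1 ]
}
```

A non-zero return status means the IP address appears on more than one line. That is a prompt to inspect the printed context, not proof of a stale entry, because a single valid entry could mention the address more than once.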