Symptoms:
VMware NSX 3.2.0.1 and newer
Any compatible ESXi version (vSAN File Services were introduced in ESXi 7.0)
The issue can be described in multiple steps:
1. The container startup scripts call arping with the -A option, which is the wrong option here: it generates an ARP reply packet rather than an ARP request.
2. Because the replication mode is MTEP replication, and the VTEPs of the two Edges are in a different subnet from the ESXi hosts (as per best practice), the ESXi host picks one Edge to perform MTEP replication.
3. When that Edge receives the ARP reply, it updates its MAC-to-VTEP mapping. Because this is an ARP reply with a unicast target MAC address in the payload, the Edge does not use the routing-domain for replication and instead uses the infrastructure (logical switch/segment). As a result, the ARP reply is not replicated to the other Edge in the cluster, which never learns the updated MAC-to-VTEP mapping, and this causes the issue.
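To check which ARP message type is actually on the wire, a packet capture can filter on the ARP opcode. The sketch below is illustrative only: arp_filter is a hypothetical helper name, and it assumes tcpdump is available on a machine that sees the traffic (it may not be present inside the container). The opcode occupies bytes 6-7 of the ARP header, where 1 is a request and 2 is a reply.

```shell
#!/bin/sh
# Hypothetical helper (not part of NSX or vSAN): prints a pcap filter that
# matches ARP packets by opcode (bytes 6-7 of the ARP header).
arp_filter() {
    case "$1" in
        request) echo 'arp[6:2] = 1' ;;   # opcode 1 = ARP request
        reply)   echo 'arp[6:2] = 2' ;;   # opcode 2 = ARP reply
        *)       echo "usage: arp_filter request|reply" >&2; return 1 ;;
    esac
}

# Example (not executed here): show ARP replies on eth0 with link-level headers.
# tcpdump -n -e -i eth0 "$(arp_filter reply)"
```

If the capture shows ARP replies at container startup and no ARP requests, that is consistent with step 1 above.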
Because the startup script cannot be modified, the workaround is to manually run the following command inside the container. In this example, 192.168.1.1 is the IP address of the eth0 interface of the vSAN FS container VM; replace it with the actual address.
/usr/sbin/arping -b -c 1 -U -I eth0 192.168.1.1
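When the command has to be run on several containers, a small wrapper can reduce typing mistakes in the address. The following is a minimal sketch, not a supported tool: build_arping_cmd is a hypothetical helper that only prints the command line for a given interface and address, so nothing is sent until the printed line is run deliberately. The address check is intentionally loose (digits and dots in a dotted-quad shape) and does not validate each octet.

```shell
#!/bin/sh
# Hypothetical helper: prints the arping workaround command for an
# interface and IPv4 address. It does not execute arping itself.
build_arping_cmd() {
    iface="$1"
    addr="$2"
    # Loose shape check: only digits and dots, exactly three dots.
    case "$addr" in
        ''|*[!0-9.]*) echo "invalid IPv4 address: $addr" >&2; return 1 ;;
        *.*.*.*.*)    echo "invalid IPv4 address: $addr" >&2; return 1 ;;
        *.*.*.*)      ;;
        *)            echo "invalid IPv4 address: $addr" >&2; return 1 ;;
    esac
    # -U sends an unsolicited ARP request instead of the ARP reply sent by -A.
    printf '/usr/sbin/arping -b -c 1 -U -I %s %s\n' "$iface" "$addr"
}

# Example: print the command for eth0 (replace the address with the real one).
build_arping_cmd eth0 192.168.1.1
```

The printed line can be reviewed and then pasted into the container shell.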
The issue is fixed in NSX version 4.2.1.
The issue appears more likely to occur with NSX Bare Metal Edges.