Cloud VMs on an HCX Extended Network (without MON) abruptly lose connectivity to the on-prem gateway/VMs when the on-prem NE and/or the cloud VM is vMotioned.
This behavior is caused by the on-prem NSX-enabled DVS using multiple uplinks (non-LAG) together with VM ports configured with a mix of ‘mac-learning’ and ‘sink port’ configurations. When a user extends a network using HCX, HCX first looks at the type of port group being extended. If it is an ESX DV Port-Group, the NE adapter servicing the L2E is created with the ‘sink port’ configuration.
If the network being extended is an NSX VLAN segment, the NE adapter is created with the ‘mac-learning’ configuration.
Because the DVS contains a mix of ‘sink port’ and ‘mac-learning’ enabled ports and has multiple uplinks (non-LAG), the MAC address of the cloud workload VM is learned from the uplink, and the reply from the on-prem gateway/VM is not forwarded to the NE VM as expected.
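The selection logic and the triggering condition described above can be sketched as follows. This is only an illustrative model of the behavior as described in this article, not HCX or NSX code; the names `PortGroupType`, `ne_adapter_mode`, and `dvs_is_exposed` are hypothetical.

```python
from enum import Enum


class PortGroupType(Enum):
    """Types of network that can be extended (hypothetical names)."""
    DV_PORTGROUP = "ESX DV Port-Group"
    NSX_VLAN_SEGMENT = "NSX VLAN segment"


def ne_adapter_mode(pg_type):
    """Return the mode the NE adapter is created with, per the text above."""
    if pg_type is PortGroupType.DV_PORTGROUP:
        return "sink port"
    if pg_type is PortGroupType.NSX_VLAN_SEGMENT:
        return "mac-learning"
    raise ValueError(f"unsupported port group type: {pg_type!r}")


def dvs_is_exposed(extended_types, uplink_count, uses_lag):
    """True when the DVS matches the conditions described for this issue:
    a mix of sink-port and mac-learning ports plus multiple non-LAG uplinks."""
    modes = {ne_adapter_mode(t) for t in extended_types}
    mixed = modes == {"sink port", "mac-learning"}
    return mixed and uplink_count > 1 and not uses_lag


if __name__ == "__main__":
    extended = [PortGroupType.DV_PORTGROUP, PortGroupType.NSX_VLAN_SEGMENT]
    print(dvs_is_exposed(extended, uplink_count=2, uses_lag=False))
```

The model only restates the conditions listed in this article; it does not inspect a live environment.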
This issue is resolved in VMware NSX 3.2.4
This issue is resolved in VMware NSX 4.2.0
Workaround:
There are two ways to mitigate this issue.
To check the MAC learning and sink port status of a VM port, use the commands below. First, list the switch ports to find the port ID:
nsxdp-cli vswitch instance list
DvsPortset-2 (HCX-NVDS) 46 6a 50 7c 88 3c 48 c7-b5 ca ## ## ## ## ## ##
Total Ports:3840 Available:3813
Client            PortID      DVPortID                              MAC                Uplink
Management        13421####                                         00:00:00:00:00:00  n/a
vmnic6            228170####  7bd730a1-####-####-####-32afb3c4e338  00:00:00:00:00:00
Shadow of vmnic6  13421####                                         00:50:56:##:##:##  n/a
vmnic7            228170####  64488cd9-####-####-####-bcbaaeecfd13  00:00:00:00:00:00
Shadow of vmnic7  13421####                                         00:50:56:##:##:##  n/a
vmk10             13421####   c422dbef-####-####-####-e5b5e386cab2  00:50:56:##:##:##  vmnic6
vmk50             13421####   5d71d732-####-####-####-9e4d81e0e517  00:50:56:##:##:##  void
vdr-vdrPort       13421####   vdrPort                               02:50:56:##:##:##  vmnic6
The following command shows the MAC learning status:
nsxdp-cli vswitch mac-learning port get -p 7bd730a1-####-####-####-32afb3c4e338 -dvs HCX-NVDS
MAC Learning: False
Unknown Unicast Flooding: False
MAC Limit: 4096
MAC Limit Policy: ALLOW
The following command shows the sink port status:
nsxdp-cli vswitch sink get -p 7bd730a1-####-####-####-32afb3c4e338 --dvs HCX-NVDS
disabled
NOTE: Be sure to use the port ID associated with the VM_eth interface that uses the on-prem port group.
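When several ports need to be checked, the text output of the two status commands can be summarized with a small helper. The sketch below is a minimal example and not a supported tool: the function names are hypothetical, the MAC learning output is assumed to keep the `Key: Value` layout shown above, and the string `enabled` for an active sink port is an assumption (this article only shows `disabled`).

```python
def parse_mac_learning(output):
    """Parse 'Key: Value' lines from the mac-learning status output into a dict."""
    settings = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that do not contain a colon
            settings[key.strip()] = value.strip()
    return settings


def port_modes(mac_learning_output, sink_output):
    """Summarize whether MAC learning and sink port are active on one port.

    Assumption: the sink status command prints 'enabled' when the sink
    port is active; only 'disabled' is shown in this article.
    """
    settings = parse_mac_learning(mac_learning_output)
    return {
        "mac_learning": settings.get("MAC Learning", "").lower() == "true",
        "sink_port": sink_output.strip().lower() == "enabled",
    }


if __name__ == "__main__":
    mac_out = (
        "MAC Learning: False\n"
        "Unknown Unicast Flooding: False\n"
        "MAC Limit: 4096\n"
        "MAC Limit Policy: ALLOW\n"
    )
    print(port_modes(mac_out, "disabled\n"))
```

Collecting this summary for every port on the DVS makes it easier to spot a mix of ‘mac-learning’ and ‘sink port’ ports.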
To confirm whether the ARP packet is forwarded to the VM correctly, use the commands below to capture packets first on the ESXi physical uplink and then on the VM network adapter.
pktcap-uw --uplink vmnic1 --capture UplinkSndKernel,UplinkRcvKernel --ip #.#.#.# -o - | tcpdump-uw -r - -ean
Note: Use the nsxdp-cli command to verify which vmnic the VM adapter is using.
The following command performs a trace at the VM adapter level. Confirm the switchport ID before running it.
pktcap-uw --switchport 5033#### --capture VnicTx,VnicRx --ip #.#.#.# -o - | tcpdump-uw -r - -ean
Impact/Risks:
The reply from the on-prem gateway/VM is not forwarded to the on-prem NE VM as expected. This can be observed with a packet trace: the ARP response from the gateway reaches the physical ESXi vmnic, but a trace on the VM interface/port ID shows that the ARP response was not forwarded from the physical NIC to the VM network adapter.
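The comparison of the two traces can be expressed as a simple check. The sketch below is a minimal example that assumes the captures were saved as text lines and that ARP replies appear in the usual tcpdump form `Reply <ip> is-at <mac>`. The function names are hypothetical, and the sample lines are made up for illustration, not real capture data.

```python
def arp_reply_seen(capture_lines, sender_ip):
    """True if any capture line shows an ARP reply from sender_ip."""
    needle = f"Reply {sender_ip} is-at"
    return any(needle in line for line in capture_lines)


def matches_symptom(uplink_lines, vnic_lines, gateway_ip):
    """True when the ARP reply reaches the physical uplink but is
    missing from the trace taken on the VM adapter."""
    on_uplink = arp_reply_seen(uplink_lines, gateway_ip)
    on_vnic = arp_reply_seen(vnic_lines, gateway_ip)
    return on_uplink and not on_vnic


if __name__ == "__main__":
    # Illustrative lines only (documentation address range, invented MACs).
    uplink = [
        "00:50:56:aa:bb:cc > 00:50:56:dd:ee:ff, ethertype ARP (0x0806), "
        "length 60: Reply 192.0.2.1 is-at 00:50:56:aa:bb:cc, length 46",
    ]
    vnic = []
    print(matches_symptom(uplink, vnic, "192.0.2.1"))
```

A `True` result only indicates that the traces are consistent with the behavior described here; it does not prove the root cause.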