Intermittent reachability issues observed between on-prem (source) and cloud VM (target) when MON is enabled for HCX cloud-to-cloud deployments
Article ID: 401566
Products
VMware HCX
Issue/Introduction
When MON (Mobility Optimized Networking) is enabled, intermittent packet loss is observed from the on-prem VM (source) to the cloud VM (target).
On the ESXi host where the source VM is registered, the neighbor table has no entry for the target IP address. First, determine the VDR UUID for the T1 Gateway by running the following command on that ESXi host:
nsxcli -c get logical-routers
Logical Routers Summary
------------------------------------------------------------------------------------------------------------------
VDR UUID                              LIF num  IPv4 Route num  IPv6 Route num  Max Neighbors  Current Neighbors
9aab434f-####-####-####-############  2        13              22              50000          10
221bf8a39-####-####-####-###########  5        6               9               50000          16
Querying the neighbor table with the T1 Gateway's VDR UUID returns nothing for the target IP address:
nsxcli -c get logical-router 221bf8a39-####-####-####-########### neighbor 192.168.100.10
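When several hosts need checking, the two lookups above can be post-processed with a small script. The following Python sketch assumes the summary output has the layout shown in the example (a UUID followed by five integer counters on each VDR row) and treats a neighbor lookup whose output does not contain the target IP as a missing entry. The helper names `parse_logical_routers` and `neighbor_missing` are hypothetical and not part of NSX; the script only parses text that has already been captured from `nsxcli`.

```python
def parse_logical_routers(output: str) -> list:
    """Parse captured `nsxcli -c get logical-routers` summary text into rows.

    Assumption: each VDR row has six whitespace-separated fields, the UUID
    followed by five integer counters, as in the sample output.
    """
    columns = ["lif_num", "ipv4_routes", "ipv6_routes",
               "max_neighbors", "current_neighbors"]
    rows = []
    for line in output.splitlines():
        fields = line.split()
        # Skip the title, separator, and header lines.
        if len(fields) != 6 or not all(f.isdigit() for f in fields[1:]):
            continue
        row = {"vdr_uuid": fields[0]}
        row.update({name: int(value) for name, value in zip(columns, fields[1:])})
        rows.append(row)
    return rows


def neighbor_missing(neighbor_output: str, target_ip: str) -> bool:
    """Return True if the target IP is absent from the neighbor lookup output."""
    return target_ip not in neighbor_output.split()


SAMPLE = """\
Logical Routers Summary
------------------------------------------------------------
VDR UUID  LIF num  IPv4 Route num  IPv6 Route num  Max Neighbors  Current Neighbors
9aab434f-####-####-####-############  2  13  22  50000  10
221bf8a39-####-####-####-###########  5  6  9  50000  16
"""

if __name__ == "__main__":
    for row in parse_logical_routers(SAMPLE):
        print(row["vdr_uuid"], row["current_neighbors"])
    # An empty lookup result means the neighbor entry is missing.
    print(neighbor_missing("", "192.168.100.10"))
```

The parser does not identify which VDR belongs to the T1 Gateway; that mapping still has to come from the NSX configuration.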
Environment
VMware HCX
Cause
This issue occurs due to a timing mismatch between ARP (Address Resolution Protocol) table entries held by the VDR instances on different ESXi hosts:
ARP Table Synchronization Issue: NSX maintains separate ARP tables for each logical router on each ESXi host. These tables can expire at different times.
Failed ARP Resolution Process:
When the ARP entry expires in the on-premises VM's host, it sends a broadcast ARP request.
The HCX Network Extension appliance's host intercepts this request.
If the ARP table on the NE appliance's ESXi host still has a valid entry, that host converts the broadcast into a unicast ARP request.
With MON enabled, this unicast packet is rewritten with a special HCX MAC address.
Packet Drop: The modified unicast ARP request is dropped by NSX's built-in security policy at the destination, preventing ARP resolution.
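The sequence above can be illustrated with a deliberately simplified model. The Python sketch below is not NSX or HCX code: the class `HostNeighborEntry`, the function `arp_outcome`, and the expiry times are hypothetical, and the two outcomes in which ARP resolves (the NE host's entry has also expired, or MON is disabled) are assumptions inferred from the description rather than documented behavior.

```python
from dataclasses import dataclass


@dataclass
class HostNeighborEntry:
    """Neighbor (ARP) entry for the target IP on one ESXi host's VDR."""
    expires_at: float  # seconds on a shared clock; illustrative value only

    def is_valid(self, now: float) -> bool:
        return now < self.expires_at


def arp_outcome(source_entry, ne_entry, now, mon_enabled):
    """Classify what happens at time `now` in this simplified model."""
    if source_entry.is_valid(now):
        return "no ARP needed"
    # The source host's entry has expired, so it sends a broadcast ARP request.
    if not ne_entry.is_valid(now):
        # Assumption: with no valid entry on the NE host, the broadcast is not converted.
        return "broadcast forwarded, ARP resolves"
    # The NE host still has a valid entry and converts the broadcast to unicast.
    if mon_enabled:
        # The unicast request is rewritten with the HCX MAC and dropped at the destination.
        return "unicast rewritten, dropped"
    # Assumption: without MON there is no rewrite, so the request is not dropped.
    return "unicast forwarded, ARP resolves"


if __name__ == "__main__":
    source = HostNeighborEntry(expires_at=100)
    ne_host = HostNeighborEntry(expires_at=400)
    for t in (50, 200, 500):
        print(t, arp_outcome(source, ne_host, t, mon_enabled=True))
```

In this model the failure window is the interval during which the source host's entry has expired but the NE host's entry has not, which matches the intermittent nature of the packet loss.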
Resolution
This is a known issue impacting VMware HCX.
This condition only exists in the following scenarios:
Short-lived connections which allow the VDR neighbor table to age out.
When the source (on-prem) site uses NSX and MON is enabled.
Workarounds:
Disable MON.
Run a continuous ping between the source and destination VMs, or establish some other connection that is constant rather than short-lived (less than 600 seconds).
Create VM affinity rules to pin the VMs to the ESXi host that runs the NE-I appliance.
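The continuous-ping workaround can be scripted on the source VM. The Python sketch below is a minimal example under several assumptions: it reads the 600-second figure as an upper bound on the gap between packets, it relies on a Linux-style `ping` that accepts `-c`, and the function names, the default 60-second interval, and the target address are illustrative choices rather than values from VMware.

```python
import subprocess
import time


def ping_once(target: str, run=subprocess.run) -> bool:
    """Send one ICMP echo with the system `ping` (Linux-style `-c` flag assumed)."""
    result = run(["ping", "-c", "1", target],
                 stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def keepalive(target, interval_s=60, rounds=None,
              run=subprocess.run, sleep=time.sleep):
    """Ping `target` every `interval_s` seconds and return the number of failures.

    `rounds=None` loops until interrupted. The interval is kept below
    600 seconds, following the figure given in the workaround.
    """
    if not 0 < interval_s < 600:
        raise ValueError("interval_s must be greater than 0 and below 600 seconds")
    failures = 0
    done = 0
    while rounds is None or done < rounds:
        if not ping_once(target, run=run):
            failures += 1
        done += 1
        sleep(interval_s)
    return failures


if __name__ == "__main__":
    # Example target taken from the neighbor lookup shown earlier in this article.
    keepalive("192.168.100.10", interval_s=60)
```

The `run` and `sleep` parameters exist only so the loop can be exercised without sending real traffic; in normal use the defaults apply.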