HCX - Network connectivity failure over Network Extension for MON enabled VM after manual HA failover



Article ID: 303662


Products

VMware HCX

VMware Cloud on AWS

Issue/Introduction

When performing a manual HA failover for HCX Network Extension appliances with Mobility Optimized Networking (MON) enabled, a network connectivity disruption can occur between cloud VMs and on-premises networks. This issue specifically impacts L3 traffic from cloud VMs to on-premises networks while L2 connectivity remains unaffected.

The following symptoms indicate this issue:

  • L3 traffic from cloud VMs with MON enabled to any on-premises subnet fails after manual HA failover
  • L2 traffic between cloud VMs and on-premises VMs on the same stretched network continues to work
  • Policy-based routing stops working after the failover event
  • Continuous ping tests from cloud VMs to on-premises VMs on different networks show packet loss
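The packet-loss symptom above can be checked in a scripted way from an affected cloud VM. The helper below is a minimal sketch that parses the packet-loss percentage out of a standard ping summary line; the target address in the usage comment is a placeholder.

```shell
#!/bin/sh
# Extract the packet-loss percentage from a ping summary line, e.g.
# "4 packets transmitted, 0 received, 100% packet loss, time 3056ms"
loss_pct() {
    printf '%s\n' "$1" | sed -n 's/.*, \([0-9][0-9]*\)% packet loss.*/\1/p'
}

# Typical usage on an affected cloud VM (ONPREM_IP is a placeholder
# for an on-premises VM on a different network):
#   summary=$(ping -c 4 "$ONPREM_IP" 2>/dev/null | tail -1)
#   [ "$(loss_pct "$summary")" = "100" ] && echo "L3 path to on-prem is down"
```

A 100% loss to on-premises subnets combined with working L2 pings on the stretched network points at this issue rather than a general outage.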

Environment

VMware HCX

VMware NSX

Cause

The root cause involves the interaction between HCX MON configuration and NSX-T policy routes. When MON is enabled, HCX configures policy routes on the NSX Tier-1 cloud gateway to handle traffic routing between cloud and on-premises resources. These routes are tied to logical ports associated with the active Network Extension appliance.

During a manual HA failover:

  1. The standby NE appliance becomes active and begins forwarding traffic
  2. The original active appliance is deactivated
  3. Due to a limitation in NSX-T versions prior to 4.0.0.0, the policy routes' logical port references are not automatically updated to point to the newly active appliance
  4. This results in traffic being dropped as it cannot reach the correct Network Extension appliance
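Step 3 can be confirmed by capturing the Tier-1 forwarding table (via the `get logical-router <router-id> forwarding` CLI used in the diagnostics section) before and after the failover and diffing the two captures. The comparison helper below is a hypothetical sketch; the route entries in the usage comment are placeholders.

```shell
#!/bin/sh
# Compare two saved copies of the logical router forwarding table,
# captured with "get logical-router <router-id> forwarding" before
# and after the failover, and print entries that disappeared.
changed_routes() {
    # Lines present only in the "before" capture indicate routes whose
    # next-hop / logical-port reference was not carried over.
    diff "$1" "$2" | sed -n 's/^< //p'
}

# Usage on an NSX edge node (router ID and filenames are placeholders):
#   get logical-router <router-id> forwarding > before.txt
#   ...perform the manual HA failover...
#   get logical-router <router-id> forwarding > after.txt
#   changed_routes before.txt after.txt
```

Any MON policy route that only appears in the "before" capture is a stale reference to the deactivated appliance.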

The issue can be identified through packet capture analysis showing:

  • ARP requests from the on-premises gateway reaching the cloud NSX edge but not the destination VM
  • Traffic flows stopping at specific points in the NSX datapath
  • Route table entries showing incorrect or missing next-hop information

Resolution

Short-term workaround

  1. Access the HCX Network Extension MON UI
  2. Navigate to Member Virtual Machines > Default Target Router Location
  3. Change the router location from cloud to on-premises gateway
  4. Change it back to cloud gateway
  5. Verify traffic restoration
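Step 5 can be automated with a small retry loop that pings an on-premises VM until the L3 path recovers. This is a generic sketch; the retry helper and the `ONPREM_IP` variable are hypothetical, not part of HCX.

```shell
#!/bin/sh
# Retry a command once per second until it succeeds or the attempt
# budget runs out; returns non-zero if the command never succeeded.
wait_for() {
    attempts=$1; shift
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# Example after toggling the Target Router Location back to the cloud
# gateway (ONPREM_IP is a placeholder for an on-premises VM):
#   wait_for 30 ping -c 1 -W 2 "$ONPREM_IP" && echo "traffic restored"
```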

Long-term solution

Upgrade to NSX-T 4.0.0.0 or later, which includes a fix for automatic policy route updates during HA failover events.

Additional Information

Diagnostic steps to confirm the issue

Use tcpdump or pktcap-uw on the NSX edge to verify traffic flow.

On the NSX edge, capture ARP traffic:

tcpdump -e -i vNic_2 -n -S arp

Check the NSX-T Manager logs for policy route configuration.

Look for policy route updates:

grep "PolicyConnectivity" /var/log/vmware/nsx-manager/manager.log

Examine the logical router port status before and after failover.

Check the logical router forwarding table:

get logical-router <router-id> forwarding

Monitor ARP resolution on affected networks.

Use pktcap-uw for detailed packet analysis:

pktcap-uw --switchport <port-id> --dir 1 --stage 0

Key log locations for troubleshooting

  • NSX-T Manager: /var/log/vmware/nsx-manager/manager.log
  • HCX: /var/log/vmware/hcx/
  • Network Extension appliance: system logs
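When escalating, the log locations above can be gathered into a single archive. The `collect_logs` helper below is a hypothetical convenience wrapper; the paths in the usage comment are the ones listed in this article.

```shell
#!/bin/sh
# Bundle one or more log files/directories into a gzipped tarball
# for offline analysis or a support case.
collect_logs() {
    out=$1; shift
    # Suppress tar's "Removing leading /" notice for absolute paths.
    tar -czf "$out" "$@" 2>/dev/null
}

# Example on an NSX Manager / HCX node:
#   collect_logs hcx-mon-logs.tgz \
#       /var/log/vmware/nsx-manager/manager.log \
#       /var/log/vmware/hcx/
```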

Impact scope

  • Only affects manual HA failover scenarios
  • Limited to L3 traffic with MON enabled
  • No impact on L2 connectivity or migration workflows
  • Configuration operations continue to function normally

Best practices

  • Perform failover testing in a controlled environment
  • Monitor traffic patterns before and after failover
  • Keep NSX-T updated to latest supported version
  • Document affected workloads for faster recovery

Indicators of the issue

When troubleshooting, look for these specific patterns in the logs:

In the NSX-T Manager logs:

StaticRoutingServiceImpl: Persisting config for new static route
DLRStaticRouteCCPFacadeImpl: Marking delete NextHop

In packet captures:

Missing ARP responses from the cloud VM:

Request who-has 192.168.x.x tell 192.168.x.x

Check NSX segment connectivity status.

Look for connectivity state changes:

type="DISCONNECTED","connectivity":"OFF"