Troubleshooting Inter-Site VM Connectivity on HCX Layer 2 Extensions (L2E) Using PCAP

Products

VMware HCX

Issue/Introduction

Virtual machines (VMs) connected to an HCX Layer 2 Extension (L2E) network experience connectivity failures across sites. Symptoms include the inability to ping or pass TCP/UDP traffic between source and destination VMs residing on the extended L2 segment.

HCX L2E encapsulates guest VM traffic in UDP 4500 (IPsec) for transport across the underlay network, so guest-level diagnostics such as ping and traceroute cannot by themselves identify the fault domain. A structured packet capture (PCAP) methodology on the HCX Network Extension (NE) appliances is needed to trace both the unencapsulated and the encapsulated datapath. This isolates whether the packet drop occurs at the source hypervisor vDS, the local NE appliance internal interfaces, the transit WAN underlay, the destination NE appliance, or the destination hypervisor.

Environment

HCX

Cause

Inter-site communication failure on an HCX Layer 2 Extension occurs when the end-to-end datapath is interrupted. Due to the architecture of the HCX NE, packet loss can occur at multiple discrete enforcement boundaries. Specific root causes targeted for isolation via this methodology include:

- Underlay MTU Truncation: Fragmentation or ungraceful dropping of encapsulated UDP 4500 packets across the physical WAN or underlay network due to insufficient MTU sizing along the transit path.
- Underlay Security Policies: Physical or logical firewalls dropping UDP port 4500 (IPsec) traffic between the Source NE Uplink IP and Destination NE Uplink IP addresses.
- Data Plane Resource Exhaustion: NE appliance CPU or memory contention resulting in internal Rx/Tx ring buffer drops before packets can be successfully encapsulated or decapsulated.
- L2 Forwarding Failures: Stale MAC learning tables or improper ARP resolution on the backing vSphere Distributed Switch (vDS) or NSX logical switch, preventing source VM traffic from arriving at the internal interface of the source NE appliance.
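
One quick way to probe the underlay MTU cause listed above is a ping with the Don't Fragment bit set, sized to the MTU being tested. The sketch below only does the arithmetic (ICMP payload = MTU minus 20 bytes of IPv4 header without options and 8 bytes of ICMP header) and prints candidate commands for review. The function name, the MTU value, and the peer address 203.0.113.20 are placeholders, not values from this article, and which host the probe can be run from depends on what can reach the remote NE uplink network.

```shell
#!/bin/sh
# icmp_payload_for_mtu MTU
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet
# of the given MTU (20-byte IPv4 header without options + 8-byte ICMP header).
icmp_payload_for_mtu() {
    echo $(( $1 - 20 - 8 ))
}

PEER="203.0.113.20"   # placeholder: an address on the remote uplink network
MTU=1500              # MTU under test (assumption, adjust to your underlay)
SIZE=$(icmp_payload_for_mtu "$MTU")

# Print, rather than run, the commands so they can be reviewed first.
echo "Linux host: ping -M do -s $SIZE -c 3 $PEER"
echo "ESXi host:  vmkping -d -s $SIZE -c 3 $PEER"
```

If a full-size probe fails while a smaller one succeeds, the path MTU is likely lower than the tested value. ICMP may itself be filtered on the underlay, so a failed probe is a hint rather than proof.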

Resolution

Use the following packet capture (PCAP) methodology on the HCX Network Extension (NE) appliances to isolate the point of failure along the datapath.

Note: This assumes familiarity with ESXi, NSX, and HCX basics, including ESXi networking (vSwitch/VDS, portgroups, vmnic uplinks, pktcap-uw), NSX constructs (segments and gateways), and the ability to identify HCX NE interfaces such as uplink, tunnel, and L2E sink ports handling Layer 2 extension traffic.

The following example environment is assumed throughout this procedure.

The HCX deployment consists of a single Service Mesh with only the Network Extension (NE) service enabled. The mesh includes one NE appliance pair, and VLAN 192 is extended across sites using Layer 2 Extension (L2E).

Src VM:
- IP: 192.168.1.10
- MAC: AA:AA:AA:AA:AA:AA
Dst VM:
- IP: 192.168.1.20
- MAC: BB:BB:BB:BB:BB:BB
Default Gateway:
- IP: 192.168.1.1 (on-prem)
- MAC: DD:DD:DD:DD:DD:DD
- VLAN: 192
MON (Mobility Optimized Networking): disabled
Test traffic: ping from Src VM (192.168.1.10) to Dst VM (192.168.1.20)
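
The example values above can be turned into capture filters that are reused at every capture point. The following is a minimal sketch: the helper names are made up for illustration, and the uplink IPs 198.51.100.10 and 203.0.113.20 are placeholders because this article does not list the NE uplink addresses. The optional VLAN argument is for capture points where frames still carry the 802.1Q tag; in pcap filter syntax the `vlan` keyword must come first so that the remaining terms are matched inside the tagged frame. Whether frames are tagged at a given point depends on the environment.

```shell
#!/bin/sh
# inner_filter SRC_IP DST_IP [VLAN_ID]
# Filter for the unencapsulated ping between the two VMs.
inner_filter() {
    f="icmp and host $1 and host $2"
    if [ -n "${3:-}" ]; then
        f="vlan $3 and $f"
    fi
    echo "$f"
}

# outer_filter LOCAL_UPLINK_IP REMOTE_UPLINK_IP
# Filter for the encapsulated tunnel traffic between the NE uplinks.
outer_filter() {
    echo "udp port 4500 and host $1 and host $2"
}

inner_filter 192.168.1.10 192.168.1.20
inner_filter 192.168.1.10 192.168.1.20 192
outer_filter 198.51.100.10 203.0.113.20   # placeholder uplink IPs
```

The printed expressions can be passed as the filter argument to tcpdump or tcpdump-uw, for example `tcpdump -ni vNic_1 "$(outer_filter 198.51.100.10 203.0.113.20)"`.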

The end-to-end traffic flow (on-prem VM → cloud VM on the extended network) is shown below, along with the key locations where packet captures can be performed for analysis.

Packet Capture Points in OnPrem Environment:

Capture Point 1: SSH to the ESXi host where the Src VM is up and running

Src VM (vNIC) → vSwitch/VDS (Portgroup – VLAN 192) → Physical NIC (vmnic – uplink) →

Capture Point 2: Requires involving the physical network admin/vendor

→ Physical Network (ToR / upstream switch) →

Capture Point 3: SSH to the ESXi host where the HCX NE appliance is up and running

→ Physical NIC (vmnic – HCX NE Sink Port Uplink) → vSwitch/VDS (HCX NE Sink Port Network) → HCX NE appliance L2E Sink Port (vNIC)

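Note: If the HCX tunnel keeps dropping or going down, capture UDP 4500 traffic at Capture Point 4 and Capture Point 6 (and, where possible, at Capture Point 5 between them). Because this traffic is encrypted, only the outer headers are visible at these points, so filter on UDP port 4500 rather than on the VM addresses.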

Capture Point 4: SSH to the ESXi host where the HCX NE appliance is up and running.

→ HCX NE Appliance Uplink Interface(vNIC) → vSwitch/VDS (HCX NE Uplink Network) → Physical NIC (vmnic – HCX NE host) →

Packet Capture Points in Cloud Environment:

Capture Point 5: Requires involving the physical network admin/vendor managing the underlay network. For example, this would be the Microsoft team if the site-to-site connectivity between on-prem and Azure is managed via ExpressRoute.

→ Physical Network (ToR / upstream switch / ExpressRoute, etc.) →

Note: This is where the troubleshooting can become more complex, as it depends on how the environment is designed and deployed. In some cases, you may need to inspect the Tier-0 (T0) and Tier-1 (T1) gateways in NSX to trace the traffic flow.

By performing the packet capture outlined below, you can determine whether the UDP traffic is being received at the next capture point. Based on these observations, you can identify where the traffic is being dropped or not forwarded, and accordingly decide whether additional packet captures are required at the T0/T1 level to further isolate the issue.
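
The comparison described above can be sketched as a count of encapsulated packets at two adjacent capture points. This is an illustration only: the function names and file names are hypothetical, the counts are a coarse indicator (both captures must cover the same time window), and `count_udp4500` assumes a tcpdump build on the analysis machine that can read the saved capture format and that the sender's address is not rewritten by NAT between the two points.

```shell
#!/bin/sh
# count_udp4500 FILE SRC_IP
# Number of UDP 4500 packets sent by SRC_IP in a saved capture.
count_udp4500() {
    tcpdump -nn -r "$1" "udp port 4500 and src host $2" 2>/dev/null | wc -l | tr -d ' '
}

# compare_points SENT RECEIVED
# Turns two packet counts into a short verdict.
compare_points() {
    sent=$1
    received=$2
    if [ "$sent" -eq 0 ]; then
        echo "nothing sent: look upstream of the first capture point"
    elif [ "$received" -eq 0 ]; then
        echo "nothing received: traffic is dropped between the two points"
    elif [ "$received" -lt "$sent" ]; then
        echo "partial loss: $((sent - received)) of $sent packets missing"
    else
        echo "no loss seen between the two points"
    fi
}

# Example with hypothetical file names and a placeholder uplink IP:
#   compare_points "$(count_udp4500 cp4.pcapng 198.51.100.10)" \
#                  "$(count_udp4500 cp6.pcapng 198.51.100.10)"
compare_points 120 0
```

A "nothing received" or "partial loss" verdict between Capture Point 4 and Capture Point 6 points at the underlay, which is where captures at the T0/T1 level or by the network provider become relevant.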

Capture Point 6: SSH to the ESXi host where the HCX NE appliance is up and running.

→ Physical NIC (vmnic – HCX NE Uplink) → vSwitch/VDS (HCX NE Uplink Network) → HCX NE Appliance Uplink Interface(vNIC)

→ HCX NE appliance L2E Sink Port (vNIC) → NSX Segment (Extended VLAN 192) → Physical NIC (vmnic – NE Sink Port Uplink) →

Capture Point 7: Requires involving the physical network admin/vendor

→ Physical Network (ToR / upstream switch) →

Capture Point 8: SSH to the ESXi host where the Dst VM is up and running

→ Physical NIC (vmnic – Dst VM Uplink) → NSX Segment (Extended VLAN 192) → Dst VM (vNIC)

Below are generic commands that can be used across the various components:

VM vNIC:
- For Windows Guest OS, use Wireshark
- For Linux, use tcpdump
- For custom OS, use vendor-specific tools and commands
NE vNIC:
- Within the HCX NE appliance, you can also run tcpdump. Examples are shown below:
  - Uplink network interface:
    - Handles the transmission and reception of encapsulated traffic between HCX NE appliances across sites.
    - tcpdump -ni vNic_# <filters here>
      - example (established/UP IPsec tunnel): tcpdump -ni vNic_1 udp and port 4500
  - Sink Port network interface:
    - Used by the HCX NE appliance to send and receive unencapsulated traffic for the stretched (L2 extended) network.
    - tcpdump -ni vNic_# <filters here>
      - example: tcpdump -ni vNic_2 | grep -i 192.168.1.20
VM or NE SwitchPort:
- pktcap-uw --switchport ######## --capture VnicTx,VnicRx -o /vmfs/volumes/FULL_PATH_TO_DATASTORE/Case_12345678/esxi.switchport.########.VnicTxRx.pcapng &
  Example for directing the output to the screen:
  
  pktcap-uw --switchport ######## --capture VnicTx,VnicRx -o - | tcpdump-uw -r - -enn
VM or NE VMNIC:
- pktcap-uw --uplink vmnic# --capture UplinkSndKernel,UplinkRcvKernel -o /vmfs/volumes/FULL_PATH_TO_DATASTORE/Case_#/esxi.uplink.vmnic#.UplinkSndRcvKernel.pcapng &
  
  Example for directing the output to the screen. This command also captures both sent and received traffic:
  
  pktcap-uw --uplink vmnic# --capture UplinkSndKernel,UplinkRcvKernel -o - | tcpdump-uw -r - -enn

For more information on using the pktcap-uw tool for packet capture and analysis, please refer to the KB article: Packet capture on ESXi using the pktcap-uw tool
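
The ######## placeholder in the --switchport commands above is the numeric port ID of the VM or NE vNIC on the ESXi host. On ESXi, `net-stats -l` lists the port IDs together with the client names. The helper below is a minimal sketch that picks the port number out of that listing; the function name is made up for illustration, and it assumes the port number is the first column of each row and that the VM or appliance name appears somewhere in the same row.

```shell
#!/bin/sh
# find_switchport NAME
# Reads `net-stats -l` style output on stdin and prints the port number
# (first column) of every row whose text contains NAME.
find_switchport() {
    awk -v name="$1" 'index($0, name) > 0 { print $1 }'
}

# On the ESXi host (the VM name is a placeholder):
#   net-stats -l | find_switchport "SrcVM"
```

A VM with several vNICs returns several port numbers, so check the MAC address column of the listing against the expected MAC before starting the capture. Captures started with a trailing & keep running in the background until they are stopped; one commonly documented way to stop them is `kill $(lsof | grep pktcap-uw | awk '{print $1}' | sort -u)`.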

Where possible, start capturing at the VM or NE switch port on the ESXi host rather than inside the guest OS or the HCX NE appliance. This avoids the extra access and file-extraction steps, since the HCX appliances require SSH access via HCX Manager.