Troubleshooting Hyperscaler Connectivity Issues

Article ID: 392503


Updated On:

Products

VMware Cloud on AWS, VMware NSX, VMware NSX Networking, VMware NSX-T Data Center

Issue/Introduction

  • Connectivity between public cloud and On-Premises environments involves three basic areas.  Each area carries different responsibilities for customers and vendors.
  • The local, physical infrastructure (On-Premises) is the customer's area of responsibility.
  • The On-Premises virtual environment, or Private Cloud, is the responsibility of the VMware administrator.
  • The VMware virtual environment (SDDC) supported in the public cloud is also the responsibility of the VMware administrator.
  • The public cloud (hyperscaler) provider's private network circuits (e.g. Direct Connect, Global Reach) and physical infrastructure are the responsibility of the vendor (e.g. AWS, Azure, Oracle).
  • The VMware administrator must be able to show that the virtualized environment is delivering packets and performance as designed, within the limitations of the On-Premises and cloud environments.
  • This article gives a general method to demonstrate that packets are leaving the areas of responsibility of the VMware products and are being handed off to the physical infrastructure.
  • Once this is demonstrated, the investigation shifts to the On-Premises network team and the hyperscaler vendor to troubleshoot what occurs after traffic leaves the VMware area of responsibility.



Environment

NSX

Cause

  • The private connection provided by the hyperscaler vendor can suffer from misconfigurations or component failures.  This has a direct effect on network traffic between the two sites.
  • The hyperscaler vendor's infrastructure can have physical or logical issues.
  • The VMware virtual environment can also suffer from misconfigurations or unexpected outcomes after upgrades.

Resolution

The resolution is to identify where the issue lies.  This is done with packet captures that follow packets as they enter and leave the VMware environment.
The goal is to prove that the VMware environment is sending and receiving packets normally.  This answers the question of whether VMware is the issue and exposes where the problem may be occurring, providing initial data for the On-Premises network team or hyperscaler vendor to focus their investigation.
The same method of investigation applies to the HCX appliances.

Steps:

  1. From both the source and the destination, run the traceroute command from within the VM's operating system.  This helps identify the data path to be explored with the packet captures (see the example after these steps).
    It can also expose asymmetric routing issues; the paths should have the same number of hops in both directions.
    Command:
        Windows:  tracert <IP>
        Linux:    traceroute <IP>
  2. Choose two endpoints that best represent the data path presenting the issue. These endpoints are the source and destination IP addresses to be used for the investigation.
    Decide which endpoint will be the source and which will be the destination. Testing will be done to emulate the faulty connection.
  3. The focus here is to confirm that packets for a given connection are leaving the VM, crossing the vDS and leaving the host via its uplink.
    Consider this the source-side testing that proves packets exist and are delivered to the On-Premises network infrastructure (a worked example follows these steps).

    Commands:
        pktcap-uw --switchport <Switchport ID> --capture VnicTx,VnicRx --ng -o - | tcpdump-uw -enr -
        pktcap-uw --uplink <vmnicX> --capture UplinkSndKernel,UplinkRcvKernel --ng -o - | tcpdump-uw -enr -
  4. Prove that the destination is receiving incoming packets from the source and responding.

    If the packets have been proven to leave the source and exit the host to the physical network infrastructure, the same principle can be applied on the cloud side. 
    Determine if the response packets are leaving the VM, the host, and the SDDC.
    The packets need to be verified at each step of the data path from the destination VM until leaving the SDDC via the host supporting the NSX Edge device.
    Commands:
        pktcap-uw --switchport <Switchport ID> --capture VnicTx,VnicRx --ng -o - | tcpdump-uw -enr -
        pktcap-uw --uplink <vmnicX> --capture UplinkSndKernel,UplinkRcvKernel --ng -o - | tcpdump-uw -enr -
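
Example (hypothetical values):

    A minimal worked example of steps 1, 3 and 4 above, assuming a hypothetical source VM with Switchport ID 67108899 on host uplink vmnic0, and a destination VM at 10.10.1.50.  These IDs and addresses are placeholders only; substitute the values from the environment under test.

    Commands:
        tracert 10.10.1.50                    (Windows source VM)
        traceroute 10.10.1.50                 (Linux source VM)
        net-stats -l                          (one common way to list the host's switchports; the PortNum column holds the VM's Switchport ID)
        pktcap-uw --switchport 67108899 --capture VnicTx,VnicRx --ng -o - | tcpdump-uw -enr - host 10.10.1.50
        pktcap-uw --uplink vmnic0 --capture UplinkSndKernel,UplinkRcvKernel --ng -o - | tcpdump-uw -enr - host 10.10.1.50

    If the test flow appears in both the switchport and uplink captures, the source side is delivering packets to the physical network infrastructure.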

   

Edge Capture:

  1. >get logical-routers                        Record the VRF of the logical router
  2. >vrf <VRF ID>                               Typically the T1 or T0 interface capture point will be the default gateway of the destination VM.
  3. >get interfaces | more                      Record the Interface UUID
  4. >exit
  5. >start capture interface <Interface UUID> direction dual expression <ip | host | ipproto 0xY | mac | port ...>     (see the example session below)
  • This capture will prove whether or not the packets reach the destination.
  • These observations will be used to determine where packets fail to traverse the data path.
  • The hop where packets fail to traverse is where the issue will be found.
  • If the packets are all being received and are exiting the VMware environment, then the issue must lie somewhere in the physical infrastructure between the sites.
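
Example (hypothetical values):

    A minimal example capture session on the Edge node, assuming a hypothetical VRF ID of 3 and the test destination at 10.10.1.50; the Interface UUID is the value recorded in step 3, and the expression should match the endpoints chosen for the investigation.

        >get logical-routers
        >vrf 3
        >get interfaces | more
        >exit
        >start capture interface <Interface UUID> direction dual expression host 10.10.1.50

    The presence or absence of the test flow in this capture shows whether packets are traversing this point of the data path in both directions.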

Present the data and observations that show the focus of the investigation is now on the physical infrastructure.
     

If the packet captures show that packets are handed off to the physical infrastructure correctly, then the issue is outside of VMware components.



Additional Information

Network Extension External Network L2 Issue

Network Extension External Network L3 Issue