Datapath Capture to Diagnose Datapath Connectivity Issues in NSX Environments

Article ID: 406414

Updated On:

Products

VMware NSX

Issue/Introduction

This article provides a structured approach to collect essential data, logs, and packet captures when experiencing datapath issues such as ping loss, intermittent connectivity, latency, or network disconnects affecting virtual machines in an NSX-T environment.

Purpose:
To effectively diagnose and isolate network connectivity problems, particularly those affecting a subset of virtual machines.

These captures are essential for Broadcom Support to diagnose datapath connectivity issues.

Environment

VMware NSX-T Data Center
VMware NSX

Resolution

Stage 1: Initial Network Trace (NSX UI)

Perform a Traceflow from the NSX Manager UI (Plan & Troubleshoot > Traceflow). This initial step helps visualize the logical path and identify potential drops or misconfigurations within the NSX overlay.
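If the UI is unavailable or the test needs to be scripted, a Traceflow can also be started through the NSX REST API. The following is a minimal sketch only: the /api/v1/traceflows endpoint and FieldsPacketData payload are based on the NSX-T Data Center manager API and should be verified against the API guide for your version, and <nsx-manager>, the source logical port UUID, and the IP addresses are placeholders.

      #curl -k -u admin -X POST "https://<nsx-manager>/api/v1/traceflows" \
         -H "Content-Type: application/json" \
         -d '{"lport_id": "<source_lport_UUID>",
              "packet": {"resource_type": "FieldsPacketData",
                         "transport_type": "UNICAST",
                         "ip_header": {"src_ip": "<affected_VM_IP>", "dst_ip": "<destination_IP>"}}}'

The response contains a traceflow ID; its observations can then be read back with a GET on /api/v1/traceflows/<traceflow_ID>/observations (again, verify against your version's API guide).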

Stage 2: Logical Connectivity Diagnostic Tests

Perform the following tests from the impacted virtual machines to help isolate the problem area; a combined command-line sketch follows this list.

  1. Ping Tests (from Affected VM):

    • Same Network, Same Host: Ping another VM on the same logical segment, on the same ESXi host.
    • Same Network, Different Host: Ping another VM on the same logical segment, on a different ESXi host.
    • Different Networks, Same Host: Ping a VM on a different logical segment, on the same ESXi host.
    • Different Networks, Different Hosts: Ping a VM on a different logical segment, on a different ESXi host.
  2. Gateway and Internet Reachability (from Affected VM):

    • Ping the default gateway configured for the affected VM.
    • Ping a known reliable external IP address (e.g., 8.8.8.8) to validate North-South connectivity.
  3. Traceroute & Host-Level Data (from Affected VM & ESXi Host):

    • Run a traceroute from the affected VM to the destination IP.
    • Generate a Tech Support Bundle from the ESXi host during the issue.
    • Collect the host's ARP table:
      #esxcli network ip neighbor list
    • Note: If uptime is critical, consider migrating the majority of affected VMs to other hosts, retaining only a few non-critical VMs on the problematic host for continued diagnostics.
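As an illustration, the Stage 2 tests above could be run from a Linux guest on the affected VM as follows (all IP addresses are hypothetical placeholders; Windows guests would use ping -n and tracert instead):

      #ping -c 5 <same_segment_same_host_VM_IP>     (same network, same host)
      #ping -c 5 <same_segment_diff_host_VM_IP>     (same network, different host)
      #ping -c 5 <diff_segment_same_host_VM_IP>     (different network, same host)
      #ping -c 5 <diff_segment_diff_host_VM_IP>     (different networks, different hosts)
      #ping -c 5 <default_gateway_IP>               (gateway reachability)
      #ping -c 5 8.8.8.8                            (North-South reachability)
      #traceroute <destination_IP>

Record which of these succeed and which fail; the failure pattern narrows the fault to the segment, the host, the distributed router, or the North-South path.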

Stage 3: In-Depth Host Diagnostics

Before migrating all VMs off the host, perform host-level diagnostics if the issue is suspected to be host-specific.

  1. Prepare the Host:

    • Set DRS to Manual for the affected host's cluster to prevent VMs from migrating onto it inadvertently.
  2. VM Switch Port Mapping:

    • Collect VM switch port details (illustrative output follows this list):
      #net-stats -l
  3. Live Kernel Dump:

    • Perform a live kernel dump (use with caution, for severe host issues only):
      #localcli --plugin-dir /usr/lib/vmware/esxcli/int debug livedump perform
  4. Esxtop Data Collection:

    • Collect Esxtop data for 2 minutes (60 iterations, 2-second delay):
      #/usr/sbin/esxtop -b -d 2 -n 60 > /vmfs/volumes/<Volume_ID>/$(hostname)_$(date +"%Y_%m_%d_%I_%M_%p").csv
      Replace <Volume_ID> with your actual datastore ID.
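For orientation, the net-stats -l output from step 2 looks roughly like the following (all values are hypothetical); the PortNum shown for the affected VM's vNIC is the switchport ID that pktcap-uw expects in Stage 5:

      PortNum          Type SubType SwitchName       MACAddress         ClientName
      50331650            4       0 DvsPortset-0     00:50:56:aa:bb:01  vmnic0
      50331662            5       9 DvsPortset-0     00:50:56:aa:bb:02  web-vm-01.eth0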

Stage 4: VM and Network Analysis

If most VMs have been migrated, continue analysis with a few non-critical VMs remaining on the host.

  1. Console-Level Tests (from Affected VM's Console via vSphere UI):

    • Ping external IPs (e.g., 8.8.8.8).
    • Run traceroute to a destination IP.
    • Check for blocked ports:
      #net-dvs -l | grep -E "port |port.block|volatile.vlan|volatile.status"
  2. Identify Switchport and Uplink Info (from ESXi host SSH):

    • Using nsxdp-cli:
      #nsxdp-cli vswitch instance list
    • Using esxcli (alternate method):
      #esxcli network vm list
      Note the World ID of the affected VM, then use it in:
      #esxcli network vm port list -w <world_ID>
    • Note down the Port ID, DVPort ID, and uplink MAC address (an illustrative example follows this list).
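As a worked example, the two esxcli commands above might return output like the following (all values are hypothetical). The World ID from the first command feeds the second, which yields the Port ID, DVPort ID, and team uplink needed for the Stage 5 captures:

      #esxcli network vm list
      World ID  Name        Num Ports  Networks
      --------  ----------  ---------  ------------------
      2101876   web-vm-01   1          Prod-Segment-01

      #esxcli network vm port list -w 2101876
         Port ID: 50331662
         vSwitch: DvsPortset-0
         Portgroup: Prod-Segment-01
         DVPort ID: 284
         MAC Address: 00:50:56:aa:bb:02
         Team Uplink: vmnic0
         Uplink Port ID: 50331650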

Stage 5: Packet Capture Workflow

Packet captures are crucial for deep-dive network analysis. Ensure that timestamps can be correlated across all captures; a sketch of a timestamped reference ping, and of how to stop the captures, follows this list.

  1. Pre-Capture Setup:

    • Start a continuous ping from a working VM to the affected VM.
    • Start a background, continuous ping from the affected VM to another VM on a different host.
  2. Guest VM Packet Captures:

    • From Working Source VM (Guest OS):
      #tcpdump -i <eth_interface_name> -nn host <target_IP> -w /<filesystem>/Source_Guest_OS.pcap
    • From Affected VM (Guest OS):
      #ping <target_VM_IP> &    (run the ping in the background)
      #tcpdump -i <eth_interface_name> -nn -w /<filesystem>/Guest_OS.pcap
  3. ESXi Host Packet Captures:

    • At vNIC (VM Network Interface Card):
      #pktcap-uw --switchport <SwitchportID> --capture VnicTx,VnicRx -o /vmfs/volumes/<datastore>/<VM>_VnicTxRx.pcap
    • At vSwitch Port (to/from vNIC):
      #pktcap-uw --switchport <SwitchportID> --ng -o /vmfs/volumes/<datastore>/<VM>-VswitchTxExit.pcap
      #pktcap-uw --switchport <SwitchportID> --dir 1 --ng -o /vmfs/volumes/<datastore>/<VM>-VswitchRx.pcap
    • Specific Source IP Capture:
      #pktcap-uw --switchport <SwitchportID> --srcip <src_IP> --capture VnicRx --ng -o /vmfs/volumes/<datastore>/<SRC_IP>_<DST_IP>-VnicRxEntry.pcap
    • At Uplink (Port Output/Input):
      #pktcap-uw --uplink <vmnic#> --capture PortOutput,PortInput -o /vmfs/volumes/<datastore>/<VM>_<vmnic#>_PortIO.pcap
    • At Physical NIC Entry/Exit (Kernel Level):
      #pktcap-uw --uplink <vmnic#> --capture UplinkRcvKernel,UplinkSndKernel -o /vmfs/volumes/<datastore>/<VM>_<vmnic#>_VmnicRxTx.pcap
    • At vSwitch Dispatch Stage:
      #pktcap-uw --uplink <vmnic#> --dir 0 --stage 1 -o /vmfs/volumes/<datastore>/<VM>_<vmnic#>_EtherswitchDispatch.pcap
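For the pre-capture pings in item 1, a timestamped variant makes later correlation with the pcap files straightforward. A minimal sketch, assuming a Linux guest with iputils ping (the file path is a placeholder):

      #ping -D <affected_VM_IP> | tee /<filesystem>/ping_to_affected.txt    (-D prefixes each reply with a Unix timestamp)

When the captures are complete, any pktcap-uw sessions still running in the background on the ESXi host can be stopped with:

      #kill $(lsof | grep pktcap-uw | awk '{print $1}' | sort -u)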

Additional Information