Datapath Capture to Diagnose Datapath Connectivity Issues in NSX Environments

Article ID: 406414

Updated On:

Products

VMware NSX

Issue/Introduction

This article provides a structured approach to collect essential data, logs, and packet captures when experiencing datapath issues such as ping loss, intermittent connectivity, latency, or network disconnects affecting virtual machines in an NSX-T environment.

Purpose:
To effectively diagnose and isolate network connectivity problems, particularly those affecting a subset of virtual machines.

These captures are essential for Broadcom Support to diagnose datapath connectivity issues.

Environment

VMware NSX-T Data Center
VMware NSX

Resolution

Stage 1: Initial Network Trace (NSX UI)

Perform a Traceflow from the NSX Manager UI (Plan & Troubleshoot > Traceflow). This initial step helps visualize the logical path and identify potential drops or misconfigurations within the NSX overlay.
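If the UI is unavailable or the test needs to be scripted, a Traceflow can also be started through the NSX REST API. The following is a minimal sketch only: the /api/v1/traceflows endpoint and FieldsPacketData payload are based on the NSX-T Data Center manager API and should be verified against the API guide for your version, and <nsx-manager>, the source logical port UUID, and the IP addresses are placeholders.

      #curl -k -u admin -X POST "https://<nsx-manager>/api/v1/traceflows" \
         -H "Content-Type: application/json" \
         -d '{"lport_id": "<source_lport_UUID>",
              "packet": {"resource_type": "FieldsPacketData",
                         "transport_type": "UNICAST",
                         "ip_header": {"src_ip": "<affected_VM_IP>", "dst_ip": "<destination_IP>"}}}'

The response contains a traceflow ID; its observations can then be read back with a GET on /api/v1/traceflows/<traceflow_ID>/observations (again, verify against your version's API guide).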

Stage 2: Logical Connectivity Diagnostic Tests

Perform the following tests from the impacted virtual machines to help isolate the problem area; a combined command-line sketch follows this list.

  1. Ping Tests (from Affected VM):

    • Same Network, Same Host: Ping another VM on the same logical segment, on the same ESXi host.
    • Same Network, Different Host: Ping another VM on the same logical segment, on a different ESXi host.
    • Different Networks, Same Host: Ping a VM on a different logical segment, on the same ESXi host.
    • Different Networks, Different Hosts: Ping a VM on a different logical segment, on a different ESXi host.
  2. Gateway and Internet Reachability (from Affected VM):

    • Ping the default gateway configured for the affected VM.
    • Ping a known reliable external IP address (e.g., 8.8.8.8) to validate North-South connectivity.
  3. Traceroute & Host-Level Data (from Affected VM & ESXi Host):

    • Run a traceroute from the affected VM to the destination IP.
    • Generate a Tech Support Bundle from the ESXi host during the issue.
    • Collect the host's ARP table:
      #esxcli network ip neighbor list
    • Note: If uptime is critical, consider migrating the majority of affected VMs to other hosts, retaining only a few non-critical VMs on the problematic host for continued diagnostics.
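As an illustration, the Stage 2 tests above could be run from a Linux guest on the affected VM as follows (all IP addresses are hypothetical placeholders; Windows guests would use ping -n and tracert instead):

      #ping -c 5 <same_segment_same_host_VM_IP>     (same network, same host)
      #ping -c 5 <same_segment_diff_host_VM_IP>     (same network, different host)
      #ping -c 5 <diff_segment_same_host_VM_IP>     (different network, same host)
      #ping -c 5 <diff_segment_diff_host_VM_IP>     (different networks, different hosts)
      #ping -c 5 <default_gateway_IP>               (gateway reachability)
      #ping -c 5 8.8.8.8                            (North-South reachability)
      #traceroute <destination_IP>

Record which of these succeed and which fail; the failure pattern narrows the fault to the segment, the host, the distributed router, or the North-South path.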

Stage 3: In-Depth Host Diagnostics

Before migrating all VMs off the host, perform host-level diagnostics if the issue is suspected to be host-specific.

  1. Prepare the Host:

    • Set DRS to Manual for the affected host's cluster to prevent VMs from migrating onto it inadvertently.
  2. VM Switch Port Mapping:

    • Collect VM switch port details (illustrative output follows this list):
      #net-stats -l
  3. Live Kernel Dump:

    • Perform a live kernel dump (use with caution, for severe host issues only):
      #localcli --plugin-dir /usr/lib/vmware/esxcli/int debug livedump perform
  4. Esxtop Data Collection:

    • Collect Esxtop data for 2 minutes (60 iterations, 2-second delay):
      #/usr/sbin/esxtop -b -d 2 -n 60 > /vmfs/volumes/<Volume_ID>/$(hostname)_$(date +"%Y_%m_%d_%I_%M_%p").csv
      Replace <Volume_ID> with your actual datastore ID.
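For orientation, the net-stats -l output from step 2 looks roughly like the following (all values are hypothetical); the PortNum shown for the affected VM's vNIC is the switchport ID that pktcap-uw expects in Stage 5:

      PortNum          Type SubType SwitchName       MACAddress         ClientName
      50331650            4       0 DvsPortset-0     00:50:56:aa:bb:01  vmnic0
      50331662            5       9 DvsPortset-0     00:50:56:aa:bb:02  web-vm-01.eth0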

Stage 4: VM and Network Analysis

If most VMs have been migrated, continue analysis with a few non-critical VMs remaining on the host.

  1. Console-Level Tests (from Affected VM's Console via vSphere UI):

    • Ping external IPs (e.g., 8.8.8.8).
    • Run traceroute to a destination IP.
    • Check for blocked ports:
      #net-dvs -l | grep -E "port |port.block|volatile.vlan|volatile.status"
  2. Identify Switchport and Uplink Info (from ESXi host SSH):

    • Using nsxdp-cli:
      #nsxdp-cli vswitch instance list
    • Using esxcli (alternate method):
      #esxcli network vm list
      Note the World ID of the affected VM, then use it in:
      #esxcli network vm port list -w <world_ID>
    • Note down the Port ID, DVPort ID, and uplink MAC address (an illustrative example follows this list).
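As a worked example, the two esxcli commands above might return output like the following (all values are hypothetical). The World ID from the first command feeds the second, which yields the Port ID, DVPort ID, and team uplink needed for the Stage 5 captures:

      #esxcli network vm list
      World ID  Name        Num Ports  Networks
      --------  ----------  ---------  ------------------
      2101876   web-vm-01   1          Prod-Segment-01

      #esxcli network vm port list -w 2101876
         Port ID: 50331662
         vSwitch: DvsPortset-0
         Portgroup: Prod-Segment-01
         DVPort ID: 284
         MAC Address: 00:50:56:aa:bb:02
         Team Uplink: vmnic0
         Uplink Port ID: 50331650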

Stage 5: Packet Capture Workflow

Packet captures are crucial for deep-dive network analysis. Ensure that timestamps can be correlated across all captures; a sketch of a timestamped reference ping, and of how to stop the captures, follows this list.

  1. Pre-Capture Setup:

    • Start a continuous ping from a working VM to the affected VM.
    • Start a background, continuous ping from the affected VM to another VM on a different host.
  2. Guest VM Packet Captures:

    • From Working Source VM (Guest OS):
      #tcpdump -i <eth_interface_name> -nn host <target_IP> -w /<filesystem>/Source_Guest_OS.pcap
    • From Affected VM (Guest OS):
      #ping <target_VM_IP> &    (run the ping in the background)
      #tcpdump -i <eth_interface_name> -nn -w /<filesystem>/Guest_OS.pcap
  3. ESXi Host Packet Captures:

    • At vNIC (VM Network Interface Card):
      #pktcap-uw --switchport <SwitchportID> --capture VnicTx,VnicRx -o /vmfs/volumes/<datastore>/<VM>_VnicTxRx.pcap
    • At vSwitch Port (to/from vNIC):
      #pktcap-uw --switchport <SwitchportID> --ng -o /vmfs/volumes/<datastore>/<VM>-VswitchTxExit.pcap
      #pktcap-uw --switchport <SwitchportID> --dir 1 --ng -o /vmfs/volumes/<datastore>/<VM>-VswitchRx.pcap
    • Specific Source IP Capture:
      #pktcap-uw --switchport <SwitchportID> --srcip <src_IP> --capture VnicRx --ng -o /vmfs/volumes/<datastore>/<SRC_IP>_<DST_IP>-VnicRxEntry.pcap
    • At Uplink (Port Output/Input):
      #pktcap-uw --uplink <vmnic#> --capture PortOutput,PortInput -o /vmfs/volumes/<datastore>/<VM>_<vmnic#>_PortIO.pcap
    • At Physical NIC Entry/Exit (Kernel Level):
      #pktcap-uw --uplink <vmnic#> --capture UplinkRcvKernel,UplinkSndKernel -o /vmfs/volumes/<datastore>/<VM>_<vmnic#>_VmnicRxTx.pcap
    • At vSwitch Dispatch Stage:
      #pktcap-uw --uplink <vmnic#> --dir 0 --stage 1 -o /vmfs/volumes/<datastore>/<VM>_<vmnic#>_EtherswitchDispatch.pcap
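For the pre-capture pings in item 1, a timestamped variant makes later correlation with the pcap files straightforward. A minimal sketch, assuming a Linux guest with iputils ping (the file path is a placeholder):

      #ping -D <affected_VM_IP> | tee /<filesystem>/ping_to_affected.txt    (-D prefixes each reply with a Unix timestamp)

When the captures are complete, any pktcap-uw sessions still running in the background on the ESXi host can be stopped with:

      #kill $(lsof | grep pktcap-uw | awk '{print $1}' | sort -u)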

Additional Information