Researching performance issues is time-consuming and often difficult. Determining the cause of a performance issue may be easier than rectifying it; resolution may come only after several incremental changes to the components involved. There is no “one-step resolution” that eliminates all performance issues in a given environment.
Points to remember
- Don't troubleshoot from physical into virtual. Always troubleshoot from virtual out to physical.
- Performance troubleshooting requires evaluating all components involved - ESXi hosts, Edges, VMs, physical switches, routing, etc. VMware by Broadcom only has control of ESXi hosts, Edges, and the VMs on ESXi hosts.
- Just because a given statistic has a positive number does not mean there is a problem. Dropped packets happen in ANY and ALL networks that carry TCP traffic. Drops are expected, and many protocols are designed to handle them. Evidence of dropped packets IS NOT, by itself, evidence of a problem. A large number of drops in a short period of time, or a large percentage of recently transmitted packets being dropped, is indicative of a problem.
- In most cases, VM and Edge VM performance troubleshooting is ESXi host performance troubleshooting. This means many tweaks and adjustments that can be made are executed at the ESXi host level and the improvements are seen at the VM level.
- Sometimes, even after all tweaks, configurations, and improvements are put in place, the workload is just too much for the hardware/software to handle. When that is determined, it is time to add hardware, redesign the workflow, or horizontally scale through the addition of new components such as ESXi hosts, Edge VMs, or a move to Bare Metal Edges.
When working with performance issues, it is important to isolate the symptom. The following monitoring can be put in place to narrow down the issue:
- Monitor performance over layer 2 (VLAN and Geneve) with VMs on the same ESXi host.
- Monitor performance over layer 2 (VLAN and Geneve) with VMs on different ESXi hosts.
- Monitor performance East <> West with VMs on the same ESXi host communicating across a T1 DR.
- Monitor performance East <> West with VMs on the same ESXi host communicating across a T1 SR.
- Monitor performance East <> West with VMs on different ESXi hosts communicating across a T1 DR.
- Monitor performance East <> West with VMs on different ESXi hosts communicating across a T1 SR.
- Monitor performance North <> South with a VM communicating with an external host across the T0/T1 SR(s).
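Each of the scenarios above can be measured the same way, for example with iperf3 between a pair of test VMs. This is a minimal sketch: the IP address, duration, and stream count are illustrative placeholders, not values from this KB, and the command is echoed rather than executed so the sketch is safe to run anywhere.

```shell
# Run one throughput test for a scenario in the matrix above.
# Assumptions: iperf3 is installed in both guest VMs, and the receiving
# VM has already been started with `iperf3 -s`.
SERVER_IP="192.168.10.20"   # placeholder receiver VM address
DURATION=30                 # seconds per run; longer runs smooth out bursts
STREAMS=4                   # parallel streams exercise multiple queues/cores

# Echoed so the sketch runs without iperf3 present; on the sending VM,
# run the printed line directly:
echo "iperf3 -c ${SERVER_IP} -t ${DURATION} -P ${STREAMS}"
```

Repeat the same test for each placement (same host, different hosts, across T1 DR/SR, North/South) and record the results side by side; the scenario whose throughput diverges points at the component to investigate.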
Factors to determine
- Do packets traverse the Distributed Firewall (DFW) for East/West communication?
- Can virtual machines be added to the DFW Exclusion List? Does the performance improve or not change?
- Do packets traverse the gateway firewall for North/South communication? Can this be temporarily disabled?
- If all components are on the same host, is the performance better/the same/worse?
- Is a third party firewall (Palo Alto/Checkpoint) or Service Insertion feature (Trend/McAfee) in use?
NSX GUI Investigation
- View Network Topology (use Traceflow if traffic is using NSX overlay and/or Edges)
- Determine what components (ESXi, NSX, physical) packets traverse.
- Determine if NSX service routers are involved at Tier-1 or Tier-0 or both.
- Edge Firewall
- NAT rules
- Is asymmetric routing involved in the data path?
- Is ECMP enabled at the Tier-0?
- Segments
- Verify Quality of Service profiles on Segments are not impacting/limiting traffic.
- Networking → Segments → Segment Profiles

- The specified bandwidth allots traffic a maximum inbound and outbound throughput, measured in Mbps. For example, with a profile set to 100 Mbps, all VMs on the assigned segments will have a maximum throughput of 100 Mbps. This cap is detectable via iPerf tests from VM to VM (regardless of host) and from physical to VM.
- Ensure Security Profiles do not block traffic in question.
- Use vCenter Performance graphs to investigate individual VM or ESXi host performance.
- Investigate CPU/Memory/Network usage - graphs may show spikes of higher than normal traffic. Investigate these spikes at their respective timestamps.
- Look for a single CPU that may be over 90% utilization.
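To check whether a segment QoS profile is capping traffic, convert an observed transfer into Mbps and compare it to the configured limit. A minimal sketch, assuming a 100 Mbps limit; the transfer numbers are illustrative, not measurements from this KB:

```shell
# Convert bytes moved over a time window into Mbps and compare against
# the configured QoS limit. All numbers below are illustrative.
QOS_LIMIT_MBPS=100
BYTES_TRANSFERRED=375000000   # e.g. as reported by an iPerf run
SECONDS_ELAPSED=30

MBPS=$(( BYTES_TRANSFERRED * 8 / SECONDS_ELAPSED / 1000000 ))
echo "Observed throughput: ${MBPS} Mbps"
if [ "$MBPS" -ge "$QOS_LIMIT_MBPS" ]; then
  echo "Throughput is pinned at the QoS limit - the profile is likely capping traffic"
fi
```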
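The single-CPU check above can be scripted: export per-core utilization (for example from esxtop batch mode) and flag any core above 90%. The inline CSV sample below is illustrative data, not real esxtop output:

```shell
# Flag any CPU core above 90% utilization in a "core,percent-busy" sample.
# The here-document is illustrative; feed real per-core counters instead.
cat <<'EOF' | awk -F, '$2+0 > 90 {print "CPU " $1 " is at " $2 "% - investigate"}'
0,34.2
1,91.7
2,12.9
3,88.0
EOF
```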
ESXi host command line investigation
- ADF Data - The Automated Diagnostic Framework (ADF) is a collection of scripts included with the NSX VIBs. These scripts gather behavior data for a short period while traffic is flowing through the ESXi host. The collection adds no performance overhead, and the results show how busy the host is and how much traffic is passing through it while the script runs.
The ADF collector can be run in interactive mode (holds the prompt until the collection completes) or in daemon mode with the -d option (runs in the background until the collection completes).
- Use the -z option to pack and compress all the samples into a single file at the end of the collection. Using $HOSTNAME as in the example commands includes the ESXi host's hostname in the filename; this is particularly useful when collecting ADF data on multiple ESXi hosts.
- Use the -a option for advanced collection (more data).
- Use the -o option to set the storage location for the samples and the final blob. The directory is created automatically if necessary.
- Use the -i option to set the interval (the sleep time at the end of a sample before the next one starts).
- In interactive mode:
- This is the preferred option when the number of samples collected is important.
- Use the -x option to set the number of samples collected (default: 1). Collecting a couple of samples is recommended for performance analysis.
- Command (to customize):
python /opt/vmware/nsx-common/python/adf/nsx_adf_collect.py -o /<path>/<to>/<storage>/<location> -z <name_of_file.tgz> -a -x <number_of_samples>
- Typical command:
python /opt/vmware/nsx-common/python/adf/nsx_adf_collect.py -o /vmfs/volumes/mydatastore/ADF/ -z "ADF_$HOSTNAME.tgz" -a -x 5
- In daemon mode:
- This is the preferred option when the duration of the analysis is important.
- Use the -t option to set the overall runtime (default: 300). Together with the interval, this determines how many samples are collected.
- Command (to customize):
python /opt/vmware/nsx-common/python/adf/nsx_adf_collect.py -o /<path>/<to>/<storage>/<location> -z <name_of_file.tgz> -a -d -i <time_interval> -t <overall_time>
- Typical command:
python /opt/vmware/nsx-common/python/adf/nsx_adf_collect.py -o /vmfs/volumes/mydatastore/ADF/ -z "ADF_$HOSTNAME.tgz" -a -d
- If multiple runs of the above command are needed for an intermittent performance issue, including a timestamp in the filename with -$(date +"%Y_%m_%d_%I_%M_%p") makes it easy to identify the file that aligns with the time of a reproduction. A typical example of this command:
python /opt/vmware/nsx-common/python/adf/nsx_adf_collect.py -o /vmfs/volumes/mydatastore/ADF/ -z adf_data_"$HOSTNAME"-$(date +"%Y_%m_%d_%I_%M_%p").tar -a -d
- Upload this data to Broadcom Support per instructions in the Additional Information section of this KB.
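The timestamped filename pattern above can be checked on its own before a collection is started; this sketch only builds the name and runs nothing else:

```shell
# Build the timestamped ADF output filename so repeated runs on the
# same host do not overwrite each other.
FILENAME="adf_data_${HOSTNAME}-$(date +"%Y_%m_%d_%I_%M_%p").tar"
echo "$FILENAME"
```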
- Net-Stats - net-stats is a command native to ESXi and is part of the ADF data collection above. However, if the investigation must be run on a host without the NSX VIBs installed, the command below is a quick and easy substitute.
- Command:
net-stats -A -t WwQqihVvh -i 10 > "/tmp/netstats-"$HOSTNAME".txt" 2>/dev/null
- Upload this data to Broadcom Support per instructions in the Additional Information section of this KB.
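For an intermittent issue, the net-stats capture can be repeated with timestamped output files so each sample can be matched to a reproduction window. This is a sketch with an illustrative sample count; the command is echoed so it runs anywhere, and on an ESXi host you would remove the echo and uncomment the sleep:

```shell
# Take several net-stats samples, each into its own timestamped file.
SAMPLES=3
for i in $(seq 1 "$SAMPLES"); do
  OUT="/tmp/netstats-${HOSTNAME}-$(date +%Y_%m_%d_%H_%M_%S).txt"
  # Drop the `echo` on a real host to actually run the capture:
  echo "net-stats -A -t WwQqihVvh -i 10 > ${OUT}"
  # sleep 60   # space the samples out on a real host
done
```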
- The following commands confirm the RSS or DRSS configuration and how many RSS engines are supported on the host where the Edge VM is running. If only a single RSS engine is supported or configured, it can be a source of latency for traffic destined to and originating from the Edge:
- vsish -e get /net/pNics/vmnic(x)/rxqueues/info
- vsish -e get /net/pNics/vmnic0/rxqueues/queues/0/rss/engineInfo
- vsish -e get /net/pNics/vmnic0/rxqueues/queueCount
- esxcli system module parameters list -m <driver-name>
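When reviewing the driver module parameters, filtering for RSS-related entries narrows the output quickly. The sample listing in the here-document is illustrative of the table layout only, not output captured from a real host; on an ESXi host, pipe the esxcli command instead:

```shell
# Filter a driver parameter listing for RSS/DRSS settings.
# On a host: esxcli system module parameters list -m <driver-name> | grep -i rss
cat <<'EOF' | grep -i rss
Name  Type          Value  Description
DRSS  array of int         DefQueue RSS state
RSS   array of int  1      RSS state
VMDQ  array of int         Number of Virtual Machine Device Queues
EOF
```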
NSX Edge command line investigation
NOTE: These commands are for general troubleshooting / data gathering to determine what workload is actually going through the Edge.
- Determine how busy a given datapath CPU is. This command shows a Core number, the TX/RX PPS Count, and % of core utilization.
# get dataplane cpu stats
- Determine how busy a given physical NIC is. This command shows the fp-eth# identifier and the throughput over it during the specified period.
# get dataplane throughput <interval in seconds>
# get dataplane throughput 10
# get dataplane throughput 10 | json    ← prints the command output as an easy-to-read JSON structure
- Determine how much memory is in use for memory dedicated to datapath.
# get dataplane memory stats
- Determine the dataplane performance statistics over a specified period of time.
# get dataplane perfstats <interval>
# get dataplane perfstats 10
- Determine the physical-port statistics of a given fp-eth# device.
# get physical-port fp-eth# stats (verbose)
The same data as above, plus more, can be gathered automatically using the Edge Datapath Stats Collection script for NSX 3.x, 4.0, and 4.1. Upload this data to Broadcom Support per the instructions in the Additional Information section of this KB.
Known Issues