Troubleshooting NSX Edge and Virtual Machine (VM) Performance
Article ID: 380660
Updated On: 12-19-2024
Products
VMware NSX, VMware vSphere ESXi
Issue/Introduction
When troubleshooting NSX Edge/VM performance, a specific set of data must be gathered at the time of the event. This article details what documentation is required and how to gather it prior to opening a support request with Broadcom.
Researching performance issues is time-consuming and often difficult. Determining the cause of a performance issue may be easier than rectifying it. Resolution may only come after several incremental changes have been made to the components involved. There is no “one-step resolution” that eliminates all performance issues in any given environment.
Points to remember
Don't troubleshoot from physical into virtual. Always troubleshoot from virtual out to physical.
Performance troubleshooting requires evaluating all components involved - ESXi hosts, Edges, VMs, physical switches, routing, etc. VMware by Broadcom only has control of ESXi hosts, Edges, and the VMs on ESXi hosts.
Just because a given statistic has a positive number does not mean there is a problem. Dropped packets happen in ANY and ALL networks that carry TCP traffic. Drops are expected and many protocols are designed to handle them. Evidence of dropped packets IS NOT evidence of a problem. Evidence of a large number of dropped packets in a short period of time or evidence of a large percentage of recently transmitted packets getting dropped is indicative of a problem.
In most cases, VM and Edge VM performance troubleshooting is ESXi host performance troubleshooting. This means many tweaks and adjustments that can be made are executed at the ESXi host level and the improvements are seen at the VM level.
Sometimes, even after all tweaks, configurations, and improvements are put in place, the workload is just too much for the hardware/software to handle. When that is determined, it is time to add hardware, redesign the workflow, or horizontally scale through the addition of new components such as ESXi hosts, Edge VMs, or a move to Bare Metal Edges.
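The point above about dropped packets can be made concrete by looking at drops as a share of recent traffic rather than as a raw counter. The following is a minimal sketch, assuming two counter samples taken a known interval apart; the function name, the counter values, and the idea of dividing dropped packets by transmitted packets are illustrative assumptions, not an NSX or ESXi API.

```python
def drop_percentage(tx_before, drops_before, tx_after, drops_after):
    """Return drops as a percentage of packets sent between two counter samples."""
    sent = tx_after - tx_before
    dropped = drops_after - drops_before
    if sent <= 0:
        # No traffic (or a counter reset) between samples: nothing to compare.
        return 0.0
    return 100.0 * dropped / sent


# Hypothetical counter samples taken 10 seconds apart.
pct = drop_percentage(tx_before=1_000_000, drops_before=120,
                      tx_after=1_500_000, drops_after=170)
print(f"{pct:.3f}% of recently transmitted packets were dropped")
```

A lifetime counter of 170 drops looks alarming in isolation, but in this sample only 50 of the last 500,000 packets were dropped. What percentage counts as a problem depends on the workload and is not defined by this article.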
When working with performance issues, it is important to isolate the symptom. The following monitoring can be put in place to narrow down the issue:
Monitor performance over layer 2 (VLAN and Geneve) with VMs on the same ESXi host.
Monitor performance over layer 2 (VLAN and Geneve) with VMs on different ESXi hosts.
Monitor performance East <> West with VMs on the same ESXi host communicating across a T1 DR.
Monitor performance East <> West with VMs on the same ESXi host communicating across a T1 SR.
Monitor performance East <> West with VMs on different ESXi hosts communicating across a T1 DR.
Monitor performance East <> West with VMs on different ESXi hosts communicating across a T1 SR.
Monitor performance North <> South with a VM communicating with an external host through T0/T1 SR(s).
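One way to use the scenarios above is to record a throughput measurement for each and look for the first one that falls well below a known-good baseline. The sketch below assumes throughput has already been measured in Mbps by some tool. The scenario labels, the `first_degraded` helper, the 80% tolerance, and all numbers are hypothetical.

```python
# Scenarios in the order listed above, from the shortest data path outward.
SCENARIOS = [
    "L2, same host",
    "L2, different hosts",
    "East-West via T1 DR, same host",
    "East-West via T1 SR, same host",
    "East-West via T1 DR, different hosts",
    "East-West via T1 SR, different hosts",
    "North-South via T0/T1 SR",
]


def first_degraded(results_mbps, baseline_mbps, tolerance=0.8):
    """Return the first scenario measuring below tolerance * baseline, or None."""
    for name in SCENARIOS:
        measured = results_mbps.get(name)
        if measured is not None and measured < tolerance * baseline_mbps:
            return name
    return None


# Hypothetical measurements in Mbps.
results = {
    "L2, same host": 9400,
    "L2, different hosts": 9100,
    "East-West via T1 DR, same host": 9300,
    "East-West via T1 SR, same host": 4200,
    "East-West via T1 DR, different hosts": 9000,
    "East-West via T1 SR, different hosts": 4100,
    "North-South via T0/T1 SR": 3900,
}
print(first_degraded(results, baseline_mbps=9000))
```

In this made-up data set, throughput first degrades when a T1 SR enters the path, which would point the investigation at the Edge hosting that SR rather than at the ESXi hosts or the physical network.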
Factors to determine
Do packets traverse the Distributed Firewall (DFW) for East/West communication?
Can virtual machines be added to the DFW Exclusion List? Does the performance improve or not change?
Do packets traverse the gateway firewall for North/South communication? Can this be temporarily disabled?
If all components are on the same host, is the performance better/the same/worse?
Is a third party firewall (Palo Alto/Checkpoint) or Service Insertion feature (Trend/McAfee) in use?
NSX GUI Investigation
View Network Topology (use Traceflow if traffic is using NSX overlay and/or Edges)
Determine what components (ESXi, NSX, physical) packets traverse.
Determine if NSX service routers are involved at Tier-1 or Tier-0 or both.
Edge Firewall
NAT rules
Is asymmetric routing involved in the data path?
Is ECMP enabled at the Tier-0?
Segments
Verify Quality of Service profiles on Segments are not impacting/limiting traffic.
Networking → Segments → Segment Profiles
The specified bandwidth caps inbound and outbound throughput, measured in Mbps. For example, with a limit of 100 Mbps, all VMs on the assigned segments have a maximum throughput of 100 Mbps. The cap is detectable with iPerf tests from VM to VM, regardless of host, and from physical to VM.
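A quick way to check whether a QoS profile is the limiting factor is to compare an iPerf result against the configured limit. The sketch below assumes a TCP test run with iperf3 and its JSON output option (`-J`), which reports receiver throughput under `end.sum_received.bits_per_second`. The `capped_by_qos` helper, the 10% margin, and the sample report are illustrative assumptions.

```python
import json


def capped_by_qos(iperf_json_text, cap_mbps, margin=0.10):
    """Return (measured_mbps, looks_capped) for an iperf3 TCP JSON report."""
    report = json.loads(iperf_json_text)
    bps = report["end"]["sum_received"]["bits_per_second"]
    measured_mbps = bps / 1_000_000
    # Throughput that plateaus within the margin of the cap suggests the
    # QoS profile, not the host or network, is the limiting factor.
    looks_capped = abs(measured_mbps - cap_mbps) <= margin * cap_mbps
    return measured_mbps, looks_capped


# Hypothetical, heavily trimmed iperf3 report.
sample = json.dumps({"end": {"sum_received": {"bits_per_second": 96_500_000}}})
print(capped_by_qos(sample, cap_mbps=100))
```

If the measured value sits just under the configured limit across repeated tests and different host pairs, reviewing the segment's QoS profile is a reasonable next step.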
Ensure Security Profiles do not block traffic in question.
Use vCenter Performance graphs to investigate individual VM or ESXi host performance.
Investigate CPU/Memory/Network usage - graphs may show spikes of higher than normal traffic. Investigate these spikes at their respective timestamps.
Look for a single CPU that may be over 90% utilization
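A single saturated CPU is easy to miss when only the average is reviewed. The sketch below assumes per-CPU utilization samples have been exported with timestamps. The data layout, the `find_hot_cpus` helper, and the sample values are hypothetical, and the 90% threshold is taken from the guidance above.

```python
def find_hot_cpus(samples, threshold=90.0):
    """Return (timestamp, cpu_index, percent) for every reading above threshold."""
    hot = []
    for timestamp, per_cpu in samples:
        for cpu_index, percent in enumerate(per_cpu):
            if percent > threshold:
                hot.append((timestamp, cpu_index, percent))
    return hot


# Hypothetical per-CPU utilization samples (percent), one list per timestamp.
samples = [
    ("10:00:00", [35.0, 41.2, 38.9, 40.1]),
    ("10:00:20", [36.4, 97.5, 37.0, 39.8]),
]
print(find_hot_cpus(samples))
```

In the second sample the average across the four CPUs is roughly 53%, yet CPU 1 is at 97.5%. The returned timestamp is the point at which to correlate network and memory graphs.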
ESXi host command line investigation
ADF Data - The Automated Diagnostic Framework is a collection of scripts included with the NSX VIBs. These scripts gather behavior data for a short period of time while traffic is flowing through the ESXi host. The collection does not add any performance overhead, and it shows how busy the ESXi host is and how much traffic is passing through it while the script runs.
Example command: python nsx_adf_collect.py -i 10 -z esxi_host_1.tar -a -o /vmfs/volumes/datastorename/
Copy and paste to auto-generate hostname: python nsx_adf_collect.py -i 10 -z adf_data_"$HOSTNAME".tar -a -o /vmfs/volumes/datastorename/
Upload this data to Broadcom Support per instructions in the Additional Information section of this KB.
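When data must be collected from several hosts, it can help to generate the collection command per host so each archive name is unique. The sketch below only builds and prints the command string and does not run it. It reuses the script path and the flags shown in this article. The `build_adf_command` helper is hypothetical, and `datastorename` remains a placeholder to be replaced with a real datastore.

```python
import shlex
import socket

# Script path as given elsewhere in this article.
ADF_SCRIPT = "/opt/vmware/nsx-common/python/adf/nsx_adf_collect.py"


def build_adf_command(output_dir, hostname=None, interval=10):
    """Build the ADF collection command with a per-host archive name."""
    host = hostname or socket.gethostname()
    args = ["python", ADF_SCRIPT, "-i", str(interval),
            "-z", f"adf_data_{host}.tar", "-a", "-o", output_dir]
    # Quote each argument so unusual hostnames or paths stay shell-safe.
    return " ".join(shlex.quote(a) for a in args)


print(build_adf_command("/vmfs/volumes/datastorename/", hostname="esxi_host_1"))
```

Omitting the `hostname` argument falls back to the name of the machine the script runs on, which mirrors the `$HOSTNAME` substitution in the copy-and-paste command above.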
Net-Stats - net-stats is a command native to ESXi and is part of the ADF data collection above. However, should the investigation need to be executed on a host without NSX VIBs installed, the below command is a quick and easy substitute.
Command: net-stats -A -t WwQqihVvh -i 10 > "/tmp/netstats-"$HOSTNAME".txt" 2>/dev/null
Upload this data to Broadcom Support per instructions in the Additional Information section of this KB.
NSX Edge command line investigation
NOTE: These commands are for general troubleshooting / data gathering to determine what workload is actually going through the Edge.
Determine how busy a given datapath CPU is. This command shows a Core number, the TX/RX PPS Count, and % of core utilization.
# get dataplane cpu stats
Determine how busy a given physical nic is. This command shows the fp-eth# identifier and the throughput going over it over the period of time specified.
# get dataplane throughput <interval in seconds>
# get dataplane throughput 10
# get dataplane throughput 10 | json ← prints the output of the command in easy-to-read JSON format
Determine how much memory is in use for memory dedicated to datapath.
# get dataplane memory stats
Determine the dataplane performance statistics over a specified period of time.
# get dataplane perfstats <interval>
# get dataplane perfstats 10
Determine the physical-port statistics of a given fp-eth# device.
If contacting Broadcom Support about this issue, please provide the following:
Components Involved
Source:
VM name
VM IP address
ESXi host
Command from the ESXi host: # python /opt/vmware/nsx-common/python/adf/nsx_adf_collect.py -i 10 -z <nameoftarfile.tar> -a -o /<path>/<to>/<storage>/<location>
Provide the .tar file that gets created when this command completes (this takes between 3 and 5 minutes)
Full log bundle
ADF/Performance data
Destination:
VM/Physical name
IP address
ESXi host
Command from the ESXi host: # python /opt/vmware/nsx-common/python/adf/nsx_adf_collect.py -i 10 -z <nameoftarfile.tar> -a -o /<path>/<to>/<storage>/<location>
Provide the .tar file that gets created when this command completes (this takes between 3 and 5 minutes)
Full log bundle
ADF/Performance data
If NSX Edges are involved:
Virtual Edges
Command from the ESXi host supporting the Edge(s): # python /opt/vmware/nsx-common/python/adf/nsx_adf_collect.py -i 10 -z <nameoftarfile.tar> -a -o /<path>/<to>/<storage>/<location>
Provide the .tar file that gets created when this command completes (this takes between 3 and 5 minutes)
NSX Edge log bundle for all Edges involved
ESXi host log bundle for the hosts supporting the Edges
ADF Performance Data
Physical/Bare Metal Edges
NSX Edge log bundle for all Edges involved
Circumstances - Provide the answers to these questions with specific details where possible.
When did the issue begin?
Has the issue been experienced before this event?
Is the performance impact constant or occurring only at certain times?
Is the environment new or has it been in production/working for some time?
Has anything been done prior to the performance event, such as upgrades, maintenance tasks, workload additions, failovers, etc.?
How is the performance impact measured? What applications, tools, or behaviors identify the impact?
Are all virtual machines in an environment impacted? Only a few?
What type of workload do the impacted VMs carry? DB? File server? Web traffic? Etc.
Is there any pattern visible in impacted systems? Only one OS affected? Only a specific subnet? Only a specific datapath?
Is a similar environment available where this symptom is currently not present?
Handling Log Bundles for offline review with Broadcom support