Troubleshooting throughput issues on a VMware VeloCloud SD-WAN Edge
Article ID: 323778


Products

VMware VeloCloud SD-WAN

Issue/Introduction

Symptoms:
  • The customer sees low throughput when traffic passes through the Edge, but higher throughput when the same traffic is sent directly over the Internet link, bypassing the Edge.
  • Speed test results are poor when traffic is sent via the Edge.



Environment

VMware VeloCloud SD-WAN Edge / Gateway

Resolution

  1. Before troubleshooting the issue, make sure the Edge model the customer is using supports the expected throughput. The throughput supported by each Edge model is listed in the section "Physical Edge specifications (performance and scale)" at:

         https://docs.vmware.com/en/VMware-SD-WAN/5.2/VMware-SD-WAN-Administration-Guide/GUID-9943A130-CD6C-4653-AB36-6A396EA8C677.html

        
Note: The throughput capability of the Edge changes between releases, so make sure you are looking at the documentation for the release running on your Edge (the release drop-down is at the top of the page).
  2. Verify that the WAN link bandwidths are measured correctly. If any WAN link bandwidth is configured manually, confirm with the customer that the configured value is correct.
  3. Verify the MTU settings on the Edge and make sure there are no IP fragmentation issues.
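         For example, from a Linux host on the LAN side (the destination address here is a placeholder), a Don't Fragment ping can reveal path MTU problems: 1472 bytes of ICMP payload plus 28 bytes of IP/ICMP headers fill a standard 1500-byte Ethernet MTU, so the ping fails instead of fragmenting if the path MTU is smaller:

                       ping -M do -s 1472 <destination>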
  4. Check that the traffic is using a routed port on the LAN side. You may not get full throughput when using switched ports. (A routed port with a VLAN tag can provide the same functional benefits as a trunk port, but with higher throughput.)
  5. If the throughput issue is between two Edges (e.g., between Edge-1 and Edge-2), the throughput can be measured using the iperf utility, which is available on the Edge by default. iperf is a simple network diagnostic tool, also available for Linux and Windows endpoints, that uses a client/server model: traffic is initiated from the client and traverses the network (LAN and/or WAN) to the server. Because each run measures only one direction, the test needs to be run twice, once in each direction (see the reverse-direction example after the output below).
  •  Run iperf on Edge-1 in server mode as shown below, where 192.0.2.1 is the management IP address of Edge-1:
                       velocloud Edge-1~# iperf -B 192.0.2.1 -s
  •  Run iperf on Edge-2 in client mode as shown below, where 192.0.2.2 is the management IP address of Edge-2 and 192.0.2.1 is the management IP address of Edge-1:
                       velocloud Edge-2~# iperf -B 192.0.2.2 -c 192.0.2.1
------------------------------------------------------------
Client connecting to 192.0.2.1, TCP port 5001
Binding to local address 192.0.2.2
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[  3] local 192.0.2.2 port 443 connected with 192.0.2.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   184 MBytes   154 Mbits/sec
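  •  Because iperf measures throughput from client to server only, repeat the test in the reverse direction by swapping the roles (a sketch reusing the same example management addresses):
                       velocloud Edge-2~# iperf -B 192.0.2.2 -s
                       velocloud Edge-1~# iperf -B 192.0.2.1 -c 192.0.2.2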
      • If the customer reports throughput issues only for certain applications, check the ifstat counters on the LAN and WAN interfaces of both Edges, as shown below.

                           Source Edge -> # ifstat -bi <LAN interface>,<WAN interface 1>,<WAN interface 2>

                                      Example: # ifstat -bi ge2,ge4,ge5

                           Destination Edge -> # ifstat -bi <LAN interface>,<WAN interface 1>,<WAN interface 2>

                                      Example: # ifstat -bi ge2,ge4,ge5

 
  6. If the throughput issue is to an Internet destination, verify with the customer whether they can install iperf. Compare speed test results for both the direct path and multipath configurations by adjusting the business policy.
    • If the throughput issue is observed only on multipath, verify the following:
      • Any performance issues on the Gateway.
      • Handoff queue drops.
      • dispcnt output, to see whether any drops are observed on the Gateway.
      • That the Edge geolocation is correct, so the Edge picks the nearest Gateway.
      • Whether the ISP is throttling UDP. In that case a UDP-based test sent direct will also be affected, but a TCP-based test will not; if it is UDP throttling, the solution is to ask the ISP to disable it.
      • Also refer to step 7 (overlay rate limit) and step 11 (QoS statistics).
    • Note: Currently, the iperf test cannot be run from the Gateway. An alternative is to spin up an iperf server in AWS in the nearest region and run the iperf test against it to verify throughput, as shown in the example below. If the throughput results are low, check the feasibility of moving the primary Gateway to the nearest Gateway and verify the results again.
    • If the throughput issue is also observed on direct traffic, the problem may be with the WAN link itself. Check whether the customer has verified the WAN link throughput.
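    • For example (a sketch; 198.51.100.10 is a placeholder for the customer's own iperf server, e.g. an AWS instance in the nearest region), run a timed test from a host behind the Edge, once with a business policy steering the flow direct and once with multipath, and compare the results:

                       iperf -c 198.51.100.10 -t 30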
  7. If the throughput issue is reported only for multipath traffic, check whether an overlay rate limit is configured under Business Policy > Additional Settings. If it is, the limit needs to be higher than the expected bandwidth.


 
8.  If the throughput issue only affects TCP:
 
  1. Test with multiple streams. When testing with a single TCP stream, throughput is often throttled by TCP windowing due to sporadic packet loss. With iperf you can test this easily using the -P 10 option for 10 parallel streams (at least 10 is recommended).
  2. If that is not the cause, there is a known bug (a race condition) that can cause GRO to be enabled on pre-3.4.0 releases. To verify, check ethtool -k <int> and see whether generic-receive-offload is on. If it is, you are hitting that problem; see VLPR-3433 and VLENG-39424.
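     For example (a sketch reusing the earlier management addresses; ge3 is an illustrative interface name):

                       velocloud Edge-2~# iperf -B 192.0.2.2 -c 192.0.2.1 -P 10
                       ethtool -k ge3 | grep generic-receive-offload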
 
9.  Check whether Underlay Accounting is enabled on the routed interfaces. This has caused issues in older releases, and even in newer releases in specific situations. It can also rate-limit traffic while enabled, which is expected behavior. Disabling it does not cause a service restart, so it is worth disabling as a troubleshooting step.

10.  Check that the routed interface is running DPDK by ensuring it is listed in the output of "debug.py --dpdk_ports".

11.  If the slowness is on the overlay between Edges or between an Edge and a VCG, gather the qos_net and qos_link output; this shows information such as drops per CoS type and the bandwidth cap:
debug.py --qos_net gateway all stats   (from the Edge CLI)
debug.py --qos_link local stats   (from the Edge CLI)
debug.py --qos_net <peer-Edge-logicalID> all stats   (from the VCG or VCE)