VMware SD-WAN Cloud Security Service IPsec Troubleshooting

Article ID: 320550

Products

VMware SD-WAN by VeloCloud

Issue/Introduction

The Cloud Security Service (CSS) feature allows GRE and IPsec tunnels to be established directly from the VMware SD-WAN Edge to a CSS Point-of-Presence (PoP)/Server, instead of through the VMware SD-WAN Gateway.

This article focuses on troubleshooting IPsec CSS tunnels from the Edge. Configuring a CSS can allow for more throughput (more tunnels), less latency (by selecting closer PoPs/Servers), and more flexibility across the enterprise.


Symptoms:

A user may report one of the following behaviors while using a Cloud Security Service (CSS) with the VMware SD-WAN solution:

  • User reports that the Cloud Security Service (CSS) Site Tunnels are down, i.e. traffic is not passing through the CSS.
  • CSS tunnels display as "DOWN" or "PENDING" on the VMware SD-WAN Orchestrator.
  • Slow throughput speeds are observed when using CSS tunnels.
  • Users report "Zscaler is not working," meaning that traffic destined for a Zscaler Point-of-Presence is not reaching it because the tunnel is down.
  • CSS tunnels are flapping, i.e. tunnels periodically shift from UP to DOWN and back to UP in quick succession.
  • Orchestrator (VCO) events show "Edge Direct IPsec tunnel down", even though the tunnel is up and passing traffic.
  • Monitor->Network Services displays "DOWN Status" for the CSS.

 


Environment

VMware SD-WAN by VeloCloud

Resolution

I. Cloud Security Services (CSS) Tunnels are down

1. On the Orchestrator UI, confirm that all settings for the Edge and Profile are correct, including IKE version, credentials, and encryption.

For an IPsec tunnel to form, these settings must match on both sides (the CSS provider and the Orchestrator), and the correct CSS provider must be selected. IKE (Internet Key Exchange) negotiates these settings when the tunnel is established. If any setting is mismatched (encryption, hash, etc.), the tunnel will not form.
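When comparing the two sides by hand, it can help to write both sets of settings down and diff them. The following is a minimal sketch of that comparison. The setting names and values are hypothetical placeholders, not fields exported by the Orchestrator or by any CSS provider.

```python
# Hypothetical setting names and values, for illustration only.
orchestrator_side = {
    "ike_version": "IKEv2",
    "encryption": "AES-256",
    "hash": "SHA-256",
    "dh_group": 14,
}
provider_side = {
    "ike_version": "ikev2",
    "encryption": "AES-128",
    "hash": "SHA-256",
    "dh_group": 14,
}


def _normalize(value):
    """Compare strings case-insensitively and ignore surrounding whitespace."""
    return value.strip().lower() if isinstance(value, str) else value


def find_mismatches(orchestrator, provider):
    """Return {setting: (orchestrator_value, provider_value)} for each differing setting."""
    mismatches = {}
    for key in sorted(set(orchestrator) | set(provider)):
        left = orchestrator.get(key)   # None if the setting is missing on this side
        right = provider.get(key)
        if _normalize(left) != _normalize(right):
            mismatches[key] = (left, right)
    return mismatches


print(find_mismatches(orchestrator_side, provider_side))
```

With the sample values above, only the encryption setting is reported, which is the kind of mismatch that prevents the tunnel from forming.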


 

2. The Edge may have reached its tunnel limit. This often occurs when the Dynamic Edge-to-Edge (E2E) feature is enabled. Ensure that the tunnel count remains under the limit for the Edge model in question. You can view the current tunnel count on the "System" tab of the Monitor->Edge Overview page of the Orchestrator.

Tunnel capacity varies by Edge model, and capacity can differ if the customer is using an older 2.x version of the Edge software.

Please consult the VMware SD-WAN Edge Platform Specifications for tunnel capacities for your Edge model.  Note: Edge 500 is not listed but has the same capacity as an Edge 520.
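The tunnel count check in this step is simple arithmetic, sketched below. The limit must be taken from the Edge Platform Specifications for your model; the numbers in the example and the 90% warning level are placeholders chosen for illustration, not published values.

```python
def tunnel_headroom(current_tunnels, model_limit):
    """Return how many more tunnels the Edge can form before reaching its limit."""
    if model_limit <= 0:
        raise ValueError("model_limit must be a positive number from the platform specifications")
    return model_limit - current_tunnels


def near_limit(current_tunnels, model_limit, warn_percent=90):
    """True when the tunnel count is at or above warn_percent of the limit (assumed threshold)."""
    return current_tunnels * 100 >= model_limit * warn_percent


# Placeholder numbers only; read the real limit from the specifications.
print(tunnel_headroom(45, 50), near_limit(45, 50))
```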
 

3. If the CSS has its PoPs/Servers configured with a hostname rather than an IP address, try an IP address instead. Doing so may bring the tunnel back into a working state.

Note: The Symantec CSS requires the configuration of two FQDN/PSK pairs for each WAN link.
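To find an IP address to try in place of a hostname in step 3, the hostname can be resolved from any workstation. The sketch below uses only the Python standard library. It runs on the workstation, not on the Edge, so the addresses returned may differ from what the Edge itself resolves; the hostname in the comment is a placeholder.

```python
import ipaddress
import socket
from typing import List


def is_ip_literal(endpoint: str) -> bool:
    """True if the configured endpoint is already an IP address rather than a hostname."""
    try:
        ipaddress.ip_address(endpoint)
        return True
    except ValueError:
        return False


def resolve_ipv4(hostname: str) -> List[str]:
    """Return the sorted, de-duplicated IPv4 addresses that the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


# Example (placeholder hostname; substitute the PoP hostname from your CSS configuration):
# if not is_ip_literal("pop.example.com"):
#     print(resolve_ipv4("pop.example.com"))
```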

4. If you have verified the configuration and the tunnel count is under the Edge model limit, collect the following information and raise a case with support:

a. Were the tunnels previously working, or is this a new configuration?

b. The timestamp when the tunnel went down.

c. Is the tunnel flapping, or does it remain down?

d. Is there any difference in behavior when using a hostname versus an IP address for the CSS PoP/Server configuration?
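Items b and c can be answered from a list of tunnel state changes, such as one transcribed from the Orchestrator events page. The following is a rough sketch. The event format, the 10-minute window, and the threshold of three transitions are assumptions made for illustration, not definitions used by the product.

```python
from datetime import datetime, timedelta


def count_transitions(events):
    """Count state changes in a time-sorted list of (timestamp, state) tuples."""
    return sum(1 for prev, cur in zip(events, events[1:]) if prev[1] != cur[1])


def first_down(events):
    """Return the timestamp of the first DOWN event, or None if there is none."""
    return next((ts for ts, state in events if state == "DOWN"), None)


def classify(events, window=timedelta(minutes=10), flap_threshold=3):
    """Return 'FLAPPING' if the tunnel changed state often recently, else its latest state."""
    if not events:
        return "UNKNOWN"
    latest_time, latest_state = events[-1]
    recent = [e for e in events if latest_time - e[0] <= window]
    if count_transitions(recent) >= flap_threshold:
        return "FLAPPING"
    return latest_state


# Hypothetical event timeline.
flapping = [
    (datetime(2024, 1, 1, 10, 0), "UP"),
    (datetime(2024, 1, 1, 10, 2), "DOWN"),
    (datetime(2024, 1, 1, 10, 3), "UP"),
    (datetime(2024, 1, 1, 10, 5), "DOWN"),
]
print(classify(flapping), first_down(flapping))
```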
 

II. Performance degradation for traffic using a Cloud Security Service

A user may observe that throughput through a CSS is lower than expected.

By design, the Edge acts as a pass-through device for the CSS. To rule out the possibility that the Edge is the source of the lower traffic throughput, we must first verify if the Edge is passing traffic correctly.

When using IPsec, encryption and decryption add packet overhead. Speeds may be 5-10% lower as a result.

Note: For more information on tunnel overhead please consult Tunnel Overhead and MTU documentation.

For Zscaler: There is a soft limit of 250 Mbps per tunnel.
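The two figures above can be turned into a quick sanity check before testing. This sketch assumes the 5-10% reduction applies to the throughput measured without IPsec and that the Zscaler figure is 250 Mbps per tunnel; the function names are illustrative.

```python
ZSCALER_SOFT_LIMIT_MBPS = 250  # per-tunnel soft limit quoted in this article


def expected_ipsec_range(clear_mbps, min_loss_pct=5, max_loss_pct=10):
    """Return (low, high) expected throughput in Mbps after IPsec overhead."""
    low = clear_mbps * (100 - max_loss_pct) / 100
    high = clear_mbps * (100 - min_loss_pct) / 100
    return low, high


def exceeds_soft_limit(observed_mbps, limit_mbps=ZSCALER_SOFT_LIMIT_MBPS):
    """True when observed tunnel throughput is above the soft limit."""
    return observed_mbps > limit_mbps


print(expected_ipsec_range(200))   # (180.0, 190.0)
print(exceeds_soft_limit(260))     # True
```

A result inside the expected range is consistent with normal IPsec overhead; a result well below it is a reason to continue with the steps that follow.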

1. The first step in establishing whether the Edge is the source of throughput performance issues is to conduct speed tests through each of the following paths:

a. Direct (No Gateway used.)
b. Multi-path (Gateway used.)
c. CSS backhaul 
 

2. If the speed test results through Direct and Multi-path are similar and at the expected rate, this typically indicates that there is no problem with how the Edge is processing traffic.
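The comparison in step 2 can be sketched as follows. The 10% tolerance used for "similar" and "at the expected rate" is an assumption for illustration; choose a value that suits your circuits.

```python
def within(measured, reference, tolerance_pct):
    """True if measured is within tolerance_pct percent of reference."""
    return abs(measured - reference) * 100 <= reference * tolerance_pct


def edge_processing_looks_normal(direct_mbps, multipath_mbps, expected_mbps, tolerance_pct=10):
    """Apply step 2: Direct and Multi-path results are similar and at the expected rate."""
    similar = within(multipath_mbps, direct_mbps, tolerance_pct)
    at_expected = (within(direct_mbps, expected_mbps, tolerance_pct)
                   and within(multipath_mbps, expected_mbps, tolerance_pct))
    return similar and at_expected


# Hypothetical speed test results in Mbps.
print(edge_processing_looks_normal(95, 92, 100))  # True
print(edge_processing_looks_normal(95, 60, 100))  # False
```

If this check passes while the CSS backhaul result is much lower, attention shifts to the tunnel and the CSS path rather than the Edge.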
 

3. Check the throughput of the Edge to see whether the tunnel may be over-utilized. You can monitor this on the Orchestrator using Monitor->Transport and selecting "Average Throughput".
 

4. Check the following to ensure there are no issues on the Edge itself:

a. CPU, memory, tunnel count (within Edge specifications), and average throughput. You can view these on the Monitor->Edge page under the "System" tab in Release 3.3.x and higher.

b. Check interface status to ensure there is not an excessive number of drops, collisions, etc. This may be done using the Remote Diagnostic "Interface Status" on the Orchestrator.

c. Check handoff queue drops when slow speeds are encountered to ensure these are not increasing. This may be viewed on the Monitor->Edge page under the "System" tab for Release 3.3.x and higher.

d. Take packet captures on the LAN interface while routing traffic through the CSS and check for retransmits. Retransmits, together with a TCP window that is not scaling up, likely indicate loss along the path.

e. If possible, run an iPerf test to an available iPerf server in the same data center as the CSS endpoint. You can also run an iPerf test to a public destination across the CSS service and compare TCP and UDP results.
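If the iPerf test in step e is run with iperf3 and its JSON output option (-J), the throughput and retransmit counts can be pulled out of the result for comparison between runs. This is a sketch under that assumption. The field names reflect the layout of iperf3's TCP summary as commonly seen and should be checked against your own output; the sample below is hand-written, not real test data.

```python
import json


def summarize_iperf3(json_text):
    """Extract sender/receiver throughput (Mbps) and retransmits from iperf3 -J TCP output."""
    end = json.loads(json_text)["end"]
    sent = end.get("sum_sent", {})
    received = end.get("sum_received", {})
    return {
        "sent_mbps": sent.get("bits_per_second", 0) / 1e6,
        "received_mbps": received.get("bits_per_second", 0) / 1e6,
        # May be None if the sender platform does not report retransmits.
        "retransmits": sent.get("retransmits"),
    }


# Hand-written sample containing only the fields used above.
sample = (
    '{"end": {"sum_sent": {"bits_per_second": 94000000.0, "retransmits": 12},'
    ' "sum_received": {"bits_per_second": 93500000.0}}}'
)
print(summarize_iperf3(sample))
```

A retransmit count that grows between runs while throughput falls points toward loss along the path, which matches the packet capture check in step d.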
 

5. If you are still unable to achieve the desired performance, engage with your CSS provider and:

a. Attempt to change nodes/endpoints.

b. Ensure no issues are being experienced on their nodes/endpoints.

c. Verify the amount of traffic being received.

d. Ensure the traffic is not surpassing the 250 Mbps limit.

 

6. If loss can be confirmed using one of the above methods, the ISP should be engaged as well to check for any issues along the path.


 

III. Orchestrator events show "Edge Direct IPsec tunnel down", even though the tunnel is up and passing traffic

The Edge attempts to use for CSS any enabled, connected L3 interface that has WAN Overlay enabled, as long as it has a public IP address. However, an L3 interface that is enabled, connected, and has both WAN Overlay and DHCP enabled is also considered for CSS even when it has not yet received an IP address. As a result, leaving an L3 interface connected and enabled with default settings causes the Edge to try to use it for CSS, which can result in this event.

If WAN Overlay is then disabled on the interface, a service restart is required before the Edge stops using that interface for CSS.
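The selection behavior described above can be modeled to spot interfaces likely to trigger the event. This is an illustrative model built from the description in this article, not the Edge's actual implementation; the class, field names, and interface names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class L3Interface:
    name: str
    enabled: bool
    connected: bool
    wan_overlay: bool
    ip_address: Optional[str] = None  # None models DHCP that has not yet assigned an address


def css_candidates(interfaces: List[L3Interface]) -> List[L3Interface]:
    """Interfaces the Edge would try for CSS: enabled, connected, WAN Overlay on."""
    return [i for i in interfaces if i.enabled and i.connected and i.wan_overlay]


def likely_false_down_sources(interfaces: List[L3Interface]) -> List[str]:
    """Names of candidate interfaces with no IP address, which can produce the tunnel down event."""
    return [i.name for i in css_candidates(interfaces) if i.ip_address is None]


# Hypothetical interface inventory.
inventory = [
    L3Interface("GE3", enabled=True, connected=True, wan_overlay=True, ip_address="203.0.113.5"),
    L3Interface("GE4", enabled=True, connected=True, wan_overlay=True),   # waiting on DHCP
    L3Interface("GE5", enabled=True, connected=True, wan_overlay=False),  # overlay disabled
]
print(likely_false_down_sources(inventory))  # ['GE4']
```

Interfaces flagged this way are candidates for having WAN Overlay disabled, followed by the service restart noted above.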