SD-WAN NSD Troubleshooting and Log Analysis
search cancel

SD-WAN NSD Troubleshooting and Log Analysis

book

Article ID: 303430

calendar_today

Updated On:

Products

VMware VeloCloud SD-WAN

Issue/Introduction

Symptoms:
  • NSD VPN is down.
  • Unable to forward/receive the traffic via the Non SD-WAN site.
  • Unable to access internet using the Zscaler.
  • Traffic intermittently unable to pass over the VPN.
  • Some networks can pass over the VPN but others can't.


Environment

VMware VeloCloud SD-WAN

Resolution

Configuration:

Validate the configuration matches on both sides:

  1. VPN type: route-based vs policy-based and Zscaler (route-based backhaul)
  2. Non-VeloCloud site gateway IP - This is peer gateway IP not VCG IP.
  3. IKE version (We only support IKEv1 as of 2.2)
  4. PSK
  5. Encryption Algorithm 
  6. DH group
  7. PFS
  8. Site subnets (non-velocloud site subnets)
  9. NATT enabled from peer
  10. Local Auth ID. (IP or FQDN which the peer matches for authentication, default is VCG private IP)

 

Troubleshooting

  1. Use tcpdump to find out IKE negotiation is going (port 500 and port 4500 or esp) (Refer Inbound & Outbound Ports & Protocols for VMware SD-WAN)
  2. Make sure port 2426, 500, 4500 is open for the firewall. When the Edge and VCG is having public IP then it uses ESP (IP protocol 50)
  3. Use “debug.py --ike” to check out the VPN state.
  4. Use “debug.py --ike_spd” to check out the VPN settings.
  5. Use “debug.py --ike_sa” to check out Phase-1 IKE negotiation state.
  6. Use “debug.py --ike_childsa” to check out Phase-2 IKE negotiation state.
  7. Use “debug.py --ike_setdebuglevel 8 1” to enable ike log for details. (default is 4)
  8. Examine the /var/log/ike.log for the details and errors if the release is older than 3.3.0. After 3.3.0, these logs are part of gwd.log
  9. You may find the corresponding IKE log by IP address or serverInstance ID or Cookie.
  10. To check specific negotiation - check ike_cky (Phase-1 IKE cookie) or SPI (Phase-2)
  11. IPSec log meanings: (Refer [Internal]Non-VeloCloud IPSec Log Meaning)

 

NOTE:-  Edge and Gateway devices can only initiate VPN negotiations; any negotiation packets initiated by the peer VPN endpoint will be dropped with a "DS lookup failed" error, which is expected behavior. During the initial Security Association (SA) setup, the Edge or Gateway will always act as the initiator. If the peer attempts to initiate SA creation using either IKEv1 or IKEv2, those packets will be dropped.

NOTE:- For troubleshooting CSS/NSD issues, Support should collect mocana/quicksec logs & packet captures along with the diag in issue state.

Procedure to Enable/Disable Mocana logs for IKE/IPSec Debugging

If you are troubleshooting NVS issue live, it is better to enable the MOCANA logs to find out any issues related to NVS. MOCANA debug logs provide very detailed logs which is very useful to find any NVS issues.

- Enable the Mocana logs in the Gateway as below and keep it enabled for 3-4 mins. Note the timestamp as well.

 

--ike_setdebuglevel IKE_SETDEBUGLEVEL [IKE_SETDEBUGLEVEL ...]

                       set ike debug level (0-8) [mocana (0/1)]


/opt/vc/bin/debug.py --ike_setdebuglevel 8 1

 

Disable the Mocana logs with the below commands and take the Gateway diagbundle.

 

/opt/vc/bin/debug.py --ike_setdebuglevel 7 0

MOCANA logs will be part of the INFO logs. So we need to use dbgctl -f | grep "MOC" command if we are debugging live.
If we are checking this from the diagbundle, these logs are available in the diagbundle /velocloud/debug folder. This command works in the Linux and if you are using Mac book, use the attached -dbgctl which can be copied to /velocloud/debug folder in the diagbundle.

dbgctl  -m debugmap -l debugall

Note: MOCANA is available on 5.0 and previous releases (4.x and 3.x). 5.1 introduces QUICKSEC instead of MOCANA and the way to enable the debugs and see the logs changes as well.

Procedure to Enable/Disable QUICKSEC logs for IKE/IPSec Debugging


- Enable the QUICKSEC logs in the Gateway as below and keep it enabled for 3-4 mins. Note the timestamp as well.

debug.py --ike_logger_setlevel 5 5 1 SshEventLoop=3,SshADT*=3


Note: Debugs can also be enabled using --ike_logger_setlevel 10 10 1 SshEventLoop=3,SshADT*=3 (notice level changes from 5 5 1 to 10 10 1). This level of debugging is NOT advised to be enabled in a production environment unless explicitly required by the engineering team.

- Disable the QUICKSEC logs with the below commands and take the Gateway diagbundle.

debug.py --ike_logger_setlevel 3 3 0 SshEventLoop=3,SshADT*=3 

QUICKSEC logs are part of DBGCTL as well as gwd.log. You can check the logs by using grep "QUICKSEC".



For checking the NVS VPN status from the Gateway, Follow the procedures as below.

1) From the Diagbundle, use the gwd.log file and grep for VPN state

 
more gwd.log* | grep "VPN state"


2) /opt/vc/bin/debug.py --ike provides the IKE status  as shown below. You can use the NVS name to filter the NVS as below.

vcsupport@vcg81-ustx1:/var/log$ /opt/vc/bin/debug.py --ike | grep -E "Name|165.225.34.44" | grep -E "Name|Zscaler Central"
Name                                                           Source       Destination   Instance ID   Cookie   TD version   State
Name                               Source       Destination   Instance ID   Cookie                   Type     Local VTI IP      Peer VTI IP   State
Zscaler Central                10.6.10.81     165.225.34.44          1219    0x4c3                Zscaler    169.254.4.173    169.254.4.174      UP



3) /opt/vc/bin/debug.py --ike_sa provides the IKE Phase-1 SA status. This includes the SPI values(highlighted in green),this is a pair for both directions. You can use the Cookie value from the Step#2 to grep the Phase-1 SA for the NVS if there are multiple NVS with the same IP address (very common in the case of Zscaler deployment). Note the SPI pair (Ike Spi/Cookie) in this case. 
 

vcsupport@vcg81-ustx1:/var/log$ /opt/vc/bin/debug.py --ike_sa | grep -E "Index|165.225.34.44" | grep -E "Index|0x4c3"
Index     IkeSaId   Cookie   IKE      Flags         Dir          NAT                         Ike Spi/Cookie                 PeerAddr     State    Kbytes          Secs
3632     4ab30a2c    0x4c3    v1   0000188d   initiator        local   {621697753bb73cd7 20ec4e3babedba76}       165.225.34.44[4500]    AGGR_I   757.777   84215/86400
vcsupport@vcg81-ustx1:/var/log$


These SPI ({621697753bb73cd7 20ec4e3babedba76} in this case)can be used for checking the corresponding logs from our side as well as peer device logs, in the cases like where we need to find out who deleted the IKE SAs.

4) We can check the Child SA / IPSec SA using the command below and grep with the Cookie Value as below. In the working condition, we should see two entries one for the outbound direction and one for the inbound direction. If any of the child SAs are missing, It can impact the traffic. Not the SPI value which can be used for checking the Child SA logs. 

vcsupport@vcg81-ustx1:/var/log$ /opt/vc/bin/debug.py --ike_ch | grep -E "Index|165.225.34.44" | grep -E "Index|0x4c3"
Index          Users   Cookie      SpdId    IkeSaId      Flags                  Dir        Spi   PeerPort      Auth     Encr                                            Tunnel // Traffic        Pkts         Kbytes          Secs
18371|18372     0000    0x4c3   80000986   4ab30a2c   00000019   outbound initiator   037a17fa       4500    md5_16     null         10.6.10.81 >> 165.225.34.44 // 0.0.0.0/0 > 0.0.0.0/0    74844642   28264843.942   28755/28800
18372|18371     0001    0x4c3       0985   4ab30a2c   0000001d    inbound initiator   c2dbf105       4500    md5_16     null         10.6.10.81 << 165.225.34.44 // 0.0.0.0/0 < 0.0.0.0/0   106324078   94729801.280   28755/28800
vcsupport@vcg81-ustx1:/var/log$


5) For getting the corresponding ike_spd also, we should use the grep option with the same cookie value as below.

vcsupport@vcg81-ustx1:~$ /opt/vc/bin/debug.py --ike_spd | grep -E "Index|165.225.34.44" | grep -E "Index|0x4c3"
Index    SpdId   Cookie      Flags        Mode   SecuProto   Auth     Encr                             Tunnel                             Traffic
805       0985    0x4c3   0000000d      Tunnel    ESP_NULL    md5     null     10.6.10.81 <<=>> 165.225.34.44             0.0.0.0/0 <-> 0.0.0.0/0
vcsupport@vcg81-ustx1:~$



Case Study

In this particular case(Refer the case#19076108010), customer reported a completed traffic outage to the non-velocloud site - Zscaler. They disabled and enabled the NVS tunnel from the VCO to restore the traffic. We have a diagbundle collected from the Gateway during the problem state. 

Zscaler IP Address - 165.225.34.44
VeloCloud Gateway - vcg81-ustx1- (Internal IP - 10.6.10.81 / External IP - 104.193.29.81 )

From the diagbundle- COMMAND folder, we could check the IKE status by grepping with NVS IP Address and NVS name. Please note there might be chances that the same IP can be used for multiple NVS tunnels. Note the Instance ID and Cookie for further analysis.

ssujith-a02:COMMANDS ssujith$ more optvcbindebugpy--ike.out.txt | grep -E "Name|165.225.34.44"  | grep -E "Name|Zscaler Central"
Name                                                           Source       Destination   Instance ID    Cookie   TD version   State
Name                               Source       Destination   Instance ID    Cookie                   Type     Local VTI IP      Peer VTI IP   State
Zscaler Central                10.6.10.81     165.225.34.44           942     0x3ae                Zscaler    169.254.4.173    169.254.4.174      UP
Name                Source     Destination   Instance ID   Cookie             Type   State
ssujith-a02:COMMANDS ssujith$



Checked the Child SA for this NVS. Normally we should have Child SA for each direction. We need to filter the ChildSA output using the Cookie and IP address.
Please note the SPI value highlighted - 0ba8aabd. In this case, we have only outbound SPI (See the direction highlighted )
 

ssujith-a02:COMMANDS ssujith$ more optvcbindebugpy--ike_childsa.out.txt  | grep -E "Index|165.225.34.44" | grep -E "Index|0x3ae"
Index               Users    Cookie      SpdId    IkeSaId      Flags                  Dir        Spi   PeerPort      Auth     Encr                                        Tunnel // Traffic       Pkts         Kbytes          Secs
354703|354704        0000     0x3ae   8000075c   44e7b758   00000019   outbound initiator   0ba8aabd       4500    md5_16     null     10.6.10.81 >> 165.225.34.44 // 0.0.0.0/0 > 0.0.0.0/0    2405640      240124.46    5241/28800
ssujith-a02:COMMANDS ssujith$



We can use the SPI value(0ba8aabd)  in our GWD logs to find the corresponding Inbound Child SA. In the highlighted logs below, the SPI values are always displayed in pairs. So In this case the inbound SA SPI value is a042010c

ssujith-a02:log ssujith$ more gwd.log* | grep "ba8aabd"
2019-11-13T03:02:28.348 MSG [VPN] vc_ikeStatHdlr:828 VPN state: CHILDSA_SUCCESS IKEv1_I spd(Id=8000075c) spi=(a042010c 0ba8aabd) IDci: 0.0.0.0/255.255.255.255 IDcr: 0.0.0.0/255.255.255.255 version=1347964697 serverInstance=942
2019-11-13T03:02:28.349 MSG [CRYPTO] update_dc_outSA:247 NEWSA for dc cookie=0x3ae cache_idx=0 spdId=0x8000075c childSaId=354703 spi=0ba8aabd
ssujith-a02:log ssujith$


We can use the Outbound/Inbound SPI values to check the logs from the Peer Device in case if we 
couldn't find anything from the logs.


Additional Information

Route Based versus Policy Based VPNs:

We sometimes see issues where NVS tunnels suddenly start having problems after a VCG upgrade, and is often due to mismatched configurations between the peers, either with lifetimes or policy-based versus route-based VPN types.  Route vs Policy based mismatch will also appear as some networks being able to pass the VPN but others can't.

IPsec doesn't check route versus policy based types during tunnel formation, so the tunnel may still establish, but traffic may or may not pass.  This working or not working has some dependency on which side initiates phase2 of the tunnel.  This can be misleading with the tunnel working at some point, but may encounter problems later on.

Route vs Policy based also isn't an explicit configuration that can be set, but is determined by how the VPN is configured to send traffic over the tunnel and affects the number of child SAs that establish.
Route based VPN's allow all traffic over the VPN, and use a route to direct traffic to the VPN.
Policy based VPN's specify what source/destination networks may communicate over the VPN, and sends traffic matching that over the VPN.

Different vendors use different terms for Policy based source/destination network pairs. Cisco calls it "encryption domains", Palo Alto calls them "Proxy IDs", others call them "traffic selectors".


For phase2 of a route based VPN, we'll see a single route selector with both inbound and outbound 0.0.0.0/0 <>0.0.0.0/0 child SAs.
For phase2 of a policy based VPN, we'll see multiple child SAs matching the source/destination networks: 10.15.50.0/24 < 172.16.40.0/22.
This can be seen on the VCG with "debug.py --ike_childsa". 
When there's a mismatch in types, we could only ensure that they don't come up when the peer initiates, we can't determine that they won't come up when we initiate.

When there's a mismatch in route vs policy based VPNs between the peers, we'll typically see only a single child SA form for one source/destination network pair, and only traffic matching that will pass.  In such cases, the child SA might change between source/destination network pairs coming from the side configured as policy based, so different source/destination network pairs will intermittently work or not work.

In any case when there are issues with a VPN, ensure the configuration matches on both sides, even if it was working fine before with mismatched config.

Example of a child SA for policy-based with a traffic selector of 10.255.255.0/24 and 10.1.16.0/20:

debug.py --ike_childsa | grep x.x.x.x(peer ip)
1014     0002   0x144caf    0025   2696675b   00000015    inbound responder   089a7c71       4500   sha1-hmac       aes-128-cbc               10.6.10.15 << x.x.x.x // 10.255.255.0/24 < 10.1.16.0/20       4588       0       bond0    5160/28800
1854     0002   0x144caf    0026   2696675b   00000011   outbound responder   03da86a7       4500   sha1-hmac       aes-128-cbc               10.6.10.15 >> x.x.x.x // 10.1.16.0/20 > 10.255.255.0/24       6906       0       bond0    5160/28800

Example of a child SA for route based:

debug.py --ike_childsa | grep x.x.x.x(peer ip)
46131|46132     0000      0x2   80000004   75d50a4f   00000019   outbound initiator   ff2f97e8       4500   sha1_20   aes_16   10.4.10.63 >> x.x.x.x // 0.0.0.0/0 > 0.0.0.0/0           90281      54169.673     1556/3600
46131|46132     0000      0x2   80000004   75d50a4f   00000019   inbound  initiator   ff2f97e8       4500   sha1_20   aes_16   10.4.10.63 << x.x.x.x // 0.0.0.0/0 < 0.0.0.0/0           80989      54169.673     1556/3600




 

IKEv2 NSD via edge/gateway, compatibility with Cisco ASR/ISR/CSR routers:

These cisco devices share the same IOS and they default to use Certificates for authentication of IKEv2 tunnels.  The VCG does not so the authentication fails.  Mocana logs will show "ERR_KEY_EXCHANGE".

A possible workaround is adding the command “no crypto ikev2 http-url cert” to the Cisco configuration.  This command HTTP cert support; removing this causes the Cisco device to no longer present the certificate authentication option.  Since this is on customer equipment, the customer will need to confirm it's ok to disable and won't disrupt other IKEv2 tunnels on their device.
Ref: VLENG-75950 and VLPR-6049.


Impact/Risks:
The term Non-Velocloud Site (NVS) is a legacy term for Non SD-WAN Destination (NSD), and can be used interchangeably.