NOTE: Edge and Gateway devices can only initiate VPN negotiations; any negotiation packets initiated by the peer VPN endpoint will be dropped with a "DS lookup failed" error, which is expected behavior. During the initial Security Association (SA) setup, the Edge or Gateway will always act as the initiator. If the peer attempts to initiate SA creation using either IKEv1 or IKEv2, those packets will be dropped.
NOTE: For troubleshooting CSS/NSD issues, Support should collect MOCANA/QUICKSEC logs and packet captures, along with a diagbundle taken in the issue state.
Procedure to Enable/Disable Mocana logs for IKE/IPSec Debugging
If you are troubleshooting an NVS issue live, it is best to enable the MOCANA logs; their detailed debug output is very useful for finding NVS-related issues.
- Enable the Mocana logs on the Gateway as below and keep them enabled for 3-4 minutes. Note the timestamp as well.
--ike_setdebuglevel IKE_SETDEBUGLEVEL [IKE_SETDEBUGLEVEL ...]
                    set ike debug level (0-8) [mocana (0/1)]
/opt/vc/bin/debug.py --ike_setdebuglevel 8 1
- Disable the Mocana logs with the command below and take the Gateway diagbundle.
/opt/vc/bin/debug.py --ike_setdebuglevel 7 0
MOCANA logs are part of the INFO logs, so use the dbgctl -f | grep "MOC" command when debugging live.
When checking from the diagbundle, these logs are available in the /velocloud/debug folder. The command below works on Linux; if you are using a MacBook, use the attached dbgctl binary, which can be copied to the /velocloud/debug folder in the diagbundle.
dbgctl -m debugmap -l debugall
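As a minimal sketch of the filtering step (the log lines below are made-up samples, not real Gateway output), the "MOC" grep can be combined with the timestamp noted when the debug was enabled to narrow the capture window:

```shell
# Illustrative sample log only; real MOCANA/INFO lines will differ.
cat > /tmp/mocana_sample.log <<'EOF'
2019-11-13T03:01:10.000 INFO [MOC] message before the debug window
2019-11-13T03:02:28.100 INFO [MOC] IKE negotiation event inside the window
2019-11-13T03:02:29.000 INFO [GWD] unrelated gateway message
EOF

# Keep only MOCANA lines stamped within the minute the debug was enabled.
# (Live on a Gateway, the first grep input would be: dbgctl -f)
hits=$(grep "MOC" /tmp/mocana_sample.log | grep "2019-11-13T03:02")
echo "$hits"
```

The same timestamp-prefix trick works on the extracted diagbundle logs.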
Note: MOCANA is used on 5.0 and earlier releases (4.x and 3.x). 5.1 replaces MOCANA with QUICKSEC, and the way to enable the debugs and view the logs changes as well.
Procedure to Enable/Disable QUICKSEC logs for IKE/IPSec Debugging
- Enable the QUICKSEC logs in the Gateway as below and keep it enabled for 3-4 mins. Note the timestamp as well.
debug.py --ike_logger_setlevel 5 5 1 SshEventLoop=3,SshADT*=3
Note: Debugs can also be enabled using --ike_logger_setlevel 10 10 1 SshEventLoop=3,SshADT*=3 (note the level changes from 5 5 1 to 10 10 1). Enabling this level of debugging in a production environment is NOT advised unless explicitly required by the engineering team.
- Disable the QUICKSEC logs with the below commands and take the Gateway diagbundle.
debug.py --ike_logger_setlevel 3 3 0 SshEventLoop=3,SshADT*=3
QUICKSEC logs are part of DBGCTL as well as gwd.log. You can check the logs by using grep "QUICKSEC".
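As a minimal sketch of that check (the log lines below are made-up samples, not real Gateway output), QUICKSEC entries can be isolated from gwd.log or dbgctl output with a simple grep:

```shell
# Illustrative sample log only; real QUICKSEC lines will differ.
cat > /tmp/quicksec_sample.log <<'EOF'
2023-05-01T09:00:00.000 INFO [QUICKSEC] IKE SA negotiation event
2023-05-01T09:00:00.100 INFO [GWD] unrelated gateway message
EOF

# Filter the QUICKSEC lines, as described above.
matches=$(grep "QUICKSEC" /tmp/quicksec_sample.log)
echo "$matches"
```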
To check the NVS VPN status from the Gateway, follow the procedure below.
1) From the Diagbundle, use the gwd.log file and grep for VPN state
more gwd.log* | grep "VPN state"
2) /opt/vc/bin/debug.py --ike provides the IKE status as shown below. You can filter by the NVS name and peer IP address as below.
vcsupport@vcg81-ustx1:/var/log$ /opt/vc/bin/debug.py --ike | grep -E "Name|165.225.34.44" | grep -E "Name|Zscaler Central"
Name Source Destination Instance ID Cookie TD version State
Name Source Destination Instance ID Cookie Type Local VTI IP Peer VTI IP State
Zscaler Central 10.6.10.81 165.225.34.44 1219 0x4c3 Zscaler 169.254.4.173 169.254.4.174 UP
3) /opt/vc/bin/debug.py --ike_sa provides the IKE Phase-1 SA status. This includes the SPI values, which form a pair covering both directions. You can use the Cookie value from Step 2 to grep the Phase-1 SA for the NVS if there are multiple NVS with the same IP address (very common in Zscaler deployments). Note the SPI pair (Ike Spi/Cookie) in this case.
vcsupport@vcg81-ustx1:/var/log$ /opt/vc/bin/debug.py --ike_sa | grep -E "Index|165.225.34.44" | grep -E "Index|0x4c3"
Index IkeSaId Cookie IKE Flags Dir NAT Ike Spi/Cookie PeerAddr State Kbytes Secs
3632 4ab30a2c 0x4c3 v1 0000188d initiator local {621697753bb73cd7 20ec4e3babedba76} 165.225.34.44[4500] AGGR_I 757.777 84215/86400
vcsupport@vcg81-ustx1:/var/log$
This SPI pair ({621697753bb73cd7 20ec4e3babedba76} in this case) can be used to check the corresponding logs on our side as well as on the peer device, for example when we need to find out which side deleted the IKE SAs.
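As a sketch of that workflow (the log line below is a made-up sample; real Gateway deletion messages will differ), grepping for either half of the SPI pair pulls out every log line referencing that IKE SA:

```shell
# Made-up sample log for illustration only; real message formats differ.
cat > /tmp/spi_sample.log <<'EOF'
2019-11-13T03:10:01.000 MSG [VPN] IKE SA delete received spi=621697753bb73cd7
2019-11-13T03:10:01.100 MSG [VPN] unrelated message
EOF

# Search for either half of the SPI pair noted in the ike_sa output above.
grep -E "621697753bb73cd7|20ec4e3babedba76" /tmp/spi_sample.log
```

The same two values can be given to the peer's administrator to locate the matching entries in their device logs.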
4) We can check the Child SA / IPsec SA using the command below, grepping with the Cookie value as before. In a working state, we should see two entries: one for the outbound direction and one for the inbound direction. If either child SA is missing, it can impact traffic. Note the SPI values, which can be used when checking the Child SA logs.
vcsupport@vcg81-ustx1:/var/log$ /opt/vc/bin/debug.py --ike_ch | grep -E "Index|165.225.34.44" | grep -E "Index|0x4c3"
Index Users Cookie SpdId IkeSaId Flags Dir Spi PeerPort Auth Encr Tunnel // Traffic Pkts Kbytes Secs
18371|18372 0000 0x4c3 80000986 4ab30a2c 00000019 outbound initiator 037a17fa 4500 md5_16 null 10.6.10.81 >> 165.225.34.44 // 0.0.0.0/0 > 0.0.0.0/0 74844642 28264843.942 28755/28800
18372|18371 0001 0x4c3 0985 4ab30a2c 0000001d inbound initiator c2dbf105 4500 md5_16 null 10.6.10.81 << 165.225.34.44 // 0.0.0.0/0 < 0.0.0.0/0 106324078 94729801.280 28755/28800
vcsupport@vcg81-ustx1:/var/log$
5) To get the corresponding SPD entry (ike_spd), grep with the same Cookie value as below.
vcsupport@vcg81-ustx1:~$ /opt/vc/bin/debug.py --ike_spd | grep -E "Index|165.225.34.44" | grep -E "Index|0x4c3"
Index SpdId Cookie Flags Mode SecuProto Auth Encr Tunnel Traffic
805 0985 0x4c3 0000000d Tunnel ESP_NULL md5 null 10.6.10.81 <<=>> 165.225.34.44 0.0.0.0/0 <-> 0.0.0.0/0
vcsupport@vcg81-ustx1:~$
Case Study
In this particular case (refer to case #19076108010), the customer reported a complete traffic outage to the non-VeloCloud site (Zscaler). They disabled and re-enabled the NVS tunnel from the VCO to restore traffic. We have a diagbundle collected from the Gateway during the problem state.
Zscaler IP Address - 165.225.34.44
VeloCloud Gateway - vcg81-ustx1- (Internal IP - 10.6.10.81 / External IP - 104.193.29.81 )
From the diagbundle COMMANDS folder, we can check the IKE status by grepping for the NVS IP address and NVS name. Note that the same IP may be used for multiple NVS tunnels. Note the Instance ID and Cookie for further analysis.
ssujith-a02:COMMANDS ssujith$ more optvcbindebugpy--ike.out.txt | grep -E "Name|165.225.34.44" | grep -E "Name|Zscaler Central"
Name Source Destination Instance ID Cookie TD version State
Name Source Destination Instance ID Cookie Type Local VTI IP Peer VTI IP State
Zscaler Central 10.6.10.81 165.225.34.44 942 0x3ae Zscaler 169.254.4.173 169.254.4.174 UP
Name Source Destination Instance ID Cookie Type State
ssujith-a02:COMMANDS ssujith$
Check the Child SA for this NVS. Normally there should be a child SA for each direction. Filter the child SA output using the Cookie and IP address.
Note the SPI value 0ba8aabd. In this case, only the outbound child SA is present (see the Dir column); the inbound child SA is missing.
ssujith-a02:COMMANDS ssujith$ more optvcbindebugpy--ike_childsa.out.txt | grep -E "Index|165.225.34.44" | grep -E "Index|0x3ae"
Index Users Cookie SpdId IkeSaId Flags Dir Spi PeerPort Auth Encr Tunnel // Traffic Pkts Kbytes Secs
354703|354704 0000 0x3ae 8000075c 44e7b758 00000019 outbound initiator 0ba8aabd 4500 md5_16 null 10.6.10.81 >> 165.225.34.44 // 0.0.0.0/0 > 0.0.0.0/0 2405640 240124.46 5241/28800
ssujith-a02:COMMANDS ssujith$
We can search the GWD logs for the SPI value (0ba8aabd) to find the corresponding inbound child SA. In the logs below, the SPI values are displayed in pairs, so in this case the inbound SA SPI value is a042010c.
ssujith-a02:log ssujith$ more gwd.log* | grep "ba8aabd"
2019-11-13T03:02:28.348 MSG [VPN] vc_ikeStatHdlr:828 VPN state: CHILDSA_SUCCESS IKEv1_I spd(Id=8000075c) spi=(a042010c 0ba8aabd) IDci: 0.0.0.0/255.255.255.255 IDcr: 0.0.0.0/255.255.255.255 version=1347964697 serverInstance=942
2019-11-13T03:02:28.349 MSG [CRYPTO] update_dc_outSA:247 NEWSA for dc cookie=0x3ae cache_idx=0 spdId=0x8000075c childSaId=354703 spi=0ba8aabd
ssujith-a02:log ssujith$
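The pairing step can be scripted. A minimal sketch (using the CHILDSA_SUCCESS line from the case study; the field order of the pair is taken from that example): given a line with spi=(value1 value2), extract both values so the unknown half of the pair can be identified.

```shell
# Log line copied from the case study above.
line='2019-11-13T03:02:28.348 MSG [VPN] vc_ikeStatHdlr:828 VPN state: CHILDSA_SUCCESS IKEv1_I spd(Id=8000075c) spi=(a042010c 0ba8aabd) IDci: 0.0.0.0/255.255.255.255 IDcr: 0.0.0.0/255.255.255.255 version=1347964697 serverInstance=942'

# Pull both SPI values out of the "spi=(... ...)" field with sed.
pair=$(printf '%s\n' "$line" | sed -n 's/.*spi=(\([0-9a-f]*\) \([0-9a-f]*\)).*/\1 \2/p')
inbound_spi=${pair%% *}    # first value in this example: a042010c (inbound)
outbound_spi=${pair##* }   # second value: 0ba8aabd (the known outbound SPI)
echo "inbound=$inbound_spi outbound=$outbound_spi"
```

Knowing the outbound SPI (0ba8aabd), the other value in the pair (a042010c) is the inbound SPI to search for next.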
We can use the outbound/inbound SPI values to check the logs on the peer device if we can't find anything in our own logs.
Route Based versus Policy Based VPNs:
We sometimes see NVS tunnels suddenly start having problems after a VCG upgrade; this is often due to mismatched configurations between the peers, either in lifetimes or in policy-based versus route-based VPN types. A route-based vs policy-based mismatch can also appear as some networks being able to pass traffic over the VPN while others cannot.
IPsec doesn't check route-based versus policy-based types during tunnel formation, so the tunnel may still establish, but traffic may or may not pass. Whether it works depends partly on which side initiates phase 2 of the tunnel. This can be misleading: the tunnel may work at some point but encounter problems later on.
Route-based vs policy-based is also not an explicit setting; it is determined by how the VPN is configured to send traffic over the tunnel, and it affects the number of child SAs that establish.
Route-based VPNs allow all traffic over the VPN and use a route to direct traffic into it.
Policy-based VPNs specify which source/destination networks may communicate over the VPN, and send only matching traffic over it.
Different vendors use different terms for policy-based source/destination network pairs: Cisco calls them "encryption domains", Palo Alto calls them "Proxy IDs", and others call them "traffic selectors".
For phase 2 of a route-based VPN, we'll see a single traffic selector, with both inbound and outbound child SAs carrying 0.0.0.0/0 <> 0.0.0.0/0.
For phase 2 of a policy-based VPN, we'll see multiple child SAs matching the source/destination networks, e.g. 10.15.50.0/24 < 172.16.40.0/22.
This can be seen on the VCG with "debug.py --ike_childsa".
When there's a mismatch in types, we can only say for certain that the tunnels won't come up when the peer initiates; we can't guarantee they won't come up when we initiate.
When there's a mismatch in route-based vs policy-based VPNs between the peers, we'll typically see only a single child SA form, for one source/destination network pair, and only traffic matching that pair will pass. In such cases, the child SA may also move between the source/destination network pairs offered by the policy-based side, so different network pairs will intermittently work and fail.
In any case, when there are issues with a VPN, ensure the configuration matches on both sides, even if it previously worked with a mismatched config.
Example of a child SA for policy-based with a traffic selector of 10.255.255.0/24 and 10.1.16.0/20:
debug.py --ike_childsa | grep x.x.x.x(peer ip)
1014 0002 0x144caf 0025 2696675b 00000015 inbound responder 089a7c71 4500 sha1-hmac aes-128-cbc 10.6.10.15 << x.x.x.x // 10.255.255.0/24 < 10.1.16.0/20 4588 0 bond0 5160/28800
1854 0002 0x144caf 0026 2696675b 00000011 outbound responder 03da86a7 4500 sha1-hmac aes-128-cbc 10.6.10.15 >> x.x.x.x // 10.1.16.0/20 > 10.255.255.0/24 6906 0 bond0 5160/28800
Example of a child SA for route based:
debug.py --ike_childsa | grep x.x.x.x(peer ip)
46131|46132 0000 0x2 80000004 75d50a4f 00000019 outbound initiator ff2f97e8 4500 sha1_20 aes_16 10.4.10.63 >> x.x.x.x // 0.0.0.0/0 > 0.0.0.0/0 90281 54169.673 1556/3600
46131|46132 0000 0x2 80000004 75d50a4f 00000019 inbound initiator ff2f97e8 4500 sha1_20 aes_16 10.4.10.63 << x.x.x.x // 0.0.0.0/0 < 0.0.0.0/0 80989 54169.673 1556/3600
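The distinction can be checked mechanically. Below is a hypothetical helper (not part of debug.py) that classifies a child SA line by its traffic selector: 0.0.0.0/0 on both sides indicates route-based, anything narrower indicates policy-based.

```shell
# Hypothetical classifier for "debug.py --ike_childsa" output lines.
# A route-based child SA negotiates 0.0.0.0/0 <> 0.0.0.0/0 selectors;
# specific subnets in the selector indicate a policy-based child SA.
classify_childsa() {
    case "$1" in
        *"0.0.0.0/0 > 0.0.0.0/0"*|*"0.0.0.0/0 < 0.0.0.0/0"*) echo "route-based" ;;
        *) echo "policy-based" ;;
    esac
}

# Selector fragments taken from the examples in this section.
classify_childsa "10.4.10.63 >> x.x.x.x // 0.0.0.0/0 > 0.0.0.0/0"
classify_childsa "10.6.10.15 << x.x.x.x // 10.255.255.0/24 < 10.1.16.0/20"
```

Running this over all child SAs for a peer quickly shows whether both sides negotiated the same VPN type.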
IKEv2 NSD via edge/gateway, compatibility with Cisco ASR/ISR/CSR routers:
These Cisco devices share the same IOS and default to offering certificate authentication for IKEv2 tunnels. The VCG does not support this, so authentication fails. Mocana logs will show "ERR_KEY_EXCHANGE".
A possible workaround is adding the command "no crypto ikev2 http-url cert" to the Cisco configuration. This command disables HTTP cert support, which causes the Cisco device to no longer present the certificate authentication option. Since this is customer equipment, the customer must confirm it is OK to disable and that it won't disrupt other IKEv2 tunnels on their device.
Ref: VLENG-75950 and VLPR-6049.