General troubleshooting guide for heavy link flap

Article ID: 388293

Updated On:

Products

VMware VeloCloud SD-WAN

Issue/Introduction

Occasional "Link Dead" and "Link Alive" events on the VCO are common and are usually caused by ISP link flaps. If a link flaps frequently, however, further investigation is needed to identify the root cause.

Environment

VMware VeloCloud SD-WAN Edge

Cause

A "Link Dead" event is generated only when all tunnels on a specific link are dead. As soon as any tunnel is re-established, the VCE generates a "Link Alive" event.

Resolution

This article uses a rolling packet capture to troubleshoot why all tunnels on a specific link go dead. A link may have dozens of VCMP peers, and production traffic may be running over those tunnels, so an unfiltered capture can become very large. The best practice is to capture only the VCMP tunnel between the link and the secondary gateway. That tunnel carries essentially no production traffic, so the capture files stay small. Because a "Link Dead" event requires every tunnel on the link to be dead, the tunnel to the secondary gateway is representative of all other VCMP tunnels on that link.

Rolling Captures:

Just as logs fill up to a certain size and then roll over, the same can be done with vctcpdump. A rolling capture can be left running for long periods to catch a very intermittent issue. The basic command looks like this:

nohup vctcpdump -i <Interface> -nn host <Sec GW IP> -C 25 -W 4 -w /velocloud/log/GE<x>link_flap.pcap &

The command is broken down piece by piece below so that it can be adapted to different scenarios:


nohup        -> keeps the capture running after you exit the SSH session. Without it, closing the session (for example, closing PuTTY) sends a SIGHUP to the vctcpdump process and kills it.
-i           -> the problematic WAN interface.
host x.x.x.x -> capture filter; enter the secondary gateway IP after "host".
-C 25        -> maximum size of each capture file in MB. 25 MB in this example.
-W 4         -> number of files in the rotation; once the last file is full, the oldest is overwritten. Together, these two options fill up to 4 files of 25 MB each, for a maximum of 100 MB of disk space.
-w /velocloud/log/GE<x>link_flap.pcap  -> writes the output to the given file. /velocloud/log is chosen because that directory usually has plenty of free space and the pcaps can be collected by triggering a diag bundle from the VCO.
&            -> runs the process in the background.
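The disk budget implied by -C and -W can be checked before starting a capture. The following is a minimal sketch, not part of the product: the helper names ring_bytes, free_bytes and ring_fits are hypothetical, and it assumes that vctcpdump counts -C in units of 1,000,000 bytes as tcpdump does and that a POSIX df is available on the Edge.

```shell
# Worst-case size of the capture ring in bytes.
# $1 = -C value (assumed to be in units of 1,000,000 bytes, as in tcpdump)
# $2 = -W value (number of files in the rotation)
ring_bytes() {
    echo $(( $1 * $2 * 1000000 ))
}

# Free bytes on the filesystem that holds directory $1 (POSIX df, 1 KiB blocks).
free_bytes() {
    df -Pk "$1" | awk 'NR==2 { printf "%.0f\n", $4 * 1024 }'
}

# Succeeds only if the ring ($2 = -C, $3 = -W) fits in the free space of $1.
ring_fits() {
    [ "$(ring_bytes "$2" "$3")" -lt "$(free_bytes "$1")" ]
}

# Example: 4 files of 25 MB each in the article's log directory.
log_dir=/velocloud/log
if [ -d "$log_dir" ]; then
    if ring_fits "$log_dir" 25 4; then
        echo "capture ring fits"
    else
        echo "not enough free space"
    fi
fi
```

If the check reports too little free space, lowering -C or -W reduces the budget proportionally.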

 

Real-Life Example:

GE5 flaps frequently. The rolling capture command would be:

nohup vctcpdump -i ge5 -nn host <Sec GW IP> -C 25 -W 4 -w /velocloud/log/GE5link_flap.pcap &

Once the rolling capture is running, wait for the next link flap. After a new link flap is observed, stop the vctcpdump process: check whether it is still running with ps aux | grep -i vctcpdump, then use kill -9 <pid> to terminate it. Collect the pcaps by triggering a diag bundle from the VCO. In this example, GE5 went down at 11:17:31 UTC+8 on Feb 12th. Checking the capture:

Based on the above capture, the SD-WAN Edge behaved as expected. When the tunnel went dead, the Edge kept trying to re-establish it but received no response. Because GE5 has 50+ SD-WAN peers and all of them went dead at the same time, the issue is very likely on the local site. The capture shows that the issue is not on the VeloCloud Edge but somewhere else in the local site. After the customer replaced the cable and rebooted the ISP device, GE5 stopped flapping.
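For the step of stopping the capture, a gentler variant of kill -9 is sketched below. It is not the article's procedure: it tries SIGTERM first and only falls back to kill -9, on the assumption that vctcpdump reacts to SIGTERM the way tcpdump does and closes its output file cleanly. The function name stop_capture is hypothetical.

```shell
# Stop a background capture by PID. SIGTERM is tried first so that a
# tcpdump-style process can close its output file; "kill -9" from the
# article is the fallback if the process is still alive after ~5 seconds.
stop_capture() {
    target_pid=$1
    # Nothing to do if the process does not exist (or cannot be signalled).
    kill "$target_pid" 2>/dev/null || return 0
    for attempt in 1 2 3 4 5; do
        kill -0 "$target_pid" 2>/dev/null || return 0   # it has exited
        sleep 1
    done
    kill -9 "$target_pid" 2>/dev/null || true
}

# Usage: take the PID from "ps aux | grep -i vctcpdump", then run
#   stop_capture <pid>
```

The PID still comes from the ps aux | grep -i vctcpdump check described in the example.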

Additional Information

Make sure to delete the captures from the Edge after killing the process and pulling the diag bundle. Otherwise they stay on disk indefinitely, taking up space and inflating future diag bundles.
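The cleanup can be sketched as a small helper. This is an illustration only: the function name remove_captures is hypothetical, and it assumes vctcpdump names the rolled files the way tcpdump does with -C and -W, by appending a number to the -w path (for example GE5link_flap.pcap0, GE5link_flap.pcap1).

```shell
# Remove every rolled capture file whose path starts with the given prefix,
# e.g. /velocloud/log/GE5link_flap.pcap matches ...pcap0, ...pcap1, and so on.
# Prints each file before deleting it; does nothing if no file matches.
remove_captures() {
    prefix=$1
    for f in "$prefix"*; do
        [ -f "$f" ] || continue     # unmatched glob or not a regular file
        echo "removing $f"
        rm -f -- "$f"
    done
}

# Example for the GE5 capture from this article.
remove_captures /velocloud/log/GE5link_flap.pcap
```

Run it only after the diag bundle has been pulled, since the files cannot be recovered afterwards.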