Packet captures for VMware VeloCloud SD-WAN

Article ID: 318986

Products

VMware VeloCloud SD-WAN

Issue/Introduction

There are many different options for gathering packet captures from an edge. They are useful in different scenarios, but not all of them are well known. The purpose of this KB is to document the different options available.

Environment

VMware VeloCloud SD-WAN

Resolution

The most obvious option is to trigger a packet capture from the VCO. Although this is handy in certain scenarios, it has a few drawbacks that make it unsuitable for many situations:
  • There is a 2 minute limit, which will be even less for interfaces with a lot of traffic
  • It requires that the edge is online in the VCO and is able to upload the capture to the VCO
  • It can be difficult to time the capture right in cases where we start the capture and then ask the customer to trigger the event. When we trigger the capture, the VCO has to wait until the next heartbeat is received from the edge and then send its request for the capture, so it doesn't start capturing immediately.

So while the VCO capture is occasionally useful, we often need to gather packet captures directly via the CLI, which offers much more flexibility. Once you're in the CLI, the first thing you'll want to check is whether the ports are using dpdk or not:
 
Edge:~# debug.py --dpdk_ports
name    port   link   strip   speed   duplex   autoneg
Edge:~#

In the above case there are no ports using dpdk. If there are, you'll see the dpdk ports listed in the output of that command. If the port is not using dpdk you'll use the regular tcpdump command, whereas if it is using dpdk you'll want to use the shell script, tcpdump.sh. For the most part, how you use the command doesn't change whether it's the native command or the script.
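As a rough sketch, the check above could be wrapped in a small helper that decides which command to use for a given interface. The function name pick_capture_cmd is hypothetical, and it assumes that dpdk ports appear as rows under the header line with the interface name in the first column. Verify that against real output before relying on it.

```shell
# Hypothetical helper: reads `debug.py --dpdk_ports` output on stdin and
# prints which capture command to use for the given interface.
# Assumption: each dpdk port is listed as a row below the header line,
# with the interface name in the first column.
pick_capture_cmd() {
    iface="$1"
    if awk -v ifc="$iface" 'NR > 1 && $1 == ifc { found = 1 } END { exit !found }'; then
        echo "tcpdump.sh"
    else
        echo "tcpdump"
    fi
}

# Usage on an edge (not run here):
#   debug.py --dpdk_ports | pick_capture_cmd ge3
```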

Basic tcpdump options to keep in mind:
 
Edge:~# tcpdump --help
tcpdump version 4.9.0
libpcap version 1.8.1
Usage: tcpdump [-aAbdDefhHIJKlLnNOpqStuUvxX#] [ -B size ] [ -c count ]
                [ -C file_size ] [ -E algo:secret ] [ -F file ] [ -G seconds ]
                [ -i interface ] [ -j tstamptype ] [ -M secret ] [ --number ]
                [ -Q in|out|inout ]
                [ -r file ] [ -s snaplen ] [ --time-stamp-precision precision ]
                [ --immediate-mode ] [ -T type ] [ --version ] [ -V file ]
                [ -w file ] [ -W filecount ] [ -y datalinktype ] [ -z postrotate-command ]
                [ -Z user ] [ expression ]

-i      Choose your interface.
-nn     Don't resolve IP addresses or port numbers to names. This option speeds up the output and makes it easier to parse by IP.
-e      Adds Ethernet (MAC) addresses to the output. Handy when troubleshooting layer 2 issues.
-Q      Allows you to capture only incoming or outgoing traffic (e.g. -Q in or -Q out). This can come in handy when troubleshooting layer 2 loops: filter by incoming traffic only, and if you see your own packets coming in you know they've been looped.
-vv     Increases verbosity of output. Use -vvv for maximum verbosity. Use this if you want to see more information from the packet headers.
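To illustrate how these options combine, here is a minimal sketch for the layer 2 loop case described under -Q. The helper name loop_check_cmd and the MAC address are placeholders, and the function only prints the command rather than running it. The `ether src` filter is standard pcap filter syntax; for a dpdk port you would substitute tcpdump.sh.

```shell
# Hypothetical helper: print (not run) a tcpdump command that shows only
# incoming frames whose source MAC is our own, which would suggest a loop.
loop_check_cmd() {
    iface="$1"    # interface to watch, e.g. ge3
    own_mac="$2"  # MAC address of that interface (placeholder value below)
    echo "tcpdump -i $iface -nn -e -Q in ether src $own_mac"
}

loop_check_cmd ge3 00:11:22:33:44:55
```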

 

Rolling Captures:


So how do we capture intermittent issues? If the issue you want to capture happens randomly twice a week, how do you get a capture of it? Suppose you output everything to a text file via:

Edge:~# tcpdump.sh -i ge3 -nn >> /tmp/testcap
There are two problems with this. First, the above command writes to /tmp, which is backed by RAM, so if I fill it up I could crash the edge. Second, even if I put the file somewhere safer like /velocloud/log, it will continue growing indefinitely, so space is still an issue. We could end up with a 3 GB capture, and that doesn't do anybody any good.

The solution is a rolling capture. Just as logs fill up to a certain size and roll over, we can do the same thing with tcpdump. This is a capture you can leave running for long periods of time when trying to catch a very intermittent issue. The basic command will look like this:
 
nohup tcpdump.sh -i eth0 -nn host 1.1.1.1 -C 15 -W 4 -w /velocloud/log/capture.pcap &

Let's break this one down piece by piece as you'll need to be able to modify it for your own uses:
nohup         -> prevents a SIGHUP from being sent to your tcpdump process when you exit your SSH session, which would kill it (i.e. your tcpdump session won't die when you close PuTTY)
tcpdump.sh    -> I used the script here, but you can also use the native tcpdump when the situation warrants
-i eth0       -> choose your interface. This can also be a VLAN, such as -i br-network1
-nn host 1.1.1.1  -> customize your tcpdump options and filters however is appropriate
-C 15         -> sets the maximum size of each file in MB (tcpdump counts in units of 1,000,000 bytes). 15 MB in this example.
-W 4          -> sets how many files are kept before the oldest is overwritten. Together, these last 2 options mean the capture fills up to 4 files of 15 MB each, for a maximum of 60 MB of disk space
-w /velocloud/log/capture.pcap  -> writes the output to the given file. I chose /velocloud/log on purpose here, as that directory usually has a lot of free space and we can collect the pcaps by triggering a diag bundle from the VCO
&             -> runs the process in the background so it doesn't lock up your SSH session
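The disk budget of a rolling capture is simply the -C value multiplied by the -W value. A small sketch of that arithmetic, using a hypothetical helper name, can make it easier to try different combinations before committing to one:

```shell
# Worst-case disk usage of a rolling capture, in MB:
# the -C value (size per file) times the -W value (number of files).
rolling_capture_max_mb() {
    file_size_mb="$1"  # value passed to -C
    file_count="$2"    # value passed to -W
    echo "$((file_size_mb * file_count))"
}

# The example above: -C 15 -W 4
rolling_capture_max_mb 15 4
```

With the values from the example command this prints 60, matching the 60 MB ceiling described above.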

Before doing this, and before deciding how much space you want your rolling capture to take up, check the available disk space:
Edge:~# df -h
Filesystem                Size      Used Available Use% Mounted on
/dev/root               975.9M    293.9M    614.8M  32% /
tmpfs                     1.9G      2.8M      1.9G   0% /tmp
tmpfs                   512.0K         0    512.0K   0% /dev
/dev/sda2                13.7M      2.8M      9.7M  22% /boot
/dev/sda6                 3.9G    795.0M      3.1G  20% /velocloud
Edge:~#

In this case /velocloud has 3.1G of available space. All the edges I've seen so far have a lot of space there, but it's important to make sure.
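The manual check above could also be scripted. The following is a minimal sketch, assuming the df on the edge accepts the POSIX -P and -k options; the helper name has_free_mb is hypothetical, and the 60 MB figure is just the rolling capture example from earlier.

```shell
# Hypothetical helper: succeed only if the filesystem holding PATH has at
# least NEED_MB megabytes available. Uses `df -P -k` (1024-byte blocks).
has_free_mb() {
    path="$1"
    need_mb="$2"
    avail_kb=$(df -P -k "$path" 2>/dev/null | awk 'NR == 2 { print $4 }')
    # An empty value means df failed (for example, the path does not exist).
    [ -n "$avail_kb" ] && [ "$avail_kb" -ge "$((need_mb * 1024))" ]
}

# Example: refuse to start a 60 MB rolling capture without enough room.
if has_free_mb /velocloud 60; then
    echo "enough space for the capture"
else
    echo "not enough space (or path missing), do not start the capture"
fi
```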

Important points to keep in mind when using rolling captures:
  • Make sure to kill the process once you've captured the traffic you need. (Check if it's running via ps -elf | grep tcp)
  • Make sure to delete the captures off the edge after you've killed the process and pulled your diag bundle. Otherwise they'll stay there forever taking up disk space and inflating diag bundles.
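The two cleanup steps above can be sketched as a pair of helpers. Both function names are hypothetical. The sketch assumes the PID is the capture process you found via ps, and that the rolled files all share the path given to -w as a prefix. If tcpdump.sh spawns a separate child process on your edge, check ps again after stopping it.

```shell
# Hypothetical helper: stop a running capture by PID.
stop_capture() {
    pid="$1"
    kill "$pid" 2>/dev/null || true
    # Reap the process if it was started from this shell; harmless otherwise.
    wait "$pid" 2>/dev/null || true
}

# Hypothetical helper: delete every file that starts with the -w path.
clean_captures() {
    prefix="$1"
    rm -f -- "$prefix"*
}

# Usage on an edge (not run here; 12345 is a placeholder PID):
#   stop_capture 12345
#   ...pull the diag bundle from the VCO...
#   clean_captures /velocloud/log/capture.pcap
```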
 

Running the capture remotely:

This is an alternative to the rolling capture, courtesy of Sujith. Surprisingly, we can run the capture from a remote machine and store the traffic on that machine instead of the edge. This approach has its advantages and disadvantages compared to a rolling capture:

 + The captures can be as large as the remote machine can hold. Since they are not stored on the edge itself, we don't need to worry about using up space on the edge, nor are we limited by the edge's disk size.
 + This approach can also be used for grabbing captures from gateways. I have not tested a rolling capture on a gateway, nor would I recommend one, so this is a good alternative.
 -  This requires constant SSH connectivity to the edge, so it won't work if you want a packet capture of something like an edge's only WAN link flapping, unless you SSH to the edge from the LAN side, which requires more customer involvement.
 -  It requires that you log in directly to the edge. If you can only log in via the gateway, this won't work because the capture would end up stored on the gateway instead of your local machine, which must be avoided.
 -  The output is the same text we see in the CLI when running tcpdump; it is not saved in a format that Wireshark can open. Only captures saved with -w are in that format, but with this remote method -w writes the file on the VCE you're connecting to, so it can't be used to save a capture on your local PC.

If capturing with the native tcpdump command, the full path to the command is not necessary:
 
ssh [email protected] "tcpdump -i br-network1 -nn" >> ~/test_capture
Note: My output directory here is the home (~) directory because I was running the command from the Windows Subsystem for Linux.

If using tcpdump.sh then the command won't work without the full path:
 
human@linux_machine:~$ ssh [email protected] "/opt/vc/bin/tcpdump.sh -i br-network1 -nn" >> ~/test_capture
Password:
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on br-network1, link-type EN10MB (Ethernet), capture size 262144 bytes
^Chuman@linux_machine:~$
human@linux_machine:~$
human@linux_machine:~$ ls -lh | grep test
-rw-rw-rw- 1 human human 4.1K Apr 22 11:20 test_capture
human@linux_machine:~$
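The remote capture shown above could be wrapped in a small function so that the interface and output file are easy to change. This is only a sketch: the function name remote_capture and the SSH_CMD override are hypothetical, and the target argument is a placeholder for whatever login you normally use for the edge.

```shell
# Hypothetical wrapper around the remote capture shown above.
# SSH_CMD can be overridden (for example with a stub) for a dry run.
remote_capture() {
    target="$1"   # login for the edge, e.g. user@edge-address (placeholder)
    iface="$2"    # interface to capture on, e.g. br-network1
    outfile="$3"  # local file that receives the text output
    # The full path is needed for tcpdump.sh when it is run through ssh.
    ${SSH_CMD:-ssh} "$target" "/opt/vc/bin/tcpdump.sh -i $iface -nn" >> "$outfile"
}

# Usage (not run here):
#   remote_capture user@edge-address br-network1 ~/test_capture
```

As with the manual command, stop the capture with Ctrl-C when you have what you need.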
 

Calling tcpdump from a script:

For more complex scenarios, if you have some basic scripting skills you can put together a script that calls tcpdump whenever the scenario you are tracking occurs. For example, in order to take captures for 5 minutes every hour starting at minute 56, I used this script:

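# Check the clock once a minute and start a capture when the minute is 56.
while true
do
    min=$(date +"%M")
    echo "$min"
    if [ "$min" = "56" ]
    then
        # -c 20000 makes tcpdump.sh exit after 20,000 packets.
        # The output is decoded text appended to the file, not a binary pcap.
        tcpdump.sh -i sfp1 -ennvvv -c 20000 >> /velocloud/log/verbose_capture.pcap
    fi
    sleep 60
done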

Here the -c 20000 option causes the tcpdump.sh command to exit automatically after it has captured 20k packets. With scripting, what triggers the capture is limited only by your own creativity and skills. That said, I think in most cases a simple rolling capture will get the job done and is much easier to implement.
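As one more illustration of a scripted trigger, the sketch below polls a check command and starts a capture the first time that check fails. Everything here is an assumption for illustration: the function name capture_on_failure, the ping target, the ping flags available on the edge, and the output path. Both commands are passed as simple strings and split on whitespace, so this only suits commands without quoted arguments.

```shell
# Hypothetical helper: run CHECK_CMD every INTERVAL seconds and run
# CAPTURE_CMD once, the first time CHECK_CMD fails.
capture_on_failure() {
    check_cmd="$1"
    capture_cmd="$2"
    interval="${3:-60}"
    # Word splitting of the command strings is intentional here.
    while $check_cmd >/dev/null 2>&1; do
        sleep "$interval"
    done
    $capture_cmd
}

# Usage on an edge (not run here): capture 20000 packets once a ping fails.
#   capture_on_failure "ping -c 1 1.1.1.1" \
#       "tcpdump.sh -i sfp1 -nn -c 20000 -w /velocloud/log/trigger.pcap" 10
```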