[Internal] How to quickly find call traces in a core file and match them to an issue.

Article ID: 323763

Updated On:

Products

VMware SD-WAN by VeloCloud

Issue/Introduction

When working with crashes and core files, retrieving the data needed to search for known bugs can be time-consuming. This KB aims to help you find that data and search for answers quickly.

Note that this KB does not explain how to analyze the core file itself. For this, please check out KB 57704.

Environment

VMware SD-WAN by VeloCloud

Resolution

When we have a crash, we normally deal with one of two types:
  • A kernel core.
    • This is a crash that occurred at the kernel level of the system. This could mean an issue with the kernel or the hardware. A kernel crash means the device had to be rebooted.
  • A user space core.
    • This means the crash happened in a specific user space application and did not necessarily affect the rest of the appliance (though exceptions may occur).
In order to identify what caused the crash, we need to look at the call traces in the core file. How to do this depends on which of the two core types we have. Before we do that, however, we need to extract the core file itself.
  1. After the crash has happened, generate a diag bundle for the affected edge (from the VCO or the local web UI).
  2. Navigate to the /velocloud directory.
    • Older versions and gwd cores may store the core in /var/core/
    • Depending on the type of core, you will see a folder called "core" for user space app crashes or "kcore" for kernel crashes.
    • User space cores will start the file name with the name of the crashed user space app.
      • Example: edged.5524.3.1567185608.core.tgz
  3. Extract the core. This can be done in two ways:
    • Using a tool like 7Zip.
    • Running the following command in a Linux terminal:
      • tar -zxf <*.core.tgz>
        • You can add -v for verbose output
  4. You should see the following file:
    • *.core-info.txt
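Step 2 above can be sketched as a small shell helper that lists the files found under any "core" or "kcore" directory of an unpacked bundle. This is only a sketch: the function name list_cores is hypothetical, and it assumes the bundle has already been unpacked into a local directory. The app name is taken as the part of the file name before the first dot, following the example file name above.

```shell
# Hypothetical helper: list core files in an unpacked bundle directory.
# Output columns (tab-separated): type, app name ("-" for kernel cores), path.
list_cores() {
  bundle=$1
  # Match files under .../core/ (this also covers /var/core/) or .../kcore/.
  find "$bundle" -type f \( -path '*/core/*' -o -path '*/kcore/*' \) | sort |
  while IFS= read -r path; do
    file=${path##*/}                     # strip the directory part
    case $path in
      */kcore/*) printf 'kernel\t-\t%s\n' "$path" ;;
      *)         printf 'user\t%s\t%s\n' "${file%%.*}" "$path" ;;
    esac
  done
}
```

For example, list_cores ./bundle prints one line per core, such as "user", "edged" and the path to edged.5524.3.1567185608.core.tgz.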
Now that we have extracted the data, we need to find the traces. For this we are going to use a Linux terminal, such as Ubuntu, or MobaXterm on Windows.

User space app cores:

If the file is a user space app core, run the following command in the directory where the core-info file is:
  • awk '/^Thread 1 /,/^----/' *info.txt | grep -E "^Thread|^#"
This prints the backtrace of Thread 1 as captured in the core file. The output should be similar to this:
Thread 1 (Thread 0x7fd00d7fa700 (LWP 12710)):
#0  0x00000000007d868e in cntr_del (cmgr=0x7fcf5a588810, ptr=0x7fcf844c8490) at /mnt/build/workspace/Release-3.2.2/common/libs/counters/cntr_mgr
#1  0x000000000086148e in delete_link_counters (link=0x7fcf5a86f720, cm=0x7fcf5a588810) at /mnt/build/workspace/Release-3.2.2/common/libs/vcmp_d
#2  0x00000000007e1973 in vc_link_delete_local_link (link=link@entry=0x7fcf5a86f720, cm=0x7fcf5a588810) at /mnt/build/workspace/Release-3.2.2/co
#3  0x00000000008f0cd7 in vc_net_link_cleanup (link=0x7fcf5a86f720) at /mnt/build/workspace/Release-3.2.2/gateway/libs/link_fsm/gwd_link_sm.c:77
#4  vc_net_link_put (link=0x7fcf5a86f720, type=type@entry=2) at /mnt/build/workspace/Release-3.2.2/gateway/libs/link_fsm/gwd_link_sm.c:103
#5  0x00000000007db79f in lm_destroy_td (td=td@entry=0x7fcf5a84cfd0) at /mnt/build/workspace/Release-3.2.2/common/libs/linkmgr/linkmgr.c:2101


Note the function names in the top frames of the trace:
cntr_del
delete_link_counters
vc_link_delete_local_link
vc_net_link_cleanup
vc_net_link_put
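
Picking out the function names by hand can also be sketched as a filter over the backtrace lines. The helper name frame_functions is hypothetical, and the pattern assumes frames in the two forms seen in the output above ("#0  0x... in func (...)" and "#4  func (...)"); frames in any other form are skipped.

```shell
# Hypothetical helper: read backtrace lines on stdin and print one
# function name per frame line. Non-frame lines (e.g. "Thread 1 ...")
# and frames that do not fit the assumed forms produce no output.
frame_functions() {
  sed -nE 's/^#[0-9]+ +(0x[0-9a-fA-F]+ in )?([A-Za-z_][A-Za-z0-9_]*) .*/\2/p'
}

# Demo on two frames taken from the sample output above.
printf '%s\n' \
  '#3  0x00000000008f0cd7 in vc_net_link_cleanup (link=0x7fcf5a86f720) at /mnt/build/workspace/Release-3.2.2/gateway/libs/link_fsm/gwd_link_sm.c:77' \
  '#4  vc_net_link_put (link=0x7fcf5a86f720, type=type@entry=2) at /mnt/build/workspace/Release-3.2.2/gateway/libs/link_fsm/gwd_link_sm.c:103' |
  frame_functions
```

Piping the output of the awk command above into frame_functions should give the same list of names, one per line.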

From here, move on to the JIRA search step.

Kernel cores:

If you have a kernel core, navigate to /var/log. There you should see a file called "dmesg.*". Run the following command against it:

sed -n '/cut\ here/,${/cut\ here/!p;}' dmesg.*

This command should show you the Call Traces, as well as other useful information for engineering if required.
 

[864126.548483] Call Trace:
[864126.551320]  [<ffffffff810f8630>] ? page_referenced_one+0x130/0x130
[864126.558442]  [<ffffffff810dd81e>] ? shrink_page_list+0x5ce/0x8d0
[864126.565270]  [<ffffffff810de132>] ? shrink_inactive_list+0x272/0x3b0
[864126.572487]  [<ffffffff810de901>] ? shrink_lruvec+0x3a1/0x580
[864126.579021]  [<ffffffff810deb13>] ? shrink_zone+0x33/0x120
[864126.585264]  [<ffffffff810df48a>] ? balance_pgdat+0x33a/0x500
[864126.591799]  [<ffffffff810df8b2>] ? kswapd+0x262/0x2d0
[864126.597654]  [<ffffffff8108e4c0>] ? __wake_up_sync+0x10/0x10
[864126.604092]  [<ffffffff810df650>] ? balance_pgdat+0x500/0x500
[864126.610628]  [<ffffffff81074eb2>] ? kthread+0xd2/0xe0
[864126.616383]  [<ffffffff81074de0>] ? kthread_create_on_node+0x170/0x170
[864126.623797]  [<ffffffff814dd518>] ? ret_from_fork+0x58/0x90
[864126.630136]  [<ffffffff81074de0>] ? kthread_create_on_node+0x170/0x170


Note the function names in the call trace:

  • page_referenced_one
  • shrink_page_list
  • shrink_inactive_list
  • shrink_lruvec
  • shrink_zone
  • balance_pgdat
  • kswapd
  • __wake_up_sync
  • balance_pgdat
  • kthread
  • kthread_create_on_node
  • ret_from_fork
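
The same extraction can be sketched for kernel call traces. The helper name kernel_trace_symbols is hypothetical, and the pattern assumes lines in the "[timestamp]  [<address>] ? symbol+0xOFF/0xLEN" form shown above. As a design choice for searching, repeated symbols are printed only once, in order of first appearance.

```shell
# Hypothetical helper: read dmesg lines on stdin and print each distinct
# symbol name found in a call trace entry, in order of first appearance.
kernel_trace_symbols() {
  sed -nE 's/.*\] +([?] +)?([A-Za-z_][A-Za-z0-9_.]*)[+]0x[0-9a-f]+\/0x[0-9a-f]+.*/\2/p' |
    awk '!seen[$0]++'   # drop repeats while keeping the original order
}

# Demo on three lines taken from the sample output above.
printf '%s\n' \
  '[864126.548483] Call Trace:' \
  '[864126.585264]  [<ffffffff810df48a>] ? balance_pgdat+0x33a/0x500' \
  '[864126.591799]  [<ffffffff810df8b2>] ? kswapd+0x262/0x2d0' |
  kernel_trace_symbols
```

To get the names on a single space-separated line, ready to paste into a search, append | paste -sd' ' - to the pipeline.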
From here, move on to the JIRA search step.
 

Pulling the backtrace from a core directly in the CLI

Sometimes getting the core off the edge is difficult (for example, the edge is down in the VCO, or the core file is too large for a diag bundle). In these scenarios we can pull the backtrace directly in the CLI. We don't want to extract the whole core on the edge, as cores can be very large and take up too much space on the Edge's flash. Instead, we can extract just the core-info.txt file and then cat it:

1. Find the core you need
velocloud Edge:/velocloud/core# ls -lh
-rw-r--r--    1 root     root        5.8M Jun 28 10:54 edged.6270.3.1561719199.core.tgz
velocloud Edge:/velocloud/core#

2. Use this command to see what files are in the core without extracting it
velocloud Edge:/velocloud/core# tar -tvf edged.6270.3.1561719199.core.tgz
-rw-r--r-- root/root     25088 2019-06-28 10:53 edged.6270.3.1561719199.core-info.txt
-rw------- root/root 1292709888 2019-06-28 10:53 edged.6270.3.1561719199.core
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
velocloud Edge:/velocloud/core#

3. The file that ends in core-info.txt is the one you want. Use the following command to extract just that file from the archive.
The command is in the format:  tar -xf <core-filename> <file-you-want-to-extract>
velocloud Edge:/velocloud/core# tar -xf edged.6270.3.1561719199.core.tgz edged.6270.3.1561719199.core-info.txt
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
velocloud Edge:/velocloud/core#

4. Now you can see the extracted file:
velocloud Edge:/velocloud/core# ls -lh
total 6020
-rw-r--r--    1 root     root       24.5K Jun 28 10:53 edged.6270.3.1561719199.core-info.txt
-rw-r--r--    1 root     root        5.8M Jun 28 10:54 edged.6270.3.1561719199.core.tgz
velocloud Edge:/velocloud/core#

 

5. Finally, cat the file and copy the backtrace to a text editor.
For example: # cat edged.6270.3.1561719199.core-info.txt | more
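
Steps 2 to 4 above can be sketched as one helper. The function name extract_core_info is hypothetical. Because tar may report an error on a truncated archive even though the core-info.txt member was written (as in the sample output above), this sketch judges success by whether the file exists afterwards rather than by tar's exit status. It assumes the tar on the system accepts -tf and -xf with a member name, as used in the steps above.

```shell
# Hypothetical helper: extract only the *core-info.txt member of a core
# archive into the current directory and print its name.
extract_core_info() {
  archive=$1
  # List the members without extracting; take the first core-info.txt name.
  member=$(tar -tf "$archive" 2>/dev/null | grep 'core-info\.txt$' | head -n 1)
  if [ -z "$member" ]; then
    echo "no core-info.txt found in $archive" >&2
    return 1
  fi
  # Ignore tar's exit status here: a truncated archive can still yield the file.
  tar -xf "$archive" "$member" 2>/dev/null || true
  # Succeed only if the extracted file exists and is not empty.
  [ -s "$member" ] && printf '%s\n' "$member"
}
```

Usage would look like: cat "$(extract_core_info edged.6270.3.1561719199.core.tgz)" | more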

From here, move on to the JIRA search step.
 

JIRA search

With these function names, you can run an advanced JIRA search for VLENGs or VLPRs using the following formula:

text ~ "input1 input2 input3"

You can also add project = Engineering for VLENGs, or project = VLPR for VLPRs. (For the second one, after you type in vlpr you will see a dropdown option for "Velocloud Problem Reports". Choose that one.)

Example:
project = Engineering and text ~ "cntr_del delete_link_counters vc_link_delete_local_link vc_net_link_cleanup vc_net_link_put"

What not to do: text ~ "term1" AND text ~ "term2" AND text ~ "term3". Chaining the terms this way is unnecessary.
You will get the same results with text ~ "term1 term2 term3".
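
Assembling the query from a list of function names can be sketched as follows. The helper name build_jql is hypothetical; it only formats the search string in the shape of the example above and does not talk to JIRA.

```shell
# Hypothetical helper: build_jql PROJECT TERM...
# Prints a search string in the same shape as the example above.
build_jql() {
  project=$1
  shift
  # "$*" joins the remaining arguments with single spaces (default IFS).
  printf 'project = %s and text ~ "%s"\n' "$project" "$*"
}

# Demo with the function names from the user space example.
build_jql Engineering cntr_del delete_link_counters vc_link_delete_local_link vc_net_link_cleanup vc_net_link_put
```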

Tip: start by searching with every function name, and look for an exact match in a JIRA ticket or bug. If there is no exact match, remove key words one at a time until you get a match, then consult with an EE on whether the bug or issue applies to your case. Remember to search both for matching VLPRs (in https://servicedesk.eng.vmware.com/projects/VLPR/queues/custom/705 ) and for matching VLENG tickets (in https://jira.eng.vmware.com/projects/VLENG/issues/ )
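
The progressive narrowing described in this tip can be sketched as a helper that prints one query per attempt. The helper name narrowing_queries is hypothetical, and removing terms from the end of the list first is an assumption made for illustration; which key words to remove is a judgment call.

```shell
# Hypothetical helper: print one text query per line, starting with all
# terms and dropping the last term on each following line.
narrowing_queries() {
  [ "$#" -gt 0 ] || return 0
  terms=$*                      # function names contain no spaces
  while :; do
    printf 'text ~ "%s"\n' "$terms"
    case $terms in
      *' '*) terms=${terms% *} ;;   # remove the last space-separated word
      *)     break ;;               # a single term is left, so stop
    esac
  done
}

# Demo with three of the function names from the user space example.
narrowing_queries cntr_del delete_link_counters vc_link_delete_local_link
```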

If you find a matching core, check whether the affected versions and triggers match. If the core happened on a version that already has the fix, it does not count as a match. Also, some cores are very generic and may not be a match even if the backtrace and version match (mutex mon cores are the usual example). For those, you will still need a VLPR if you want to be sure.

Another tip: a core may be missing from the diag bundle because of its size. If you don't see the core in the diag bundle, double-check the CLI to confirm whether a core exists. For HA, make sure to check both the active and the standby.


Workaround:
If you cannot find an issue or bug using this method, or the output of the commands is not as expected, refer to our core analysis KB and engage an EE for help isolating the issue or engaging Engineering.

Additional Information

How to analyze a core file:
https://ikb.vmware.com/s/article/57704?lang=en_US