How to collect kernel debug information when the APM hangs or reboots with "no heartbeat"

Article ID: 167858

Updated On:

Products

XOS

Issue/Introduction

Information on collecting debug info when an APM is crashing; this requires console access to the APM.

APM resets, logging only an ambiguous “no heartbeat” message. The cause may be an APM crash or hang.

In some cases an APM resets as a result of a crash or hang, but /var/log/messages reports only the message “no heartbeat”. The cbssysctrld process automatically resets the blade whenever the APM crashes, hangs, or stops sending heartbeats. At the time of a crash or hang the APM reports “initializing” and eventually “down”. The customer may need an additional APM to handle the load while one module is out of service.
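
As a first check, the log can be searched for the message described above. The following is a minimal sketch; the function name find_no_heartbeat is hypothetical, and the log path is passed as an argument so that the search can also be run on a copy of the log.

```shell
#!/bin/sh
# Hypothetical helper: print every line (with its line number) that
# contains the text "no heartbeat".
# $1: log file to search; defaults to /var/log/messages.
find_no_heartbeat() {
    logfile="${1:-/var/log/messages}"
    grep -n "no heartbeat" "$logfile"
}
```

Calling find_no_heartbeat with no argument searches /var/log/messages. The function returns a non-zero status when no matching line is found.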

Use the following procedure to verify the cause of the reset.

Cause

Collect debug / crash information for analysis of the root cause.

Resolution

1.       Console access on the APM requires keyboard capability. The APM console cable is a straight-through ribbon cable with a DB9 female connector on each end.

2.       Edit the /crossbeam/etc/cbssysctrld.cf file. To prevent the APM from being rebooted while you are working on it, you must specify which APM should not be rebooted by the CPM. The CPM reboots a module if it does not receive a heartbeat response within a given time period.

Locate the "reset_slots" statement and replace the default value ("fff") with the hexadecimal form of the 12-bit binary number that has a "0" in the bit position of each slot whose module must NOT be rebooted. Bit positions are counted from right to left, so the rightmost bit corresponds to slot 1. Only the hexadecimal value is entered in the file; the binary values are shown here for reference. If the "reset_slots" statement is commented out (leading "#" character), uncomment the line.

Values for particular slots:

reset_slots=0x00000fff 111111111111 (default value)
reset_slots=0x000007ff 011111111111 (slot 12 - 12th bit right to left)
reset_slots=0x00000bff 101111111111 (slot 11 - 11th bit right to left)
reset_slots=0x00000dff 110111111111 (slot 10 - 10th bit right to left)
reset_slots=0x00000eff 111011111111 (slot 9 - 9th bit right to left)
reset_slots=0x00000f7f 111101111111 (slot 8 - 8th bit right to left)
reset_slots=0x00000fbf 111110111111 (slot 7 - 7th bit right to left)
reset_slots=0x00000fdf 111111011111 (slot 6 - 6th bit right to left)
reset_slots=0x00000fef 111111101111 (slot 5 - 5th bit right to left)
reset_slots=0x00000ff7 111111110111 (slot 4 - 4th bit right to left)
reset_slots=0x00000ffb 111111111011 (slot 3 - 3rd bit right to left)
reset_slots=0x00000ffd 111111111101 (slot 2 - 2nd bit right to left)
reset_slots=0x00000ffe 111111111110 (slot 1 - 1st bit right to left)
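
To check a value that is already in the file, the mask can be decoded back into slot numbers. The sketch below assumes the 12-slot layout shown in the table; the function name protected_slots is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list the slots (1-12) whose bit is 0 in a
# reset_slots value, that is, the slots that will NOT be reset.
# $1: the value as written in the file, for example 0x00000ffb
protected_slots() {
    mask=$(( $1 ))
    slot=1
    out=""
    while [ "$slot" -le 12 ]; do
        # Bit (slot - 1), counted from the right, belongs to this slot.
        if [ $(( (mask >> (slot - 1)) & 1 )) -eq 0 ]; then
            out="${out:+$out }$slot"
        fi
        slot=$(( slot + 1 ))
    done
    echo "$out"
}
```

For example, protected_slots 0x000007ff prints 12, and the default value 0x00000fff prints an empty line because no slot is excluded.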

- It is also possible to exclude several slots at once. In that case, set the bit of each excluded slot to "0" and convert the resulting binary value to hexadecimal.
Example:
reset_slots=0x00000fdb 111111011011 (slots 3 and 6 - 3rd and 6th bit right to left)
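
The conversion can also be done with shell arithmetic instead of by hand. This is a minimal sketch, assuming the 12-slot layout from the table above; the function name reset_slots_mask is hypothetical and is not part of XOS.

```shell
#!/bin/sh
# Hypothetical helper: build the reset_slots line for the given slots.
# Each argument is a slot number (1-12) that must NOT be reset.
reset_slots_mask() {
    mask=$(( 0xfff ))                 # default: all 12 slots may be reset
    for slot in "$@"; do
        if [ "$slot" -lt 1 ] || [ "$slot" -gt 12 ]; then
            echo "slot must be between 1 and 12: $slot" >&2
            return 1
        fi
        # Clear the bit for this slot (slot 1 is the rightmost bit).
        mask=$(( mask & ~(1 << (slot - 1)) ))
    done
    printf 'reset_slots=0x%08x\n' "$mask"
}
```

Running reset_slots_mask 3 6 prints reset_slots=0x00000fdb, which matches the example above.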

3.       Restart the cbssysctrld process.

# service cbssysctrld restart

4.       Enable KDB (kernel debug) on the APM if the crash information is not reported directly to the console port.

# rsh <vap member>

$ echo 1 > /proc/sys/kernel/kdb

5.       To keep the setting across multiple resets and captures, add the line echo 1 > /proc/sys/kernel/kdb to /etc/rc.local.
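
If the line is added by script rather than in an editor, it helps to avoid adding it twice. The following sketch uses a hypothetical helper named ensure_line that appends a line only when the file does not already contain it.

```shell
#!/bin/sh
# Hypothetical helper: append a line to a file unless an identical line
# is already present.
# $1: file to edit   $2: exact line to add
ensure_line() {
    file="$1"
    line="$2"
    # -q quiet, -x match the whole line, -F treat the pattern as plain text
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example (as root):
#   ensure_line /etc/rc.local 'echo 1 > /proc/sys/kernel/kdb'
```

The helper appends at the end of the file. If /etc/rc.local ends with an exit statement, the line has to be placed before that statement by hand.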

6.       The next time the unit becomes unresponsive or an automatic reboot occurs, press Ctrl-A at the physical APM console to break into kdb (kernel debug). The console then shows the kdb prompt:

kdb>

7.       At the APM console, type the following commands at the kdb prompt to collect backtrace information from each CPU. Repeat the commands for each CPU context (0, 1, 2, 3, and so on) as shown below.

cpu 0
bt
lsmod

cpu 1
bt
lsmod

cpu 2
bt
lsmod

cpu 3
bt
lsmod

cpu 0
bta

cpu 1
bta

cpu 2
bta

cpu 3
bta
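
For modules with a different number of CPUs, the same sequence can be generated as a checklist before the console session. This is only a sketch: the function name kdb_command_list is hypothetical, and the CPU count is an argument that you supply. The output is meant to be typed at the kdb prompt, not run in a shell.

```shell
#!/bin/sh
# Hypothetical helper: print the kdb commands listed above for CPUs
# 0 .. N-1, in the same order as in the procedure.
# $1: number of CPUs (defaults to 4, as in the listing above)
kdb_command_list() {
    ncpus="${1:-4}"
    cpu=0
    while [ "$cpu" -lt "$ncpus" ]; do
        printf 'cpu %d\nbt\nlsmod\n' "$cpu"
        cpu=$(( cpu + 1 ))
    done
    cpu=0
    while [ "$cpu" -lt "$ncpus" ]; do
        printf 'cpu %d\nbta\n' "$cpu"
        cpu=$(( cpu + 1 ))
    done
}
```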

 

Revert the changes once all the debug information has been collected:

1.       Disable KDB (kernel debug) on the APM.

# rsh <vap member>

$ echo 0 > /proc/sys/kernel/kdb

2.       Remove the line echo 1 > /proc/sys/kernel/kdb from /etc/rc.local if it was added earlier.
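
The removal can be scripted in the same spirit. The sketch below uses a hypothetical helper named remove_line that deletes only lines matching the given text exactly and leaves the rest of the file untouched.

```shell
#!/bin/sh
# Hypothetical helper: delete every line that exactly matches the given
# text from a file.
# $1: file to edit   $2: exact line to remove
remove_line() {
    file="$1"
    line="$2"
    tmp=$(mktemp) || return 1
    # grep -v exits with 1 when nothing is left to print; only a status
    # above 1 indicates a real error (for example, an unreadable file).
    grep -vxF -- "$line" "$file" > "$tmp"
    rc=$?
    if [ "$rc" -gt 1 ]; then
        rm -f "$tmp"
        return 1
    fi
    # Write back in place so the file keeps its owner and permissions.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example (as root):
#   remove_line /etc/rc.local 'echo 1 > /proc/sys/kernel/kdb'
```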

3.       Edit the /crossbeam/etc/cbssysctrld.cf file and change the "reset_slots" statement back to its default value:
reset_slots=0x00000fff
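
Restoring the default (or setting a different value earlier in the procedure) can be scripted instead of done in an editor. The sketch below is an assumption-laden example: the function name set_reset_slots is hypothetical, and it assumes the statement appears on a line of its own in the form reset_slots=<value>, optionally commented out with "#". It keeps a backup copy with the suffix .bak.

```shell
#!/bin/sh
# Hypothetical helper: set (and uncomment) the reset_slots statement.
# Assumes the statement has the form "reset_slots=<value>" on one line.
# $1: configuration file   $2: new value, for example 0x00000fff
set_reset_slots() {
    file="$1"
    value="$2"
    # Accept only 0x followed by 1 to 8 hexadecimal digits.
    printf '%s\n' "$value" | grep -q '^0x[0-9a-fA-F]\{1,8\}$' || {
        echo "invalid reset_slots value: $value" >&2
        return 1
    }
    pattern='^[[:space:]]*#\{0,1\}[[:space:]]*reset_slots[[:space:]]*=.*'
    grep -q "$pattern" "$file" || {
        echo "no reset_slots statement found in $file" >&2
        return 1
    }
    cp -p "$file" "$file.bak" || return 1
    # Rewrite from the backup into the original file.
    sed "s/$pattern/reset_slots=$value/" "$file.bak" > "$file"
}

# Example (as root):
#   set_reset_slots /crossbeam/etc/cbssysctrld.cf 0x00000fff
```

As with the original edit, the change presumably takes effect only after the cbssysctrld process is restarted.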

NOTE:  The cbsoopsd process may not collect all crash information. If the APM is set not to reset (cbssysctrld) and the setting is returned to the default after the condition occurs and the card is down, cbsoopsd can capture the crash information without the console cable.

Workaround

N/A