Working with CA Support & Troubleshooting a "Crashing" or "Hanging" CA Service Desk Manager Process on a UNIX/LINUX platform

Article ID: 48650

Products

CA IT Asset Manager
CA Software Asset Manager (CA SAM)
ASSET PORTFOLIO MGMT- SERVER
SUPPORT AUTOMATION- SERVER
CA Service Desk Manager - Unified Self Service
KNOWLEDGE TOOLS
CA Service Management - Asset Portfolio Management
CA Service Management - Service Desk Manager

Issue/Introduction

Description:

This document provides the appropriate steps to follow when working with CA Support on a "Crashing" or "Hanging" CA Service Desk Manager process on a UNIX/LINUX platform.

 	** /opt/CAisd is normally the CA Service Desk Manager install directory; it is also referred to as $NX_ROOT.
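If you are unsure of the install location on a particular server, the following sketch shows one way to confirm it (this assumes the default /opt/CAisd path; adjust it if CA SDM was installed elsewhere):

    # Confirm the CA SDM install directory (referred to as $NX_ROOT throughout this document)
    ls -ld /opt/CAisd
    grep NX_ROOT /opt/CAisd/NX.env    # NX.env sits directly under the install directory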

Solution:

Working with CA Support & Troubleshooting a "Crashing" or "Hanging" Process

First, Determine if a Process is "Crashing" or "Hanging"

When a Service Desk process seems to be failing, CA Support may ask whether it is "crashing" or "hanging," and it is sometimes difficult to tell the difference. This document clarifies the distinction and provides the knowledge needed to determine whether, in your case, a process has "crashed" or is "hanging." It is specific to environments running Linux, AIX, Solaris, or HP-UX.

A "Crashing" Process Defined

A Crashing or Crashed process is one that fails in such a way that it either stops running completely or recycles itself.

A "Hanging" or "Hung" Process Defined

A Hanging or Hung process is one that appears not to be responding, but at the same time, still appears to be in a running state.

To determine if the process has crashed, confirm or answer the following:

  1. After the "crash," does the process still show as running when you run pdm_status?

  2. After the "crash," does the process still appear in the operating system's process list (for example, in the output of ps -ef)?

  3. In the Service Desk stdlogs, at the time of the "crash" (which could also be before, during, or after the occurrence is reported to you), search for the word "died" and look for any messages similar to "xxxxxx process died: restarting" (where xxxxxx is a process name such as domsrvr or webengine).

  4. In the Service Desk stdlogs, at the time of the "crash" (which could also be before, during, or after the occurrence is reported to you), search for the word "FATAL" and look for any FATAL-type message, including "EXIT", "SIGSEGV", or "CANNOT ALLOCATE xxxxx BYTES".

If you can answer "No" to #1 and #2, and confirm at least one of the messages in the logs on #3 or #4, then most likely you are experiencing a "crashing" process.

If you answer "Yes" to #1 and #2, and are not able to confirm any of the messages in the logs on #3 or #4, then you are most likely experiencing a "hanging" process.
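As a quick reference, checks #1 through #4 above can be run from the command line. The sketch below is illustrative only; it assumes $NX_ROOT is /opt/CAisd and uses domsrvr as the example process name:

    # 1. Is the process still reported as running?
    pdm_status

    # 2. Does the process still appear in the OS process list?
    ps -ef | grep domsrvr | grep -v grep

    # 3. and 4. Search the stdlogs around the time of the failure
    cd /opt/CAisd/log
    grep -i "died" stdlog*
    grep -i "FATAL" stdlog*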

If a process appears to be in a "hung" state and does not appear to be responding, please confirm this by performing the following steps:

First, run the following command to see if the process responds to a request via the command line: "pdm_diag -a {slump name of process}"

** To get the slump name of the process, you can run the slstat command and redirect its output to a file with the following command: "slstat > slstat.txt"

Example: If a webengine were hanging, and the slstat output showed the slump name of the failing webengine to be "web:local", you would run the following command to see if that webengine process is responding: "pdm_diag -a web:local"

If you receive information back from the process, then the process IS actually responding. If you do not receive information back from the process, and it appears the command is hanging, then the process is most likely in a "hung state" and will not respond with any information.
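For example, the following sketch captures the slump names and then checks whether the suspect webengine responds (the slump name web:local is taken from the example above and may differ in your environment):

    # Capture the slump names of all running Service Desk processes
    slstat > slstat.txt
    grep -i web slstat.txt

    # Ask the suspect process for diagnostics; if this command hangs,
    # the process is most likely in a hung state
    pdm_diag -a web:local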

Then run the following two commands to turn on advanced tracing and logging for the hung process and let it run for about 30 seconds:

"pdm_logstat -n {slump name of process} TRACE""bop_logging {slump name of process} -f $NX_ROOT/log/{processname}.out -n 10 -m 20000000 ON" 

NOTE: In most cases it is good practice to turn bop logging on for all domsrvrs, webengines, and spelsrvrs, even the ones that are not hanging or crashing. This will allow CA Support and Sustaining Engineering to see how other processes are being affected by the hanging or crashing process.

Then turn the logging off by running the following commands:

    "pdm_logstat -n {slump name of process}"    "bop_logging {slump name of process} OFF" 
    Example:    Using the same example above for a hanging webengine process, the syntax would be as follows:     "pdm_logstat -n web:local TRACE" "bop_logging web:local -f $NX_ROOT/log/weblocal.out ON" 	 

** The output files for this logging will be written to the Service Desk log directory, so they will be uploaded along with the log directory to the support issue once all required files, output, and information have been gathered.
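Putting the above together for the hanging webengine example, a complete trace cycle might look like the following sketch (the slump name web:local and the output file name weblocal.out are carried over from the example above):

    # Turn on advanced tracing and bop logging for the suspect process
    pdm_logstat -n web:local TRACE
    bop_logging web:local -f $NX_ROOT/log/weblocal.out -n 10 -m 20000000 ON

    # Let the tracing run for roughly 30 seconds while the hang is occurring
    sleep 30

    # Turn the tracing and logging back off
    pdm_logstat -n web:local
    bop_logging web:local OFF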

Steps to take once you have confirmed that you have a "Crashing" or "Hanging" process:

It is always best to have a crash dump file generated for a "crashing" or "hanging" process. Once a crash dump file is generated, your CA Support Engineer will work with the Sustaining Engineering Team to try and pinpoint the probable cause of the crash or hang.

Crash dump files can be generated in multiple ways - depending on your environment, and whether the process had been determined to be "crashing" or "hanging."

How to generate a core dump on UNIX/LINUX

The most generic way is to ensure that no system or user level limits prevent core dumps from being generated. These limits can normally be displayed by running the following command:

    ulimit -a 

One of these limits applies to core dumps and is controlled by the -c flag. The following command sets this option to unlimited:

    ulimit -c unlimited 

Changing this limit does NOT immediately affect a running process (i.e., if CA SDM is already running). The command only affects processes started from your current shell from that point onwards, and the change is lost once you exit the shell. To make the change apply to the CA SDM processes, you need to add the command to a shell profile that is sourced when CA SDM starts. Normally, the initial CA SDM processes source a custom profile named /etc/profile.CA on the primary server. On secondary servers, you can simply update /etc/profile for this to be enabled.

Add the above ulimit command at the end of that file, then restart CA SDM; from then on, the changed limit will apply.
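For example, on a primary server the change could be applied along these lines (a sketch; restart CA SDM using your site's normal procedure):

    # Append the new core dump limit to the profile sourced at CA SDM startup
    echo "ulimit -c unlimited" >> /etc/profile.CA

    # Stop CA SDM, then start it again using your normal startup procedure
    pdm_halt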

Note:

Some systems/platforms enforce additional restrictions on the generation of core dump files. The following links may help you understand and overcome these restrictions:

SUSE: http://www.novell.com/support/kb/doc.php?id=3054866
Redhat: https://access.redhat.com/site/solutions/5352
Solaris: http://www.oracle.com/technetwork/server-storage/solaris/manage-core-dump-138834.html
AIX: http://publib.boulder.ibm.com/infocenter/realtime/v2r0/index.jsp?topic=%2Fcom.ibm.softrt.aix32.doc%2Fdiag%2Fproblem_determination%2Faix_setup_full_core.html

Note:

To verify that the option is working, you can run kill -6 <PID_of_a_ServiceDesk_Process> and check whether a core dump gets created.
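A minimal test sequence might look like the sketch below (domsrvr is used as an example process; the core file name and location depend on your system's core pattern settings):

    # Find the PID of a Service Desk process to test with
    ps -ef | grep domsrvr | grep -v grep

    # Send signal 6 (SIGABRT) to that PID to force a core dump
    kill -6 <PID_of_a_ServiceDesk_Process>

    # Check for the resulting core file (name/location vary by core pattern)
    ls -l core*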

What to do after the dump file has been generated:

Once a dump file has been generated for a crashing or hanging process, please fill out a "Crash Dump Template" as supplied to you by CA Support, shown below. This will serve as a checklist for you to gather all the required files, information, and data needed by CA Support to analyze the dump file(s) and help pinpoint the source of the crash or hang. The following is a copy of the UNIX/LINUX Crash Dump Template document - which should be supplied to you by CA Support (separately from this document):

UNIX/LINUX Crash Dump Template

  • Please fill this out as best you can after you capture a dump file for a dying, crashing or hanging process.
  • Simply insert your answers/information to these items in-line below each item.
  • You may cut and paste this template into the issue via support.ca.com, or you may save it and upload it to the issue as an attachment.
  • If you are unsure about a specific item - please ask your CA Support Engineer for clarification.
  1. Review the stdlog file that captures the timeframe of when the dump occurred and supply us with the following information:

    • Was the process ended by a SIGSEGV message, a SIGBUS message, or any other "FATAL"-type message?

    • What is seen in the stdlog file right before, during, and after the time the process crashed?

    • What errors, if any, were reported in the logs right before, during, and after the time when the process crashed?

    • Specify the filename of the dump file (or zip/tar file that contains the dump file) here.

  2. In what location was the 'core.dmp' file first found?

  3. Specify the date/time the dump file was generated.

  4. Provide information about the core file by running "file core.dmp" (first change to the directory where core.dmp resides). Note down the output, which includes the name of the process that produced the core. Upload the output of this command to the issue.

  5. How many times has the failing process crashed since first reported?

  6. Are there any possible reproducible steps noted prior to when this crash/hang occurs?

  7. Supply a "Recursive Directory Listing" output of the Service Desk root directory (NX_ROOT). You can obtain this by running: cd /opt/CAisd ; ls -alR > recursive_nxroot_listing.txt This generates a file called recursive_nxroot_listing.txt. Please upload this file and specify the name of the file (or tar file that contains it) here. (A combined collection sketch for items 7 through 13 also follows the end of this template.)

  8. Navigate to the $NX_ROOT/bin directory and run pdm_ident {process name} > pdm_ident.out, where {process name} is the name of the Service Desk process for which the dump file was generated. If the failing process is javaw, run pdm_ident on the sda65.so file instead, as the javaw process does not contain pdm_ident information. Please upload the pdm_ident.out file and specify the name of the file (or zip file that contains the output file) here.

  9. Attach your patch history file ($NX_ROOT/<machine name>.his) to the issue and specify the name of the file (or zip file that contains the history file) here.

  10. zip/tar up the entire $NX_ROOT/log directory and attach it to the issue, and specify the name of the file here.

  11. zip/tar up the $NX_ROOT/site/mods directory and attach it to the issue, and specify the name of the file here.

  12. Upload the O/S log /var/log/messages for the time frame when the core file was produced.

  13. Upload the file uname_info.txt as a result of this command: uname -a > uname_info.txt

    1. The output of the command below, which captures an event log from the operating system, would also be very helpful:

       Solaris: dmesg > dmesg_log.txt

       LINUX: Create a perl script named friendly_dmesg.pl with the content below, then run it from the command line: dmesg | perl friendly_dmesg.pl > dmesg_friendly_log.txt

      #!/usr/bin/perl -w
      use strict;

      # Read the system uptime (in seconds) so kernel timestamps can be converted to wall-clock times.
      open(my $up, '<', '/proc/uptime') or die "cannot read /proc/uptime: $!";
      my ($uptime) = (<$up> =~ /^(\d+)\./);
      close $up;

      # Each dmesg line looks like: [12345.678901] message text
      while (my $line = <STDIN>) {
          next unless $line =~ /^\[\s*(\d+)\.\d+\](.+)/;
          printf "[%s]%s\n", scalar localtime(time - $uptime + $1), $2;
      }

       AIX: alog -t boot -o > alog_log.txt

      ***end of crash dump template***
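As an illustration, much of the file gathering requested in items 7 through 13 of the template could be done with a sequence like the sketch below (the /opt/CAisd path, the domsrvr example, and the archive names are assumptions; adjust them to your environment and to your CA Support Engineer's instructions):

    # Item 7: recursive listing of NX_ROOT
    cd /opt/CAisd ; ls -alR > recursive_nxroot_listing.txt

    # Item 8: version information for the failing process (domsrvr used as an example)
    cd /opt/CAisd/bin ; pdm_ident domsrvr > pdm_ident.out

    # Items 10 and 11: archive the log and site/mods directories
    tar -cvf sdm_log_dir.tar /opt/CAisd/log
    tar -cvf sdm_site_mods.tar /opt/CAisd/site/mods

    # Items 12 and 13: O/S log and system information
    cp /var/log/messages os_messages.txt
    uname -a > uname_info.txt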

After Filling Out the Crash Dump Template

Once you have generated the crash dump file and have gathered all the required information, files, and data as per the Crash Dump Template document, please upload everything to your CA Support issue. Please be sure to label the filenames of all uploaded files so that CA Support can easily see which file is which. We have found that the best way to do this is to gather all the files and output first, set appropriate file names, and then, under each respective item on the Crash Dump Template document, simply write the name of the file that corresponds with that item, if applicable.

Once all the required files and information have been uploaded to the support issue, your CA Support Engineer will review what was supplied and will then engage the Sustaining Engineering Team to assist in the analysis of the dump files.

What should I do if additional dump files are produced for additional occurrences of the same exact problem on the same server?

Sometimes multiple occurrences will produce multiple dump files. To avoid confusion and "clouding" of your open support issue, do NOT upload the additional dump files and logs without talking to your CA Support Engineer first. There is no need to upload multiple dump files for the same problem unless specifically requested by your CA Support Engineer. The CA Support Team may already have found the problem and may be working on possible resolutions or code changes to fix it; adding more files, logs, and updates may only cloud the issue and make it more difficult for others to review.

What should I do if a similar, but not exactly the same problem occurs on the same server?

If you experience a problem that is similar to the previous occurrence but not exactly the same (for example, the original problem was a hanging webengine process, and now you are experiencing a hanging spelsrvr process), it should be treated as a different problem, and a separate new issue should be opened. Follow the same steps that were followed for the original problem, including filling out the Crash Dump Template document and uploading the files and information specific to the new problem in the new issue.

What should I do if the same problem (as the original issue) occurs on a different server?

If the same process crashes or hangs on a different server, follow all the same steps you did to generate the original crash dump, and fill out the Crash Dump Template document for the server where the new crash or hang occurred. You may upload the new crash dump, the completed Crash Dump Template document, and all required information and files to the original issue; however, you MUST make sure that ALL files are appropriately labeled so it is easy to see that they come from a different server than the one where the original issue occurred. The best way to do this is to zip up ALL of the files for the new occurrence into one zip file labeled with the second server's name and the date of the occurrence.
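For example, the files for an occurrence on a second server could be packaged into a single, clearly labeled archive along these lines (a sketch; the server name and date placeholders, and the file names, are for illustration only):

    # Bundle all files gathered per the Crash Dump Template for the new occurrence
    # into one archive labeled with the second server's name and the occurrence date
    tar -cvf crashdump_<second_server_name>_<YYYYMMDD>.tar core.dmp pdm_ident.out uname_info.txt recursive_nxroot_listing.txt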

Environment

Release: UAPMAC990JPP-12.6-Asset Portfolio Management-Asset Configuration