Hung NAC/Mgmt Server

Article ID: 195889

Products

CA Release Automation - Release Operations Center (Nolio)

CA Release Automation - DataManagement Server (Nolio)

Issue/Introduction

Suddenly, the NAC/Mgmt server has completely stopped responding. 

Note: 

This article describes a specific, rare scenario in which none of the product or Tomcat logs are updated for several minutes. If the product seems very slow but is still logging information, then that is more likely a performance problem and this article does not apply. 

Also, this was observed in an environment with two NAC servers online: one primary and one secondary (active/passive, a supported HA configuration). It is unclear whether this played a role in the described behavior. Even in NAC HA setups, log messages are generated on the secondary NAC at least every 30 seconds. There is no known explanation for nolio_dm_all.log missing several minutes' worth of information other than:

  • Server reboot
  • An adjustment of the server's date/time.

This is rare and should be easy to spot on an actively impacted system - just look at the last few messages of an active log file (like the nolio_dm_all.log on NACs). If the last time it wrote to the log was a few minutes ago then this article applies. 
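As a rough aid for the check described above, the following sketch reports how long ago a log file was last written and flags it when that exceeds a threshold. It assumes a Linux host with GNU coreutils (`stat -c %Y`). The function names `log_age_seconds` and `log_is_stale`, the 120-second default, and the log path in the usage comment are illustrative assumptions, not part of the product.

```shell
#!/bin/sh
# Sketch: detect a log file that has stopped being updated.
# Assumes GNU coreutils (stat -c %Y); names and threshold are illustrative.

# Print the number of seconds since FILE was last modified.
log_age_seconds() {
    file=$1
    now=$(date +%s)
    mtime=$(stat -c %Y "$file") || return 2
    echo $((now - mtime))
}

# Succeed (exit 0) if FILE has not been written to for more than MAX_AGE seconds.
log_is_stale() {
    file=$1
    max_age=${2:-120}
    age=$(log_age_seconds "$file") || return 2
    [ "$age" -gt "$max_age" ]
}

# Example usage (the path is an assumption; adjust it to the NAC installation):
# if log_is_stale /opt/ReleaseAutomationServer/logs/nolio_dm_all.log 120; then
#     echo "nolio_dm_all.log has not been updated for over 2 minutes"
# fi
```

A stale result on its own does not confirm the scenario in this article; rule out a server reboot or a date/time adjustment first.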

 

 

Cause

The root cause of this problem has not been determined and still needs to be investigated. The "Additional Information" section describes what information is needed for root cause analysis. That section will also have links to other KB Articles that may seem similar to this one.

The "Resolution" section describes steps that have been used to recover Nolio RA. 

 

Environment

Release : 6.6

Component : CA RELEASE AUTOMATION CORE

 

Resolution

To recover from this, please:

 

Additional Information

If this problem has occurred and root cause is needed then the following must be done before attempting to recover.

  • Check whether the RA service is up and running and identify its process ID. Do not rely on ./nolio_server.sh status alone; also confirm that the process is actually alive.
    Example:
    > ./nolio_server.sh status
    Nolio ASAP Service is running: PID:31747

  • Verify and capture the process-related information.
    Example:
    > ps aux | grep <pid from step1>
    root      31747  1.1 11.3 6546464 1848500 ?     Sl   Jul16 132:28 ./jre/bin/java_real -Djava.util.logging.config.file=/opt/ReleaseAutomationServer/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Duser.country=US -Duser.language=en -Xms512m -Xmx2048m -XX:+HeapDumpOnOutOfMemoryError -Djava.library.path=./bin -Djava.endorsed.dirs=/opt/ReleaseAutomationServer/endorsed -classpath /opt/ReleaseAutomationServer/bin/bootstrap.jar:/opt/ReleaseAutomationServer/bin/tomcat-juli.jar -Dcatalina.base=/opt/ReleaseAutomationServer -Dcatalina.home=/opt/ReleaseAutomationServer -Djava.io.tmpdir=/opt/ReleaseAutomationServer/temp org.apache.catalina.startup.Bootstrap start start

  • If the process is up and running then check if this NAC node is alive:
    GET http://{NAC host}:{NAC port}/datamanagement/availability
    This API call can be made with curl or any other HTTP client.

  • Generate a thread dump of the nolio server process.
    • A JDK is needed on the NAC's host to run the jstack command below. If a JDK is not installed then:
      • Use the alternative thread dump command mentioned below; or
      • Install a JDK. Do this before recovering if possible. If it cannot be installed until after recovery, it will at least be available if the problem recurs. A thread dump gathered after the NAC has been stopped and started will not help in identifying the root cause.

    > jstack -l <pid from step1> >> nolio_server.log

    • Alternatively, if JDK is not available, use the following command to capture a thread dump:
      > kill -3 <pid from step1>
      Note: The output of this command will be sent to the logs/catalina.out file. 

  • Collect all NAC logs.
    Example:
    > tar -cvpzf ./logs_date.tgz /nac_home_path/logs

  • Collect the logs from the load balancer, HAProxy, Apache, or any other tool that is used to proxy requests to the primary/secondary NAC nodes.
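The collection steps above could be combined into a single script so that nothing is missed before recovery. The following is a minimal sketch under stated assumptions, not a supported tool: the function names (`pid_from_status`, `process_alive`, `collect_evidence`), the argument order, the output file names, the 30-second curl timeout, and the 5-second pause after `kill -3` are all illustrative. It also assumes that curl is installed and that the script runs as a user allowed to signal the NAC process.

```shell
#!/bin/sh
# Sketch: gather root-cause evidence from a hung NAC before recovering it.
# Function names, arguments, and output file names are illustrative.

# Extract the PID from "./nolio_server.sh status" output read on stdin,
# e.g. "Nolio ASAP Service is running: PID:31747" gives 31747.
pid_from_status() {
    sed -n 's/.*PID:\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Succeed if a process with the given PID exists. kill -0 covers processes
# owned by the caller; the /proc check (Linux only) covers other users.
process_alive() {
    kill -0 "$1" 2>/dev/null || [ -d "/proc/$1" ]
}

# collect_evidence PID NAC_HOME NAC_HOST NAC_PORT OUT_DIR
collect_evidence() {
    pid=$1; nac_home=$2; host=$3; port=$4; out=$5
    mkdir -p "$out" || return 1

    if ! process_alive "$pid"; then
        echo "PID $pid is not running" > "$out/process.txt"
    else
        # Process details (full command line), similar to "ps aux | grep <pid>".
        ps -f -p "$pid" > "$out/process.txt" 2>&1

        # Availability check; a hung NAC may never answer, so cap the wait.
        curl -sS --max-time 30 -o "$out/availability.txt" \
            "http://$host:$port/datamanagement/availability" \
            2> "$out/availability_error.txt"

        # Thread dump: prefer jstack, fall back to kill -3 when no JDK is present.
        if command -v jstack > /dev/null 2>&1; then
            jstack -l "$pid" > "$out/thread_dump.txt" 2>&1
        else
            kill -3 "$pid"   # the dump is written to logs/catalina.out
            sleep 5          # assumed pause so the JVM can finish writing it
        fi
    fi

    # Archive the logs last so a kill -3 dump in catalina.out is included.
    tar -czpf "$out/nac_logs.tgz" -C "$nac_home" logs
}

# Example usage (host, port, and paths are placeholders):
# pid=$(./nolio_server.sh status | pid_from_status)
# collect_evidence "$pid" /nac_home_path "<NAC host>" "<NAC port>" ./evidence
```

Logs from the load balancer or proxy typically live on a different host and still need to be collected separately.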

 

 

Related Articles: