OutOfMemory error from ActiveMQ causes Data Collector disconnects

Article ID: 141779

Products

CA Infrastructure Management, CA Performance Management - Usage and Administration, DX NetOps

Issue/Introduction

Data Collectors disconnected and polling stopped when the Data Aggregator iRep queue became full. This may have been triggered by OutOfMemory (OOM) errors in the Data Collector's ActiveMQ service. When that happens and ActiveMQ is not restarted, it may lead to data loss or other Data Collector discovery and inventory failures.

Cause

ActiveMQ OutOfMemory (OOM) errors can stall processing of the ActiveMQ queues while the ActiveMQ service itself keeps running and is never restarted. Once any DA iRep queue in ActiveMQ for a DC stops being processed by that DC, the DA no longer consumes from DADistIrepManager, which causes DCs to fail to restart.

The OutOfMemory (OOM) errors that trigger the problem are recorded in the activemq.log files on the Data Collectors. By default, the logs are found in the /opt/IMDataCollector/broker/apache-activemq-<version>/data directory.

Sample of the errors seen:

2019-12-08 07:17:32,408 | ERROR | Checkpoint failed | org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal Checkpoint Worker
java.lang.OutOfMemoryError: Java heap space
2019-12-08 07:17:32,408 | INFO  | Ignoring no space left exception, java.io.IOException: Java heap space | org.apache.activemq.util.DefaultIOExceptionHandler | ActiveMQ Journal Checkpoint Worker
java.io.IOException: Java heap space
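
As a quick way to check a Data Collector for this condition, the log can be searched for the error string shown in the sample. The following is a minimal sketch, not a product utility; the helper name count_oom_errors is hypothetical.

```shell
# Sketch: count OutOfMemoryError entries in an ActiveMQ log file.
# count_oom_errors is a hypothetical helper name, not part of the product.
# Usage: count_oom_errors /path/to/activemq.log
count_oom_errors() {
    # -c prints the number of matching lines, -F treats the pattern as a
    # fixed string. "|| true" keeps a zero count from reading as a failure.
    grep -cF 'java.lang.OutOfMemoryError' "$1" || true
}
```

With the default install path, the argument would be /opt/IMDataCollector/broker/apache-activemq-<version>/data/activemq.log, using the ActiveMQ version installed on that Data Collector. A non-zero count suggests the Data Collector has hit the OOM condition described here.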

Environment

Performance Management releases r3.7.4 and older

Resolution

The r3.7.5 release contains new code, via defect DE409714, which enables an ActiveMQ service restart automatically on Data Collectors when ActiveMQ records an OOM error.

Until release r3.7.5 or newer can be installed, the same behavior can be enabled manually with the following changes.

On the Data Collector, edit the activemq script found in the /opt/IMDataCollector/scripts directory (default path).

There are two changes we need to make in that file.

A. The first change is made near the top of the file. This is what the line looks like in a default file:

ACTIVEMQ_OPTS="$ACTIVEMQ_OPTS_MEMORY -Djava.util.logging.config.file=logging.properties -Djava.security.auth.login.config=$activemqhome/conf/login.config";export ACTIVEMQ_OPTS

This is what it looks like after the recommended changes:

# Next line is new to enable AMQ restart on OOM re: DE409714 fixed in r3.7.5 and newer releases
ACTIVEMQ_OPTS_OOM="-XX:OnOutOfMemoryError='$dchome/scripts/activemq restart'"
# Original line
# ACTIVEMQ_OPTS="$ACTIVEMQ_OPTS_MEMORY -Djava.util.logging.config.file=logging.properties -Djava.security.auth.login.config=$activemqhome/conf/login.config";export ACTIVEMQ_OPTS
# New line to enable AMQ restart on OOM re: DE409714 fixed in r3.7.5 and newer releases
ACTIVEMQ_OPTS="$ACTIVEMQ_OPTS_MEMORY $ACTIVEMQ_OPTS_OOM -Djava.util.logging.config.file=logging.properties -Djava.security.auth.login.config=$activemqhome/conf/login.config";export ACTIVEMQ_OPTS

B. The second change is in the section that begins with:

start() {
    echo "Starting ActiveMQ"

In the ACTIVEMQ_OPTS line that follows the done statement, add a reference to the ACTIVEMQ_OPTS_OOM variable defined at the top of the file. After editing, the line should read:

 ACTIVEMQ_OPTS="$ACTIVEMQ_OPTS_MEMORY $WILY_OPTS $ACTIVEMQ_OPTS_OOM -Djava.util.logging.config.file=logging.properties -Djava.security.auth.login.config=$activemqhome/conf/login.config";export ACTIVEMQ_OPTS
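
Because the same variable has to appear in three places (its definition plus the two ACTIVEMQ_OPTS assignments), a quick check of the edited file can catch a missed edit. This is a minimal sketch under that assumption; check_oom_edits is a hypothetical helper, not something shipped with the product.

```shell
# Sketch: confirm the activemq script carries the OOM restart edits.
# check_oom_edits is a hypothetical helper name, not part of the product.
# Usage: check_oom_edits /path/to/scripts/activemq
check_oom_edits() {
    script="$1"
    # The variable must be defined on an uncommented line.
    grep -Eq '^[[:space:]]*ACTIVEMQ_OPTS_OOM=' "$script" || return 1
    # Count uncommented ACTIVEMQ_OPTS assignments that reference the variable.
    refs=$(grep -Ec '^[[:space:]]*ACTIVEMQ_OPTS=.*\$ACTIVEMQ_OPTS_OOM' "$script" || true)
    # Expect both the top-of-file line and the line inside start().
    [ "$refs" -ge 2 ]
}
```

For example, check_oom_edits /opt/IMDataCollector/scripts/activemq && echo "edits present" prints the message only when the definition and both references are found.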

To make the changes:

  1. Edit the file.
  2. Stop the AMQ service and allow dcmd on the DC to restart it on its own.

After the AMQ service has restarted, the process listing for the new process should contain "-XX:OnOutOfMemoryError=/opt/IMDataCollector/scripts/activemq restart", which indicates the change was applied successfully.
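
One way to run that check is to filter the process listing for the flag. The sketch below reads a listing on standard input so it can be fed from ps -ef; has_oom_flag is a hypothetical helper name.

```shell
# Sketch: succeed if a process listing read from stdin shows the OOM flag.
# has_oom_flag is a hypothetical helper name, not part of the product.
# Usage: ps -ef | has_oom_flag && echo "OOM restart option is active"
has_oom_flag() {
    # The [E] bracket keeps this grep from matching its own command line
    # when the input comes from ps. "--" stops the leading dash in the
    # pattern from being parsed as an option.
    grep -q -- '-XX:OnOutOfMemory[E]rror=.*activemq restart'
}
```

If the helper reports no match after the restart, the edited script was probably not the one used to start the service, or one of the two ACTIVEMQ_OPTS lines was missed.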

After that, if the DCs encounter further OOM errors, the AMQ service will be restarted automatically.

Additional Information

Did the ActiveMQ service restart on my Data Collector? If the auto-restart was triggered, a few things will show it.

  • The ActiveMQ service will be running with a new PID.
  • The following message appears in the activemq.log file. This sample is from a lab system where the service was restarted on 12/16/19 at 10:09 AM EST.
    • 2019-12-16 10:09:12,393 | INFO | Refreshing org.apache.activemq.xbean.XBeanBrokerFactory$1@<hash>: startup date [Mon Dec 16 10:09:12 EST 2019]; root of context hierarchy | org.apache.activemq.xbean.XBeanBrokerFactory$1 | main
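
To see when such restarts happened, the log can be filtered for that startup message. This is a minimal sketch that assumes the message format matches the sample above; list_amq_restarts is a hypothetical helper name.

```shell
# Sketch: print the timestamp of each broker startup recorded in a log file.
# list_amq_restarts is a hypothetical helper name, not part of the product.
# Usage: list_amq_restarts /path/to/activemq.log
list_amq_restarts() {
    # Keep only the "Refreshing ..." startup lines, then strip everything
    # from the first " |" separator onward, leaving the timestamp.
    grep 'Refreshing .*XBeanBrokerFactory' "$1" | sed 's/ *|.*//'
}
```

Each line of output is one startup. A timestamp that falls shortly after an OutOfMemoryError entry in the same log is consistent with the automatic restart having fired.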