I noticed two of our DCs had a failing configuration status. After running service dcmd stop and service activemq stop I was able to see all health checks come up, but then the Java process failed after 5 minutes when running a status command.
Data Collector disconnects from Data Aggregator.
Data Collector dcmd service doesn't stay running after being started.
All supported Performance Management releases
NOTE: This is only observed in environments configured with Fault Tolerant Data Aggregators.
This has not been observed for standalone Data Aggregator environments.
There are full, stale irep queues for the affected DCs in the DA AMQ configuration. These are visible via the DA AMQ web UI.
Note: in a Fault Tolerant DA configuration, direct the URL to the Active DA, not the Consul Proxy host.
Once logged in, select the Queues option.
Filter the available list of queues using the "Queue Name Filter" field and the term 'irep'.
The affected DCs will show full/stale DIP-poll.responses.irep-<DCName> and DIP-req.responses.irep-<DCName> queues. They will show values greater than 0 in "Number of Pending Messages", which indicates the queues aren't being processed and cleared properly.
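The same statistics can also be pulled from the command line if preferred. The following is a minimal sketch that reads the queue listing from the classic ActiveMQ admin console as XML; the port (8161), the admin/admin credentials, the host placeholder, and the expectation that each queue's stats line immediately follows its name line are all assumptions that must be adjusted to your Active DA's AMQ configuration.
# Hedged sketch: dump the queue statistics from the classic ActiveMQ admin
# console as XML and keep each irep queue entry plus the line after it
# (typically the stats line). Port, credentials, and host are assumptions.
curl -s -u admin:admin "http://<Active_DA_HostName>:8161/admin/xml/queues.jsp" | grep -i -A 1 irep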
This is a known issue open with engineering as defect ID DE478432.
The following are the steps engineering recommends to keep this problem from reaching the point of a down DC and data loss.
We look for any irep queue with a queueSize greater than 100. If any are seen, we purge them and restart AMQ on the affected DC to clear the issue.
1. Run this command to see whether any queueSize is over 100. If none are, nothing will be printed. Any queue with a queueSize greater than 100 will be shown.
/opt/IMDataAggregator/scripts/activemqstat | awk '{print $1" queueSize="$2" prod="$3" consumer="$4" enq="$5" deq="$6" fwd="$7" memoryUsage="$NF}' | grep irep | grep -v memoryUsage=0
Sample response that might be seen:
DIM.requests.irep-<Affected-DC_HostName>:4e2f58d8-4281-4a94-949b-ee4b44d344c7 queueSize=34537 prod=1 consumer=0 enq=418394 deq=383857 fwd=319908 memoryUsage=35
DIP-poll.responses.irep-<Affected-DC_HostName>:4e2f58d8-4281-4a94-949b-ee4b44d344c7 queueSize=571 prod=1 consumer=0 enq=50714 deq=50143 fwd=48927 memoryUsage=31
DIP-req.responses.irep-<Affected-DC_HostName>:4e2f58d8-4281-4a94-949b-ee4b44d344c7 queueSize=1704 prod=1 consumer=0 enq=14724 deq=13020 fwd=8196 memoryUsage=1
An increasing queueSize value above 100 indicates the issue is starting.
If desired, to list all existing irep queues regardless of queueSize or memoryUsage, try this:
/opt/IMDataAggregator/scripts/activemqstat | awk '{print $1" queueSize="$2" prod="$3" consumer="$4" enq="$5" deq="$6" fwd="$7" memoryUsage="$NF}' | grep irep
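If a filter keyed directly on the queueSize column is preferred over the memoryUsage filter above, the following is a minimal sketch; it assumes the queue name and queueSize sit in columns 1 and 2 of the activemqstat output, matching the awk mapping used in the commands above.
# Hedged sketch: print only irep queues whose queueSize (column 2) is above 100.
/opt/IMDataAggregator/scripts/activemqstat | awk '$1 ~ /irep/ && $2+0 > 100 {print $1" queueSize="$2}'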
2. When the queueSize starts showing values greater than 100, we first need to purge the queues with the script named PurgeOneQueue.
Example output would look like this:
[root@<DA_HostName> opt]# ./PurgeOneQueue DIP-poll.responses.irep-<Affected-DC_HostName>:a304141d-07c8-4e21-9f25-b8259599b330
DIP-poll.responses.irep-<Affected-DC_HostName>_a304141d-07c8-4e21-9f25-b8259599b330
INFO: Loading '/opt/IMDataAggregator/broker/apache-activemq-5.15.8//bin/env'
INFO: Using java '/opt/IMDataAggregator/jre/bin/java'
Java Runtime: AdoptOpenJDK 1.8.0_222 /opt/IMDataAggregator/jre
Heap sizes: current=62976k free=61992k max=932352k
JVM args: -Xms64M -Xmx1G -Djava.util.logging.config.file=logging.properties -Djava.security.auth.login.config=/opt/IMDataAggregator/broker/apache-activemq-5.15.8//conf/login.config -Dactivemq.classpath=/opt/IMDataAggregator/broker/apache-activemq-5.15.8//conf:/opt/IMDataAggregator/broker/apache-activemq-5.15.8//../lib/: -Dactivemq.home=/opt/IMDataAggregator/broker/apache-activemq-5.15.8/ -Dactivemq.base=/opt/IMDataAggregator/broker/apache-activemq-5.15.8/ -Dactivemq.conf=/opt/IMDataAggregator/broker/apache-activemq-5.15.8//conf -Dactivemq.data=/opt/IMDataAggregator/broker/apache-activemq-5.15.8//data
Extensions classpath:
[/opt/IMDataAggregator/broker/apache-activemq-5.15.8/lib,/opt/IMDataAggregator/broker/apache-activemq-5.15.8/lib/camel,/opt/IMDataAggregator/broker/apache-activemq-5.15.8/lib/optional,/opt/IMDataAggregator/broker/apache-activemq-5.15.8/lib/web,/opt/IMDataAggregator/broker/apache-activemq-5.15.8/lib/extra]
ACTIVEMQ_HOME: /opt/IMDataAggregator/broker/apache-activemq-5.15.8
ACTIVEMQ_BASE: /opt/IMDataAggregator/broker/apache-activemq-5.15.8
ACTIVEMQ_CONF: /opt/IMDataAggregator/broker/apache-activemq-5.15.8/conf
ACTIVEMQ_DATA: /opt/IMDataAggregator/broker/apache-activemq-5.15.8/data
INFO: Purging all messages in queue: DIP-poll.responses.irep-<Affected-DC_HostName>_a304141d-07c8-4e21-9f25-b8259599b330
[root@<DA_HostName> opt]#
We MUST purge every queue listed before moving on to step 3. If we don't, the queues won't reset and the problem state will remain for the DC. A hedged sketch for purging all affected queues in one pass follows below.
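Where several queues are over the threshold, purging them one at a time is tedious. The following is a minimal sketch that loops over every irep queue with a queueSize above 100 and purges each; it assumes PurgeOneQueue sits in the current working directory and accepts the full queue name as its only argument (as in the example above), and that columns 1 and 2 of the activemqstat output are the queue name and queueSize.
#!/bin/bash
# Hedged sketch: purge every irep queue whose queueSize exceeds 100.
# Assumes ./PurgeOneQueue takes the full queue name as its only argument
# and that activemqstat prints the queue name in column 1 and the
# queueSize in column 2, matching the step 1 command above.
STAT=/opt/IMDataAggregator/scripts/activemqstat
"$STAT" | awk '$1 ~ /irep/ && $2+0 > 100 {print $1}' | while read -r queue; do
    echo "Purging ${queue}"
    ./PurgeOneQueue "${queue}"
done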
3. Last, we shut down the activemq service on the affected DC. This allows the dcmd service to continue running and polling; dcmd will restart the activemq service when it sees it is down. Run this:
systemctl stop activemq
A minute or two later it should be running again. Check with:
systemctl status activemq
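If you would rather not keep checking manually, the following is a minimal sketch that polls systemd until dcmd has brought activemq back up, giving up after roughly five minutes.
# Hedged sketch: wait for dcmd to restart activemq, polling systemd every
# 10 seconds and giving up after about 5 minutes.
for i in $(seq 1 30); do
    if systemctl is-active --quiet activemq; then
        echo "activemq is running again"
        break
    fi
    sleep 10
done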
4. On the DA, check the queues again. We may see enq and deq values that are the same while the fwd value is lower; meanwhile queueSize should be 0 or close to it.
/opt/IMDataAggregator/scripts/activemqstat | awk '{print $1" queueSize="$2" prod="$3" consumer="$4" enq="$5" deq="$6" fwd="$7" memoryUsage="$NF}' | grep irep
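To confirm the queues stay down rather than backing up again, a short watch loop such as the sketch below can be used; it simply reprints the irep queueSize values every 30 seconds for about five minutes, reusing the same assumed column positions.
# Hedged sketch: re-check the irep queueSize values every 30 seconds for
# roughly 5 minutes to confirm they remain at or near 0 after the restart.
for i in $(seq 1 10); do
    date
    /opt/IMDataAggregator/scripts/activemqstat | awk '$1 ~ /irep/ {print $1" queueSize="$2}'
    sleep 30
done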
To obtain a copy of the PurgeOneQueue script, please engage the Performance Management Support team via a new Support case. Ensure this Knowledge Base article is referenced.