I noticed two of our DCs had a failing configuration status. After running service dcmd stop and service activemq stop I was able to see all health checks come up, but then the Java process failed after 5 minutes when running a status command.
Data Collector disconnects from Data Aggregator.
Data Collector dcmd service doesn't stay running after being started.
All supported Performance Management releases
NOTE: This is only observed in environments configured with Fault Tolerant Data Aggregators.
This has not been observed for standalone Data Aggregator environments.
There are full, stale irep queues for the affected DCs in the DA AMQ configuration. These are visible via the DA AMQ web UI.
Note: in a Fault Tolerant DA configuration, direct the URL to the Active DA, not the Consul Proxy host.
Once logged in, select the Queues option.
Filter the available list of queues using the "Queue Name Filter" field and the term 'irep'.
The affected DCs will show full/stale DIP-poll.responses.irep-<DCName> and DIP-req.responses.irep-<DCName> queues. They will show values greater than 0 in "Number of Pending Messages", which indicates the queues aren't being processed and cleared properly.
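The same statistics can also be pulled from the command line if preferred. The following is a minimal sketch that reads the queue listing from the classic ActiveMQ admin console as XML; the port (8161), the admin/admin credentials, the host placeholder, and the expectation that each queue's stats line immediately follows its name line are all assumptions that must be adjusted to your Active DA's AMQ configuration.
# Hedged sketch: dump the queue statistics from the classic ActiveMQ admin
# console as XML and keep each irep queue entry plus the line after it
# (typically the stats line). Port, credentials, and host are assumptions.
curl -s -u admin:admin "http://<Active_DA_HostName>:8161/admin/xml/queues.jsp" | grep -i -A 1 irep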
This is a known issue open with engineering as defect ID DE478432.
The following are the steps engineering recommends to keep this problem from reaching the point of a down DC and data loss.
We look for any irep queue with a queueSize greater than 100. If any are seen, we purge them and restart AMQ on the affected DC to clear the issue.
1. Run this command to see whether any queueSize is over 100. If none are, nothing will be printed. Any queue with a queueSize greater than 100 will be shown.
/opt/IMDataAggregator/scripts/activemqstat | awk '{print $1" queueSize="$2" prod="$3" consumer="$4" enq="$5" deq="$6" fwd="$7" memoryUsage="$NF}' | grep irep | grep -v memoryUsage=0
Sample response that might be seen:
DIM.requests.irep-<Affected-DC_HostName>:4e2f58d8-4281-4a94-949b-ee4b44d344c7 queueSize=34537 prod=1 consumer=0 enq=418394 deq=383857 fwd=319908 memoryUsage=35
DIP-poll.responses.irep-<Affected-DC_HostName>:4e2f58d8-4281-4a94-949b-ee4b44d344c7 queueSize=571 prod=1 consumer=0 enq=50714 deq=50143 fwd=48927 memoryUsage=31
DIP-req.responses.irep-<Affected-DC_HostName>:4e2f58d8-4281-4a94-949b-ee4b44d344c7 queueSize=1704 prod=1 consumer=0 enq=14724 deq=13020 fwd=8196 memoryUsage=1
An increasing queueSize value above 100 indicates the issue is starting.
If desired, to list all existing irep queues regardless of queueSize or memoryUsage, try this:
/opt/IMDataAggregator/scripts/activemqstat | awk '{print $1" queueSize="$2" prod="$3" consumer="$4" enq="$5" deq="$6" fwd="$7" memoryUsage="$NF}' | grep irep
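If a filter keyed directly on the queueSize column is preferred over the memoryUsage filter above, the following is a minimal sketch; it assumes the queue name and queueSize sit in columns 1 and 2 of the activemqstat output, matching the awk mapping used in the commands above.
# Hedged sketch: print only irep queues whose queueSize (column 2) is above 100.
/opt/IMDataAggregator/scripts/activemqstat | awk '$1 ~ /irep/ && $2+0 > 100 {print $1" queueSize="$2}'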
2. When the queueSize starts showing values greater than 100, we first need to purge the queues with the script named PurgeOneQueue.
Example output would look like this:
[root@<DA_HostName> opt]# ./PurgeOneQueue DIP-poll.responses.irep-<Affected-DC_HostName>:a304141d-07c8-4e21-9f25-b8259599b330
DIP-poll.responses.irep-<Affected-DC_HostName>_a304141d-07c8-4e21-9f25-b8259599b330
INFO: Loading '/opt/IMDataAggregator/broker/apache-activemq-5.15.8//bin/env'
INFO: Using java '/opt/IMDataAggregator/jre/bin/java'
Java Runtime: AdoptOpenJDK 1.8.0_222 /opt/IMDataAggregator/jre
Heap sizes: current=62976k free=61992k max=932352k
JVM args: -Xms64M -Xmx1G -Djava.util.logging.config.file=logging.properties -Djava.security.auth.login.config=/opt/IMDataAggregator/broker/apache-activemq-5.15.8//conf/login.config -Dactivemq.classpath=/opt/IMDataAggregator/broker/apache-activemq-5.15.8//conf:/opt/IMDataAggregator/broker/apache-activemq-5.15.8//../lib/: -Dactivemq.home=/opt/IMDataAggregator/broker/apache-activemq-5.15.8/ -Dactivemq.base=/opt/IMDataAggregator/broker/apache-activemq-5.15.8/ -Dactivemq.conf=/opt/IMDataAggregator/broker/apache-activemq-5.15.8//conf -Dactivemq.data=/opt/IMDataAggregator/broker/apache-activemq-5.15.8//data
Extensions classpath:
[/opt/IMDataAggregator/broker/apache-activemq-5.15.8/lib,/opt/IMDataAggregator/broker/apache-activemq-5.15.8/lib/camel,/opt/IMDataAggregator/broker/apache-activemq-5.15.8/lib/optional,/opt/IMDataAggregator/broker/apache-activemq-5.15.8/lib/web,/opt/IMDataAggregator/broker/apache-activemq-5.15.8/lib/extra]
ACTIVEMQ_HOME: /opt/IMDataAggregator/broker/apache-activemq-5.15.8
ACTIVEMQ_BASE: /opt/IMDataAggregator/broker/apache-activemq-5.15.8
ACTIVEMQ_CONF: /opt/IMDataAggregator/broker/apache-activemq-5.15.8/conf
ACTIVEMQ_DATA: /opt/IMDataAggregator/broker/apache-activemq-5.15.8/data
INFO: Purging all messages in queue: DIP-poll.responses.irep-<Affected-DC_HostName>_a304141d-07c8-4e21-9f25-b8259599b330
[root@<DA_HostName> opt]#
We MUST purge every queue listed before moving on to step 3. If we don't, the queues won't reset and the problem state will remain for the DC. A hedged sketch for purging all affected queues in one pass follows below.
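Where several queues are over the threshold, purging them one at a time is tedious. The following is a minimal sketch that loops over every irep queue with a queueSize above 100 and purges each; it assumes PurgeOneQueue sits in the current working directory and accepts the full queue name as its only argument (as in the example above), and that columns 1 and 2 of the activemqstat output are the queue name and queueSize.
#!/bin/bash
# Hedged sketch: purge every irep queue whose queueSize exceeds 100.
# Assumes ./PurgeOneQueue takes the full queue name as its only argument
# and that activemqstat prints the queue name in column 1 and the
# queueSize in column 2, matching the step 1 command above.
STAT=/opt/IMDataAggregator/scripts/activemqstat
"$STAT" | awk '$1 ~ /irep/ && $2+0 > 100 {print $1}' | while read -r queue; do
    echo "Purging ${queue}"
    ./PurgeOneQueue "${queue}"
done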
3. Last, we shut down the activemq service on the affected DC. This allows the dcmd service to continue running and polling; dcmd will restart the activemq service when it sees it is down. Run this:
systemctl stop activemq
A minute or two later it should be running again. Check with:
systemctl status activemq
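If you would rather not keep checking manually, the following is a minimal sketch that polls systemd until dcmd has brought activemq back up, giving up after roughly five minutes.
# Hedged sketch: wait for dcmd to restart activemq, polling systemd every
# 10 seconds and giving up after about 5 minutes.
for i in $(seq 1 30); do
    if systemctl is-active --quiet activemq; then
        echo "activemq is running again"
        break
    fi
    sleep 10
done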
4. On the DA, check the queues again. We may see enq and deq values that are the same while the fwd value is lower; meanwhile queueSize should be 0 or close to it.
/opt/IMDataAggregator/scripts/activemqstat | awk '{print $1" queueSize="$2" prod="$3" consumer="$4" enq="$5" deq="$6" fwd="$7" memoryUsage="$NF}' | grep irep
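To confirm the queues stay down rather than backing up again, a short watch loop such as the sketch below can be used; it simply reprints the irep queueSize values every 30 seconds for about five minutes, reusing the same assumed column positions.
# Hedged sketch: re-check the irep queueSize values every 30 seconds for
# roughly 5 minutes to confirm they remain at or near 0 after the restart.
for i in $(seq 1 10); do
    date
    /opt/IMDataAggregator/scripts/activemqstat | awk '$1 ~ /irep/ {print $1" queueSize="$2}'
    sleep 30
done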
To obtain a copy of the PurgeOneQueue script, please engage the Performance Management Support team via a new Support case. Ensure this Knowledge Base article is referenced.