An outage is observed on the Data Collectors (DCs): the services are up and running, but the Data Collector processes are not working correctly and polling is impacted. Failover to the standby Data Collector server does not occur for one or more DCs.
DX NetOps CAPM Release: 20.2 or later
This may be caused by an out-of-memory condition affecting the Java-based processes, in particular ActiveMQ (the broker), which manages communication between the processes on the DC and the Data Aggregator (DA). For example, check the ActiveMQ log under:
<IMDataCollector_Install_DIR>/broker/apache-activemq-<VERSION>/data
You may find the following errors:
2023-04-03 14:15:13,010 | ERROR | Checkpoint failed | org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal Checkpoint Worker
java.lang.OutOfMemoryError: Java heap space
at org.apache.activemq.store.kahadb.disk.util.DataByteArrayOutputStream.<init>(DataByteArrayOutputStream.java:47)
at org.apache.activemq.store.kahadb.disk.page.Transaction$1.<init>(Transaction.java:283)
at org.apache.activemq.store.kahadb.disk.page.Transaction.openOutputStream(Transaction.java:283)
at org.apache.activemq.store.kahadb.disk.page.Transaction.store(Transaction.java:260)
at org.apache.activemq.store.kahadb.MessageDatabase.checkpointUpdate(MessageDatabase.java:1826)
at org.apache.activemq.store.kahadb.MessageDatabase$18.execute(MessageDatabase.java:1792)
at org.apache.activemq.store.kahadb.MessageDatabase$18.execute(MessageDatabase.java:1789)
at org.apache.activemq.store.kahadb.disk.page.Transaction.execute(Transaction.java:810)
at org.apache.activemq.store.kahadb.MessageDatabase.checkpointUpdate(MessageDatabase.java:1789)
at org.apache.activemq.store.kahadb.MessageDatabase.checkpointCleanup(MessageDatabase.java:1104)
at org.apache.activemq.store.kahadb.MessageDatabase$CheckpointRunner.run(MessageDatabase.java:445)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.base/java.util.concurrent.FutureTask.runAndReset(Unknown Source)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
About five minutes later, the communications stack starts to break down and connectivity is lost:
2023-04-03 14:20:26,714 | WARN | Transport Connection to: tcp://127.0.0.1:39604 failed: Cannot send, channel has already failed: tcp://127.0.0.1:39604 | org.apache.activemq.broker.TransportConnection.Transport | Async Exception Handler
This in turn results in loss of connectivity to the DA:
2023-04-03 16:56:25,246 | ERROR | Failed to connect to [tcp://<DA_HOST>:61618, tcp://nopsda01a:61618] after: 3 attempt(s) | org.apache.activemq.transport.failover.FailoverTransport | ActiveMQ Task-3
2023-04-03 16:56:25,249 | WARN | Network connection between vm://dc_broker_xxxx_xxxx_xxxx_xxxx#20 and unconnected shutdown due to a remote error: java.net.SocketException: Broken pipe (Write failed) | org.apache.activemq.network.DemandForwardingBridgeSupport | ActiveMQ Task-3
2023-04-03 16:56:25,246 | ERROR | Failed to connect to [tcp://<DA_HOST>:61616, tcp://<DA_HOST>:61616] after: 3 attempt(s) | org.apache.activemq.transport.failover.FailoverTransport | ActiveMQ Task-3
2023-04-03 16:56:25,250 | WARN | Network connection between vm://dc_broker_xxxx_xxxx_xxxx_xxxx#60 and unconnected shutdown due to a remote error: org.apache.activemq.transport.InactivityIOException: Cannot send, channel has already failed: tcp://<DA_HOST>:61616 | org.apache.activemq.network.DemandForwardingBridgeSupport | ActiveMQ Task-3
2023-04-03 16:56:25,255 | WARN | Caught an exception processing local command | org.apache.activemq.network.DemandForwardingBridgeSupport | ActiveMQ BrokerService[dc_broker_xxxx_xxxx_xxxx_xxxx] Task-126633
org.apache.activemq.transport.InactivityIOException: Cannot send, channel has already failed: tcp://<DA_HOST>:61616
at org.apache.activemq.transport.AbstractInactivityMonitor.doOnewaySend(AbstractInactivityMonitor.java:328)
at org.apache.activemq.transport.AbstractInactivityMonitor.oneway(AbstractInactivityMonitor.java:317)
at org.apache.activemq.transport.WireFormatNegotiator.sendWireFormat(WireFormatNegotiator.java:181)
at org.apache.activemq.transport.WireFormatNegotiator.sendWireFormat(WireFormatNegotiator.java:84)
at org.apache.activemq.transport.WireFormatNegotiator.start(WireFormatNegotiator.java:74)
at org.apache.activemq.transport.failover.FailoverTransport.doReconnect(FailoverTransport.java:1022)
at org.apache.activemq.transport.failover.FailoverTransport$2.iterate(FailoverTransport.java:150)
at org.apache.activemq.thread.PooledTaskRunner.runTask(PooledTaskRunner.java:133)
at org.apache.activemq.thread.PooledTaskRunner$1.run(PooledTaskRunner.java:48)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
2023-04-03 16:57:23,018 | INFO | Establishing network connection from vm://dc_broker_xxxx_xxxx_xxxx_xxxx to failover:(tcp://<DA_HOST>:61616,tcp://<DA_HOST>:61616)?maxReconnectAttempts=3 | org.apache.activemq.network.DiscoveryNetworkConnector | ActiveMQ Task-2
2023-04-03 16:57:23,018 | INFO | Establishing network connection from vm://dc_broker_xxxx_xxxx_xxxx_xxxx to failover:(tcp://<DA_HOST>:61618,tcp://<DA_HOST>:61618)?maxReconnectAttempts=3 | org.apache.activemq.network.DiscoveryNetworkConnector | ActiveMQ Task-2
2023-04-03 16:57:23,017 | ERROR | Failed to connect to [tcp://<DA_HOST>:61622, tcp://<DA_HOST>:61622] after: 3 attempt(s) | org.apache.activemq.transport.failover.FailoverTransport | ActiveMQ Task-7
2023-04-03 16:57:23,010 | INFO | Network Could not shutdown in a timely manner | org.apache.activemq.network.DemandForwardingBridgeSupport | ActiveMQ BrokerService[dc_broker_ed8de007-d754-4607-94be-218620f66182] Task-126637
2023-04-03 16:57:23,255 | WARN | Network connection between vm://dc_broker_xxxx_xxxx_xxxx_xxxx#12 and unconnected shutdown due to a remote error: org.apache.activemq.transport.InactivityIOException: Cannot send, channel has already failed: tcp://<DA_HOST>:61622 | org.apache.activemq.network.DemandForwardingBridgeSupport | ActiveMQ Task-7
As a result, ActiveMQ on the DCs can no longer communicate with the DA.
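To confirm that a given outage follows this sequence, it can help to pull the first occurrence of each symptom out of the broker log and compare the timestamps. The following Python snippet is a minimal sketch, not a supported tool. It assumes the pipe-delimited entry layout shown in the excerpts above (timestamp | level | message | ...). The marker strings are taken from those excerpts. The names `MARKERS`, `first_occurrences` and `scan_file` are hypothetical, and the log file path must be supplied by the caller.

```python
import re
from datetime import datetime

# Start of a log entry as seen in the excerpts above, e.g.
# "2023-04-03 14:15:13,010 | ERROR | Checkpoint failed | ..."
ENTRY_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s*\|\s*(?P<level>[A-Z]+)\s*\|"
)

# Symptom markers copied from the log excerpts in this article (hypothetical keys).
MARKERS = {
    "oom": "java.lang.OutOfMemoryError",
    "channel_failed": "channel has already failed",
    "connect_failed": "Failed to connect to",
}


def first_occurrences(lines):
    """Return the timestamp of the first entry containing each marker.

    Stack-trace lines carry no timestamp, so they are attributed to the
    most recent timestamped entry. Missing symptoms map to None.
    """
    found = {key: None for key in MARKERS}
    current_ts = None
    for line in lines:
        match = ENTRY_RE.match(line)
        if match:
            current_ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
        for key, marker in MARKERS.items():
            if found[key] is None and marker in line:
                found[key] = current_ts
    return found


def scan_file(path):
    """Scan a log file at a caller-supplied path (the path is not assumed here)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return first_occurrences(handle)
```

If the `oom` timestamp precedes `channel_failed` and `connect_failed`, the log is consistent with the heap exhaustion described above being the trigger rather than a network fault between the DC and the DA.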
As a temporary fix, restart both the activemq and dcmd processes. The restart releases the exhausted heap memory, which brings everything back online.
However, increasing the memory available to the DC is the only long-term solution. Use the online sizing tool for guidance: