Data Aggregator stops every hour
Dx NetOps Performance Management Any Version
The Data Aggregator stops hourly. In /opt/IMDataAggregator/apache-karaf-*/shutdown.log we see errors similar to:
ERROR | Manager-thread-5 | YYYY-MM-DD HH:MM:SS,### | shutdown | ces.shutdown.ShutdownManagerImpl 131 | ces.shutdown.ShutdownManagerImpl 131 | ommon.core.services.impl | | Shutting down the data aggregator.It was detected that no data repository nodes were contactable. The uncontactable hosts are:[DRNODE1_HOSTNAME, DRNODE2_HOSTNAME,DRNODE3_HOSTNAME]
This indicates that no Data Repository hosts can be contacted, so the Data Aggregator shuts down.
There is traffic passing between the Data Aggregator and Data Repository.
We see the error above at a regular interval, for example about 1 hour and 5 minutes after the Data Aggregator last started.
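To confirm that the shutdowns really are evenly spaced, the timestamps of the matching entries in shutdown.log can be compared. The following is a minimal sketch, assuming the log lines follow the layout of the sample above. The function names and the regular expression are illustrative and are not part of the product.

```python
import glob
import re
from datetime import datetime

# Matches the timestamp field in lines shaped like the sample above,
# e.g. "| 2024-01-01 10:05:00,123 |". The exact layout is an assumption.
TIMESTAMP = re.compile(r"\|\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\s*\|")
MARKER = "no data repository nodes were contactable"


def shutdown_times(lines):
    """Return the timestamps of the shutdown events found in the log lines."""
    times = []
    for line in lines:
        if MARKER in line:
            match = TIMESTAMP.search(line)
            if match:
                times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    return times


def intervals_in_minutes(times):
    """Return the gaps between consecutive shutdowns, in minutes."""
    return [(later - earlier).total_seconds() / 60
            for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    # Path taken from this article; the wildcard is expanded with glob.
    for path in glob.glob("/opt/IMDataAggregator/apache-karaf-*/shutdown.log"):
        with open(path, errors="replace") as handle:
            print(path, intervals_in_minutes(shutdown_times(handle)))
```

Gaps that cluster around the same value (for example 65 minutes) point to a fixed timer rather than a random network fault.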
Check the operating system and any intermediate network devices, such as firewalls, for the currently configured TCP idle timeout. If a Data Repository node cannot be contacted for 5 minutes, it is marked as down.
When all nodes are uncontactable the Data Aggregator will shut down.
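On a Linux host, one OS-level setting that relates to idle TCP connections is the kernel keepalive timer. The sketch below reads the standard keepalive sysctls from /proc and compares the first-probe delay with a given idle timeout. It is an illustration only: whether the connections between the Data Aggregator and the Data Repository enable keepalive (SO_KEEPALIVE) is an assumption not confirmed by this article, and the function names are hypothetical.

```python
from pathlib import Path


def read_keepalive_settings(base="/proc/sys/net/ipv4"):
    """Read the Linux TCP keepalive sysctls as integers."""
    names = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")
    return {name: int((Path(base) / name).read_text().strip()) for name in names}


def keepalive_beats_idle_timeout(settings, idle_timeout_seconds):
    """True if the first keepalive probe is sent before the idle timeout expires."""
    return settings["tcp_keepalive_time"] < idle_timeout_seconds


if __name__ == "__main__":
    try:
        settings = read_keepalive_settings()
    except OSError:
        print("Keepalive sysctls not found (not a Linux host?)")
    else:
        print(settings)
        # 3600 seconds is the firewall idle timeout mentioned in this article.
        print(keepalive_beats_idle_timeout(settings, 3600))
```

The Linux default for tcp_keepalive_time is 7200 seconds, which is longer than a 3600-second firewall idle timeout, so default keepalive settings alone would not keep an idle connection open.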
The Data Aggregator uses a connection pool for its connections to the Data Repository.
The Data Aggregator sends a heartbeat query to each Data Repository node every 10 seconds. This goes over a TCP connection from the Data Aggregator to port 5433 on the Data Repository.
Either at the OS level or on an intermediate network device such as a firewall, there is a TCP idle connection timeout that is dropping one of the TCP connections in the pool.
The heartbeat can use any connection in the pool, so if it picks a connection that has been blocked or dropped by the firewall, the heartbeat times out.
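This also explains why a basic reachability test can succeed while the heartbeat fails: a new connection is not affected by an idle timeout that has already expired on an older pooled connection. The sketch below opens a fresh TCP connection to port 5433 to illustrate such a test. It is not the product's heartbeat query, and the function name and host name are placeholders.

```python
import socket


def can_connect(host, port=5433, timeout_seconds=5.0):
    """Return True if a new TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder host name; replace with a real Data Repository node.
    print(can_connect("DRNODE1_HOSTNAME"))
```

A result of True here only shows that new connections are allowed. It does not rule out stale connections in the pool.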
We have seen this most commonly when the TCP idle timeout is set to 3600 seconds on the firewall.
After increasing this to a larger value, the error is no longer seen.
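The figures in this article appear consistent with each other: a 3600-second idle timeout plus the 5 minutes before a node is marked as down gives roughly the 1 hour and 5 minutes observed after startup. The short calculation below makes that explicit. It assumes a pooled connection goes idle at startup and that the 5-minute window begins when the idle timeout expires. The function name is illustrative.

```python
from datetime import timedelta


def expected_shutdown_offset(idle_timeout_seconds, down_after_seconds=300):
    """Estimate the time from startup to shutdown under the stated assumptions."""
    return timedelta(seconds=idle_timeout_seconds + down_after_seconds)


print(expected_shutdown_offset(3600))  # 1:05:00
```

If the observed interval differs, subtracting 5 minutes from it may give a hint about the idle timeout value to look for on the firewall or OS.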