Data Aggregator stops every hour



Article ID: 139079


Updated On:

Products

Network Observability CA Performance Management

Issue/Introduction

The Data Aggregator stops approximately every hour.

Environment

DX NetOps Performance Management, any version

Cause

The Data Aggregator stops hourly. The /opt/IMDataAggregator/apache-karaf-*/shutdown.log file shows errors similar to:

 

ERROR | Manager-thread-5 | YYYY-MM-DD HH:MM:SS,### | shutdown | ces.shutdown.ShutdownManagerImpl 131 | ces.shutdown.ShutdownManagerImpl 131 | ommon.core.services.impl | | Shutting down the data aggregator.It was detected that no data repository nodes were contactable. The uncontactable hosts are:[DRNODE1_HOSTNAME, DRNODE2_HOSTNAME,DRNODE3_HOSTNAME]
This indicates that none of the Data Repository hosts can be contacted, so the Data Aggregator shuts down.
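To confirm that the shutdowns follow a regular pattern, the shutdown.log entries can be extracted and compared. The following is a minimal sketch, assuming the log lines follow the pipe-separated layout shown above with the timestamp in the third field. The function names parse_shutdown_events and intervals_between are hypothetical helpers, not part of the product.

```python
import re
from datetime import datetime

# Text taken from the error message shown in shutdown.log.
SHUTDOWN_MARKER = "no data repository nodes were contactable"


def parse_shutdown_events(lines):
    """Return a list of (timestamp, hosts) for each matching shutdown line.

    Assumes pipe-separated fields with the timestamp in the third field,
    formatted as YYYY-MM-DD HH:MM:SS,mmm.
    """
    events = []
    for line in lines:
        if SHUTDOWN_MARKER not in line:
            continue
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 3:
            continue
        try:
            timestamp = datetime.strptime(fields[2], "%Y-%m-%d %H:%M:%S,%f")
        except ValueError:
            continue  # skip lines whose timestamp does not match the assumed format
        match = re.search(r"uncontactable hosts are:\s*\[([^\]]*)\]", line)
        hosts = []
        if match:
            hosts = [h.strip() for h in match.group(1).split(",") if h.strip()]
        events.append((timestamp, hosts))
    return events


def intervals_between(events):
    """Return the time gaps between consecutive shutdown events."""
    times = sorted(timestamp for timestamp, _ in events)
    return [later - earlier for earlier, later in zip(times, times[1:])]
```

Passing the lines of shutdown.log to parse_shutdown_events and the result to intervals_between gives the gaps between shutdowns. Each gap also includes the time taken to restart the Data Aggregator, so it will be somewhat longer than the time from startup to shutdown.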

There is traffic passing between the Data Aggregator and the Data Repository.

We see the error above at a regular interval, for example about 1 hour and 5 minutes after the Data Aggregator last started.


Check the operating system and any intermediate network devices, such as firewalls, for the currently configured TCP idle timeout. If a Data Repository node cannot be contacted for 5 minutes, it is marked as down.

When all nodes are uncontactable, the Data Aggregator shuts down.
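On a Linux host, one related set of OS-level values is the kernel TCP keepalive settings under /proc/sys/net/ipv4. The sketch below reads them for review. It rests on assumptions: these settings only affect sockets that have keepalive enabled, and this article does not state whether the Data Aggregator connections do. The function name read_keepalive_settings is hypothetical, and firewall timeouts must still be checked on the device itself.

```python
from pathlib import Path

# Standard Linux keepalive sysctl names under /proc/sys/net/ipv4.
KEEPALIVE_KEYS = (
    "tcp_keepalive_time",    # idle seconds before the first keepalive probe
    "tcp_keepalive_intvl",   # seconds between probes
    "tcp_keepalive_probes",  # unanswered probes before the connection is dropped
)


def read_keepalive_settings(base="/proc/sys/net/ipv4"):
    """Return the keepalive values as a dict; None where a value is unreadable."""
    settings = {}
    for key in KEEPALIVE_KEYS:
        try:
            settings[key] = int((Path(base) / key).read_text().strip())
        except (OSError, ValueError):
            settings[key] = None
    return settings
```

Calling read_keepalive_settings() on the Data Aggregator host returns the three values, which can then be compared with the idle timeout configured on any firewall in the path.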

The Data Aggregator uses a connection pool for its connections to the Data Repository.

 

The Data Aggregator sends a heartbeat query to each Data Repository node every 10 seconds. The query goes over a TCP connection from the Data Aggregator to port 5433 on the Data Repository.
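As a basic check that port 5433 is reachable from the Data Aggregator host, a new TCP connection can be opened to each Data Repository node. This is a minimal sketch and not the heartbeat itself: it only opens and closes a fresh connection, so a successful result does not show that the idle connections already in the pool are still usable. The function names can_connect and check_nodes are hypothetical.

```python
import socket


def can_connect(host, port=5433, timeout=5.0):
    """Return True if a new TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_nodes(hosts, port=5433, timeout=5.0):
    """Return a dict mapping each host to whether a connection succeeded."""
    return {host: can_connect(host, port, timeout) for host in hosts}
```

For example, check_nodes(["DRNODE1_HOSTNAME", "DRNODE2_HOSTNAME"]) run on the Data Aggregator host reports which nodes accept a new connection, using the actual node hostnames in place of the placeholders.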

 

Either at the OS level or on an intermediate network device such as a firewall, there is a TCP idle connection timeout that is dropping one of the TCP connections in the pool.

 

The heartbeat can use any connection in the pool, so if it uses a connection that the firewall has blocked or dropped, the heartbeat query times out.

 

Resolution

We have seen this most commonly when the TCP idle timeout is set to 3600 seconds on the firewall.

 

After increasing this to a larger value, the error is no longer seen.
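The figures in this article fit together as simple arithmetic: a 3600 second idle timeout plus the 5 minutes before a node is marked as down gives about 1 hour and 5 minutes. The sketch below shows this calculation. It assumes that a pooled connection goes idle at about the time the Data Aggregator starts, which is our reading of the observed interval and not a documented guarantee. The function name expected_shutdown_delay is hypothetical.

```python
from datetime import timedelta


def expected_shutdown_delay(idle_timeout_s, mark_down_s=300):
    """Return the idle timeout plus the mark-down period as a timedelta.

    mark_down_s defaults to 300 seconds, the 5 minutes after which an
    uncontactable Data Repository node is marked as down.
    """
    return timedelta(seconds=idle_timeout_s + mark_down_s)


print(expected_shutdown_delay(3600))  # prints 1:05:00
```

If the shutdown interval seen in shutdown.log is about 5 minutes longer than an idle timeout configured on a firewall, that timeout is a likely candidate to increase.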