Data collectors (DC) stop collecting data after some time
The Data Collector shows as connected the whole time.
Restarting ActiveMQ on the impacted DCs seems to resolve the issue for an indeterminate amount of time; then data collection stops again and the data is seen queuing on the DC.
The impacted Data Collectors may be located in the same geographic area.
Release : 3.7, 20.2, 21.2
Component : IM Polling
Running netstat -anp on the impacted Data Collectors shows four connections to the Data Aggregator, as expected.
However, the following stands out for some of the connections:
tcp6 0 <NONZERONUMBERTHATDOESNOTDECREMENT> <IPofDC>:<PORT> <IPofDA>:<PORT> ESTABLISHED <PIDofActiveMQ>/java
The "Recv-Q" and "Send-Q" columns tell us how much data is in the queue for that socket, waiting to be read (Recv-Q) or sent (Send-Q).
Here the Send-Q column holds a <NONZERONUMBERTHATDOESNOTDECREMENT> value, which indicates a problem sending data from that port; the data is likely not reaching the DA.
This indicates network issues between the DC and DA.
The network is dropping or blocking some (but not all) of the traffic, so the ACKs do not get back to the DC. The DC waits, the prefetch bucket fills up, and ActiveMQ starts caching data because it cannot send it to the Data Aggregator.
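The Send-Q check described above can be scripted. The following is a minimal sketch, not a supported tool: it compares two netstat -anp samples taken some time apart and reports ESTABLISHED connections to the DA whose Send-Q is non-zero in both samples and has not decreased. The function names, IP addresses and port numbers are hypothetical, and the column layout is assumed to match the netstat line shown above.

```python
def parse_send_queues(netstat_output, da_ip):
    """Return {(local, remote): send_q} for ESTABLISHED TCP sockets to da_ip."""
    queues = {}
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns: Proto Recv-Q Send-Q Local Foreign State PID/Program
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        send_q, local, remote, state = fields[2], fields[3], fields[4], fields[5]
        if state != "ESTABLISHED" or not send_q.isdigit():
            continue
        remote_ip = remote.rsplit(":", 1)[0]
        # tcp6 sockets may show IPv4 peers in IPv4-mapped form (::ffff:a.b.c.d)
        if remote_ip.startswith("::ffff:"):
            remote_ip = remote_ip[len("::ffff:"):]
        if remote_ip == da_ip:
            queues[(local, remote)] = int(send_q)
    return queues


def stuck_connections(first_sample, second_sample, da_ip):
    """List connections whose Send-Q is non-zero in both samples and did not drop."""
    before = parse_send_queues(first_sample, da_ip)
    after = parse_send_queues(second_sample, da_ip)
    return sorted(
        conn
        for conn, send_q in after.items()
        if conn in before and before[conn] > 0 and send_q >= before[conn]
    )


if __name__ == "__main__":
    # Hypothetical samples; in practice these would be two captures of
    # "netstat -anp" output taken on the DC a minute or so apart.
    sample_1 = (
        "tcp6 0 5120 10.0.0.5:40001 10.0.0.9:61616 ESTABLISHED 1234/java\n"
        "tcp6 0 0 10.0.0.5:40002 10.0.0.9:61616 ESTABLISHED 1234/java\n"
    )
    sample_2 = (
        "tcp6 0 5120 10.0.0.5:40001 10.0.0.9:61616 ESTABLISHED 1234/java\n"
        "tcp6 0 0 10.0.0.5:40002 10.0.0.9:61616 ESTABLISHED 1234/java\n"
    )
    for local, remote in stuck_connections(sample_1, sample_2, "10.0.0.9"):
        print(f"Send-Q not draining: {local} -> {remote}")
```

A connection that this sketch reports across several consecutive samples matches the pattern described above. A Send-Q that is briefly non-zero and then drains is normal and is not reported.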