A monthly report showed that data collection had stopped for some device components.
For example, some devices being polled by Performance Management have no data for their interfaces, yet do have data for CPU/Memory. Why?
What caused polling to stop, and how do we identify and catch this so it does not occur again?
Performance Management, all supported releases
The root cause of this problem is clock drift between the servers (PC, DA, DC and/or DR). It shows up as the date/time stamps in the DC karaf.log (under the <IMDataCollector_HOME>/apache-karaf-2.4.3/data/log directory) drifting over time. For example:
2019-09-10 12:59:35,761 | INFO | r-Timer-thread-1 | KahaDBFileMonitor | .health.kahadb.KahaDBFileMonitor 98 | 199 - com.ca.im.data-collection-manager.health - 3.7.2.RELEASE-393 | | Number of Kaha DB files: 4
2019-09-10 13:01:35,762 | INFO | r-Timer-thread-1 | KahaDBFileMonitor | .health.kahadb.KahaDBFileMonitor 98 | 199 - com.ca.im.data-collection-manager.health - 3.7.2.RELEASE-393 | | Number of Kaha DB files: 4
2019-09-10 13:03:35,763 | INFO | r-Timer-thread-1 | KahaDBFileMonitor | .health.kahadb.KahaDBFileMonitor 98 | 199 - com.ca.im.data-collection-manager.health - 3.7.2.RELEASE-393 | | Number of Kaha DB files: 4
These entries should be exactly 2 minutes apart, down to the millisecond. In the example above the milliseconds are creeping upwards (761, 762, 763), and this steady drift eventually causes a loss of synchronization and hence dropped polls, as shown in the error below, which may be repeated many times in the DC karaf.log:
2019-09-10 13:23:32,861 | ERROR | l 60000-thread-1 | PollerScheduledExecutor | r.common.PollerScheduledExecutor 290 | 191 - com.ca.im.data-collection-manager.core.common - 3.7.2.RELEASE-393 | | Executor Scheduler B for poll interval 60000 for poll Cycle : 1568085360000 (Tue Sep 10 13:16:00 AEST 2019) dropped poll requests=1
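The drift is easier to catch programmatically than by eye. Below is a minimal sketch in Python that scans the karaf.log for the KahaDBFileMonitor entries shown above and flags any interval that is not exactly 2 minutes; the script name, default file name and the 0.5 ms tolerance are illustrative assumptions, not part of the product:

# drift_check.py - minimal sketch: flag drift in the 2-minute
# KahaDBFileMonitor heartbeat entries in the DC karaf.log.
import re
import sys
from datetime import datetime

LOG_PATH = sys.argv[1] if len(sys.argv) > 1 else "karaf.log"  # pass your log path
EXPECTED = 120.0        # seconds between KahaDBFileMonitor entries
TOLERANCE = 0.0005      # 0.5 ms; illustrative, any millisecond creep is suspect

# Timestamp at the start of each KahaDBFileMonitor line, e.g. 2019-09-10 12:59:35,761
stamp = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}).*KahaDBFileMonitor")

timestamps = []
with open(LOG_PATH) as log:
    for line in log:
        match = stamp.match(line)
        if match:
            timestamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f"))

for earlier, later in zip(timestamps, timestamps[1:]):
    delta = (later - earlier).total_seconds()
    if abs(delta - EXPECTED) > TOLERANCE:
        print(f"Drift: {earlier} -> {later} is {delta:.3f}s apart (expected {EXPECTED:.0f}s)")

Run it against a copy of the DC karaf.log; on a healthy DC it prints nothing.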
To prevent a recurrence, check that all four servers (PC, DA, DC and DR) are synchronized via NTP or chrony.
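As a quick spot check, the sketch below runs chronyc tracking on the local host and parses the reported offset from NTP time (this assumes chrony is in use; with classic ntpd, ntpq -p shows peer offsets instead). Run it on each of the four servers; the 1 ms warning threshold is an illustrative assumption, not a product requirement:

# clock_offset_check.py - minimal sketch: report this host's offset from
# NTP time using chrony's `chronyc tracking` output.
import subprocess

def chrony_offset_seconds():
    # `chronyc tracking` includes a line such as:
    #   System time     : 0.000020390 seconds fast of NTP time
    out = subprocess.run(["chronyc", "tracking"],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        if line.startswith("System time"):
            value = float(line.split(":", 1)[1].split()[0])
            return -value if "slow" in line else value
    raise RuntimeError("No 'System time' line in chronyc tracking output")

if __name__ == "__main__":
    offset = chrony_offset_seconds()
    print(f"Local clock offset from NTP: {offset:+.9f} seconds")
    if abs(offset) > 0.001:  # 1 ms tolerance, illustrative only
        print("WARNING: offset exceeds 1 ms; check NTP/chrony configuration")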
Once polls have been dropped, the only way to fix this type of problem at the moment is to restart the DC.
We are working on making the system more robust to clock drift so that it resumes polling on its own after dropping polls; we are looking at this for a future release.