Data collection stopped for some devices in DX NetOps Performance Management

Article ID: 137171

Products

CA Infrastructure Management, CA Performance Management - Usage and Administration, DX NetOps

Issue/Introduction

A monthly report showed that data collection had stopped for some device components.

For example, some devices polled in Performance Management have no data for their interfaces, even though CPU and memory data is still being collected.

What caused collection to stop, and how can it be identified and caught so that it does not happen again?

Environment

Performance Management, all supported releases

 

Cause

The root cause of this problem is clock drift between the servers: Performance Center (PC), Data Aggregator (DA), Data Collector (DC) and/or Data Repository (DR). The drift is visible in the timestamps in the DC karaf.log (in the <IMDataCollector_HOME>/apache-karaf-2.4.3/data/log directory), which shift over time. For example:

 

2019-09-10 12:59:35,761 | INFO | r-Timer-thread-1 | KahaDBFileMonitor | .health.kahadb.KahaDBFileMonitor 98 | 199 - com.ca.im.data-collection-manager.health - 3.7.2.RELEASE-393 | | Number of Kaha DB files: 4

2019-09-10 13:01:35,762 | INFO | r-Timer-thread-1 | KahaDBFileMonitor | .health.kahadb.KahaDBFileMonitor 98 | 199 - com.ca.im.data-collection-manager.health - 3.7.2.RELEASE-393 | | Number of Kaha DB files: 4

2019-09-10 13:03:35,763 | INFO | r-Timer-thread-1 | KahaDBFileMonitor | .health.kahadb.KahaDBFileMonitor 98 | 199 - com.ca.im.data-collection-manager.health - 3.7.2.RELEASE-393 | | Number of Kaha DB files: 4

 

These KahaDBFileMonitor entries should be exactly 2 minutes apart, down to the millisecond. In the excerpt above, the millisecond value instead creeps upward (761, 762, 763). Over time this drift causes a loss of synchronization and polls are dropped, as shown by the error below, which may be repeated many times in the DC karaf.log:

 

2019-09-10 13:23:32,861 | ERROR | l 60000-thread-1 | PollerScheduledExecutor | r.common.PollerScheduledExecutor 290 | 191 - com.ca.im.data-collection-manager.core.common - 3.7.2.RELEASE-393 | | Executor Scheduler B for poll interval 60000 for poll Cycle : 1568085360000 (Tue Sep 10 13:16:00 AEST 2019)  dropped poll requests=1
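The spacing check on the KahaDBFileMonitor entries described above can be automated. The following is a minimal Python sketch, not a supported tool: it assumes the entries keep the timestamp layout shown in the excerpt and that a 2-minute interval is the expected spacing. The function name and the abbreviated sample lines are illustrative only.

```python
import re
from datetime import datetime, timedelta

# Assumption: the health entry is expected every 2 minutes, as described above.
EXPECTED_INTERVAL = timedelta(minutes=2)

# Captures the leading timestamp of a KahaDBFileMonitor health entry.
ENTRY = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) \|"
    r".*KahaDBFileMonitor.*Number of Kaha DB files"
)


def interval_drift_ms(lines):
    """For each pair of consecutive entries, return the deviation from 2 minutes in ms."""
    stamps = [
        datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
        for m in map(ENTRY.match, lines)
        if m
    ]
    return [
        round(((later - earlier) - EXPECTED_INTERVAL).total_seconds() * 1000)
        for earlier, later in zip(stamps, stamps[1:])
    ]


# Timestamps taken from the excerpt above; the middle fields are abbreviated here.
SAMPLE = [
    "2019-09-10 12:59:35,761 | INFO | r-Timer-thread-1 | KahaDBFileMonitor | Number of Kaha DB files: 4",
    "2019-09-10 13:01:35,762 | INFO | r-Timer-thread-1 | KahaDBFileMonitor | Number of Kaha DB files: 4",
    "2019-09-10 13:03:35,763 | INFO | r-Timer-thread-1 | KahaDBFileMonitor | Number of Kaha DB files: 4",
]

drifts = interval_drift_ms(SAMPLE)
print(drifts, "cumulative ms:", sum(drifts))  # [1, 1] cumulative ms: 2
```

A steadily non-zero deviation in the same direction suggests drift. A single very large value more likely means that entries are missing, for example because of a restart, so it should be read separately.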

 

Resolution

Verify that all four servers (PC, DA, DC and DR) are time-synchronized using NTP or chrony.
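On hosts that use chrony, the current offset is reported on the "System time" line of the `chronyc tracking` output. The sketch below parses that line so the same check can be repeated on each server. It is only an illustration: the function names, the sample output and the 0.1 second threshold are assumptions, not values recommended by the product.

```python
import re
import subprocess

# Hypothetical alert threshold; choose a value that suits the environment.
MAX_OFFSET_SECONDS = 0.1

# "System time" line as printed by `chronyc tracking`.
SYSTEM_TIME = re.compile(
    r"^System time\s*:\s*([0-9.]+) seconds (fast|slow) of NTP time",
    re.MULTILINE,
)


def system_offset_seconds(tracking_output):
    """Return the signed offset from NTP time: positive is fast, negative is slow."""
    m = SYSTEM_TIME.search(tracking_output)
    if m is None:
        raise ValueError("no 'System time' line found in chronyc output")
    value = float(m.group(1))
    return value if m.group(2) == "fast" else -value


def read_tracking():
    """Run `chronyc tracking` on the local host (requires chrony to be installed)."""
    result = subprocess.run(
        ["chronyc", "tracking"], capture_output=True, text=True, check=True
    )
    return result.stdout


# Illustrative output only; real values differ per host.
SAMPLE_OUTPUT = """Reference ID    : CB00710F (foo.example.net)
Stratum         : 3
System time     : 0.000006523 seconds slow of NTP time
Last offset     : -0.000006747 seconds
Leap status     : Normal
"""

offset = system_offset_seconds(SAMPLE_OUTPUT)
print("offset:", offset, "within threshold:", abs(offset) <= MAX_OFFSET_SECONDS)
```

On a live server, `system_offset_seconds(read_tracking())` would be used instead of the sample text. Hosts that run ntpd rather than chrony need a different command and parser.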

Currently, the only way to restore polling once this has occurred is to restart the DC.
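To notice early that a restart is needed, the DC karaf.log can be scanned for the dropped-poll error shown in the Cause section. The following Python sketch totals dropped poll requests per poll interval. It assumes the message wording stays as in that example, and the function name is illustrative.

```python
import re
from collections import Counter

# Matches the PollerScheduledExecutor error and captures the poll interval (ms)
# and the number of dropped poll requests.
DROPPED = re.compile(
    r"PollerScheduledExecutor.*poll interval (\d+) for poll Cycle : \d+"
    r".*dropped poll requests=(\d+)"
)


def dropped_polls_by_interval(lines):
    """Return {poll interval in ms: total dropped poll requests} for the given log lines."""
    totals = Counter()
    for line in lines:
        m = DROPPED.search(line)
        if m:
            totals[int(m.group(1))] += int(m.group(2))
    return dict(totals)


# The error line from the Cause section, unchanged.
SAMPLE_ERROR = (
    "2019-09-10 13:23:32,861 | ERROR | l 60000-thread-1 | PollerScheduledExecutor | "
    "r.common.PollerScheduledExecutor 290 | 191 - "
    "com.ca.im.data-collection-manager.core.common - 3.7.2.RELEASE-393 | | "
    "Executor Scheduler B for poll interval 60000 for poll Cycle : 1568085360000 "
    "(Tue Sep 10 13:16:00 AEST 2019)  dropped poll requests=1"
)

print(dropped_polls_by_interval([SAMPLE_ERROR]))  # {60000: 1}
```

Run against the whole log file (for example `dropped_polls_by_interval(open(path))`, where `path` points to the karaf.log), a non-empty result indicates that polls have been dropped and that the clocks and the DC should be checked.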

Additional Information

We are working on making the system more robust to time drift so that it resumes polling on its own after dropping polls. This improvement is being considered for a future release.