Sometimes the metric charts or reports for some items or devices in NetOps Portal display gaps, which indicate that polling or responses were interrupted. The following are the main causes of missing data in NetOps Portal graphs.
Version: Any
To determine what the cause is, check the following criteria:
Cause A: SNMP timeout (the device is not responding to polling, or is responding too slowly)
If the following error appears in the "Poll Errors by IP" log in DcDebug, then it is Cause A.
POLLING_ERROR: errors for cycle 1491007500000: [REQUEST_TIMED_OUT]
By default, the maximum response time is set to 9 seconds. Please reference the "Modify the Timeout and Retries Parameters" section of the DX NetOps Performance Management documentation for more information.
Moreover, frequent SNMP timeouts cause CA Data Aggregator to generate a Polling Stopped event. Please reference the "Polling Stopped Event Message" section of the DX NetOps Performance Management documentation for more information.
If these errors or events are the cause, then look to increase the Timeout and/or Retries parameters of the CA Performance Center SNMP Profile used for these items or devices.
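To match a POLLING_ERROR entry to a gap in a chart, it can help to convert the cycle value into a readable time. The following is a minimal sketch, not part of the product. It assumes that the cycle value is epoch time in milliseconds (the sample value falls exactly on a 5-minute boundary, which supports this assumption but does not confirm it). The function name parse_polling_error is hypothetical.

```python
import re
from datetime import datetime, timezone

# Pattern for the POLLING_ERROR line quoted in this article.
POLLING_ERROR = re.compile(
    r"POLLING_ERROR: errors for cycle (\d+): \[([A-Z_, ]+)\]"
)


def parse_polling_error(line):
    """Return (cycle time as UTC datetime, list of error names), or None."""
    match = POLLING_ERROR.search(line)
    if match is None:
        return None
    # Assumption: the cycle value is epoch time in milliseconds.
    cycle_ms = int(match.group(1))
    cycle_time = datetime.fromtimestamp(cycle_ms / 1000, tz=timezone.utc)
    errors = [name.strip() for name in match.group(2).split(",")]
    return cycle_time, errors


sample = "POLLING_ERROR: errors for cycle 1491007500000: [REQUEST_TIMED_OUT]"
cycle_time, errors = parse_polling_error(sample)
print(cycle_time.isoformat(), errors)
# prints: 2017-04-01T00:45:00+00:00 ['REQUEST_TIMED_OUT']
```

If the converted time lines up with the start of a gap in the chart for the same device, that supports Cause A for that gap.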
Cause B: The device supports only 32-bit SNMP counters
If the following WARN message appears in the Data Collector karaf log, then it is Cause B:
com.ca.im.data-collection-manager.core.interfaces - | | Counter value rolled over, dropping response: previous=4285888934 / current=4049163 for IP IP address, OID polling OID, item ID id, in poll group gid
Further counter rollover messages for this IP will be suppressed unless DEBUG is enabled or the DC is restarted.
When an SNMP counter rollover occurs within one polling cycle, the polling data for that cycle will be missing, because the counter that the metric is based on is no longer valid, as described in the "Configure Counter Behavior" section of the DX NetOps Performance Management documentation.
This may happen when the monitored device supports only 32-bit counters or only SNMPv1. The following error may appear in the "Discover Logging by IP" log in DcDebug when the 64-bit counter MIB objects are requested from such devices.
Finished on demand read. Response = SnmpResponse [error=SNMP_PARTIAL_FAILURE, errorIndex=-1, queriedIP=Device IP]
? SnmpResponseVariable [oid=Polling OID, type=NULL, value={}, isDelta=false, isList=true, error=NO_SUCH_NAME, isDynamicIndex=false, indexList=[]]
A possible workaround is to shorten the poll interval for the device from 5 minutes to 1 minute, so that fewer and shorter polling cycles contain a rollover. Please reference the "Poll Critical Interfaces Faster than Non-critical Interfaces" section of the DX NetOps Performance Management documentation for more information.
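The effect of the shorter poll interval can be estimated with some simple arithmetic. The following is a rough sketch, not a description of how the Data Collector works internally. It assumes a 32-bit octet counter (for example, an interface octet counter) and a constant traffic rate, which real traffic never has. The function names are hypothetical.

```python
# A 32-bit SNMP counter wraps back to 0 after reaching 2**32 - 1.
COUNTER32_MODULUS = 2 ** 32


def seconds_to_wrap(bits_per_second):
    """Seconds for a 32-bit octet counter to wrap at a constant rate."""
    octets_per_second = bits_per_second / 8
    return COUNTER32_MODULUS / octets_per_second


def fraction_of_cycles_with_rollover(bits_per_second, poll_interval_seconds):
    """Approximate share of poll cycles that contain at least one rollover.

    Assumes evenly spaced rollovers (constant traffic rate).
    """
    ratio = poll_interval_seconds / seconds_to_wrap(bits_per_second)
    return min(1.0, ratio)


rate = 50_000_000  # 50 Mbit/s, an arbitrary example rate
for interval in (300, 60):
    share = fraction_of_cycles_with_rollover(rate, interval)
    print(f"{interval:>3}s poll interval: "
          f"wrap every {seconds_to_wrap(rate):.0f}s, "
          f"about {share:.0%} of cycles contain a rollover")
```

Under these assumptions, a 1-minute poll interval loses a smaller share of cycles, and each lost cycle covers 1 minute of data instead of 5. At sustained rates high enough to wrap the counter within 1 minute, the shorter interval no longer helps.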
Cause C: CA Data Aggregator is slow under high load
If the following WARN message appears in the Data Aggregator karaf log, then it is Cause C.
WARN | tory-thread-id | date time | onitoringProcessLimitManagerImpl | onitoringProcessLimitManagerImpl 98 | .ca.im.aggregator.loader | | Threshold Monitoring processing took too long. The system will shut that feature down in 15 minutes if the threshold monitoring continues to exceed capcacity
At the same time, the following event appears in the CA Performance Center Event List:
The Threshold Monitoring Engine has transitioned to a degraded state.
You will also see a peak in the following chart at the time of the event.
If the above is the case, then look to increase the PercentOfPollCycleThreshold value. Please reference the "Threshold Monitoring and Threshold Limiter Behavior" section of the DX NetOps Performance Management documentation for more information.
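When several log files have been collected, a quick scan for the three messages quoted in this article can show which cause to investigate first. The following is a minimal sketch, not a supported tool. The marker strings are taken from the log lines above; the function names and the way the lines are gathered are hypothetical.

```python
# Marker substrings taken from the log messages quoted in this article.
CAUSE_MARKERS = {
    "A": "REQUEST_TIMED_OUT",
    "B": "Counter value rolled over",
    "C": "Threshold Monitoring processing took too long",
}


def classify_line(line):
    """Return "A", "B", or "C" if the line matches a known cause, else None."""
    for cause, marker in CAUSE_MARKERS.items():
        if marker in line:
            return cause
    return None


def count_causes(lines):
    """Count how many of the given log lines match each cause."""
    counts = {cause: 0 for cause in CAUSE_MARKERS}
    for line in lines:
        cause = classify_line(line)
        if cause is not None:
            counts[cause] += 1
    return counts


sample_lines = [
    "POLLING_ERROR: errors for cycle 1491007500000: [REQUEST_TIMED_OUT]",
    "Counter value rolled over, dropping response: previous=4285888934 / current=4049163",
    "Threshold Monitoring processing took too long.",
    "an unrelated line",
]
print(count_causes(sample_lines))
# prints: {'A': 1, 'B': 1, 'C': 1}
```

Note that the count for Cause B understates the number of rollovers, because further rollover messages for the same IP are suppressed unless DEBUG is enabled or the DC is restarted.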
DcDebug is the built-in discovery and polling debug tool. You can access and use it as follows: