Best practices for resolving missing plot data in NetOps Portal graphs and charts

Products

CA Infrastructure Management, CA Performance Management, Network Observability

Issue/Introduction

Sometimes the metric charts or reports for some items/devices in NetOps Portal display gaps, indicating interrupted polling or missing responses. The following are the main causes of missing data in NetOps Portal graphs:

A: SNMP timeout (the device is not responding to polling, or is responding late).
B: The device supports only 32-bit SNMP counters.
C: CA Data Aggregator is slow under high load.

Environment

Version: Any

Cause

To determine the cause, check the following:

Whether any errors were logged, at the time the issue occurred, in the following Data Aggregator and Data Collector logs:

/opt/IMDataAggregator/apache-karaf-*/data/log/*
/opt/IMDataCollector/apache-karaf-*/data/log/*
The "Number of Event Rules Evaluated" and "Percentage of Poll Cycle of Complete Event Processing" charts in the Data Aggregator Pages, and the Data Aggregator health charts on the NetOps Portal System Health tab.
Run DcDebug (see Additional Information) for the problem device.
Confirm whether the monitored device was physically changed.

Whether the device Status is not set to "Management Lost" in the NetOps Portal Administration menu > Monitored Devices > select the device > Details tab.
Whether the SNMP Poll Rate is not set to "true-null" in the NetOps Portal Administration menu > Monitored Devices > select the device > Polled Metric Families tab > select the Interface Metric Family line > see the Components list in the pane below.
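The log check above can be partly automated. The following is a minimal sketch, not a supported tool: it scans the two log locations listed above for lines containing WARN or ERROR. The function names and the optional `needle` filter (for example a device IP address) are hypothetical, and the sketch assumes the level keywords appear literally in each log line.

```python
import glob
import re

# Log locations taken from the article; the wildcards are expanded by glob.
LOG_GLOBS = [
    "/opt/IMDataAggregator/apache-karaf-*/data/log/*",
    "/opt/IMDataCollector/apache-karaf-*/data/log/*",
]

# Assumption: WARN and ERROR appear as whole words in the log line.
LEVEL_RE = re.compile(r"\b(WARN|ERROR)\b")


def suspicious_lines(lines, needle=""):
    """Return WARN/ERROR lines that also contain `needle` (any substring)."""
    return [ln.rstrip("\n") for ln in lines if LEVEL_RE.search(ln) and needle in ln]


def scan_logs(needle=""):
    """Map each readable log file to its matching WARN/ERROR lines."""
    hits = {}
    for pattern in LOG_GLOBS:
        for path in sorted(glob.glob(pattern)):
            try:
                with open(path, errors="replace") as fh:
                    found = suspicious_lines(fh, needle)
            except OSError:
                continue  # skip directories and unreadable files
            if found:
                hits[path] = found
    return hits
```

For example, `scan_logs("10.0.0.1")` would return only the WARN/ERROR lines that mention that (hypothetical) device address.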

Resolution

Cause A: SNMP timeout (the device is not responding to polling, or is responding late)

If the following error appears in the "Poll Errors by IP" log in DcDebug, the issue is Cause A.

POLLING_ERROR: errors for cycle 1491007500000: [REQUEST_TIMED_OUT]

By default, the maximum response time is set to 9 seconds. Refer to the "Modify the Timeout and Retries Parameters" section of the DX NetOps Performance Management documentation for more information.

Moreover, frequent SNMP timeouts cause CA Data Aggregator to generate a polling stopped event. Refer to the "Polling Stopped Event Message" section of the DX NetOps Performance Management documentation for more information.

If these errors are the cause, increase the Timeout and/or Retries parameters of the CA Performance Center SNMP Profile used for the affected items/devices.
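To line up a POLLING_ERROR entry with a gap in a chart, it helps to turn the cycle number into a readable time. The sketch below assumes the cycle value is a Unix timestamp in milliseconds (it has that shape, but the article does not say so) and that multiple error codes, if present, are comma-separated inside the brackets. The function name is hypothetical.

```python
import re
from datetime import datetime, timezone

# Pattern modeled on the sample line shown in this article.
POLL_ERR_RE = re.compile(r"POLLING_ERROR: errors for cycle (\d+): \[([A-Z_, ]+)\]")


def parse_poll_error(line):
    """Return (cycle time in UTC, list of error codes), or None if no match."""
    m = POLL_ERR_RE.search(line)
    if m is None:
        return None
    cycle_ms = int(m.group(1))
    # Assumption: the cycle number is epoch milliseconds.
    when = datetime.fromtimestamp(cycle_ms / 1000, tz=timezone.utc)
    errors = [e.strip() for e in m.group(2).split(",")]
    return when, errors
```

Under that assumption, the sample line above corresponds to the poll cycle at 2017-04-01 00:45:00 UTC, which falls on a 5-minute boundary.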

Cause B: The device supports only 32-bit SNMP counters

If the following WARN message appears in the Data Collector karaf log, the issue is Cause B:

com.ca.im.data-collection-manager.core.interfaces - | | Counter value rolled over, dropping response: previous=4285888934 / current=4049163 for IP IP address, OID polling OID, item ID id, in poll group gid
Further counter rollover messages for this IP will be suppressed unless DEBUG is enabled or the DC is restarted.

When an SNMP counter rollover occurs within one polling cycle, the polling data for that cycle will be missing, because the counter on which the metric is based is no longer valid, as described in the "Configure Counter Behavior" section of the DX NetOps Performance Management documentation.
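The rollover in the sample message can be made concrete with a little arithmetic. The sketch below extracts the previous and current values from such a line and computes what the delta would be if the 32-bit counter wrapped exactly once. This is for illustration only: a poller cannot tell from two samples how many times a counter wrapped, so the result is not a trustworthy metric. The function names are hypothetical.

```python
import re

# Matches the "previous=... / current=..." part of the sample WARN line.
ROLLOVER_RE = re.compile(r"previous=(\d+) / current=(\d+)")

COUNTER32_MODULUS = 2**32  # a 32-bit counter wraps back to 0 after 4294967295


def parse_rollover(line):
    """Return (previous, current) from a rollover message, or None."""
    m = ROLLOVER_RE.search(line)
    return None if m is None else (int(m.group(1)), int(m.group(2)))


def wrapped_delta(previous, current, modulus=COUNTER32_MODULUS):
    """Delta between two samples, assuming at most one wrap in between."""
    return (current - previous) % modulus
```

With the values from the sample message (previous=4285888934, current=4049163), `wrapped_delta` gives 13127525 under the single-wrap assumption.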

This may happen when the monitored device supports only 32-bit counters or only SNMPv1. The following error may appear in the "Discover Logging by IP" log in DcDebug when the 64-bit counter MIB objects are requested from such devices.

Finished on demand read. Response = SnmpResponse [error=SNMP_PARTIAL_FAILURE, errorIndex=-1, queriedIP=Device IP]
SnmpResponseVariable [oid=Polling OID, type=NULL, value={}, isDelta=false, isList=true, error=NO_SUCH_NAME, isDynamicIndex=false, indexList=[]]

A possible workaround is to shorten the poll interval for the device from 5 minutes to 1 minute. Refer to the "Poll Critical Interfaces Faster than Non-critical Interfaces" section of the DX NetOps Performance Management documentation for more information.
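A rough calculation shows why a shorter poll interval helps. The sketch below assumes a 32-bit octet counter (such as an interface byte counter) and a constant traffic rate, both simplifications. It estimates how long the counter takes to wrap and what share of poll cycles would contain a rollover, and therefore be dropped. The function names are hypothetical.

```python
def seconds_to_wrap(rate_bps, counter_bits=32):
    """Seconds for an octet counter to wrap at a constant rate in bits/second."""
    return (2**counter_bits) * 8 / rate_bps


def share_of_polls_with_wrap(rate_bps, poll_seconds, counter_bits=32):
    """Approximate fraction of poll cycles that contain a rollover (capped at 1)."""
    return min(1.0, poll_seconds / seconds_to_wrap(rate_bps, counter_bits))
```

At a sustained 50 Mbit/s, for example, the counter wraps roughly every 687 seconds, so about 44% of 5-minute cycles would contain a rollover compared with about 9% of 1-minute cycles. A 64-bit counter at the same rate would take thousands of years to wrap, which is why devices that expose 64-bit counters do not show this symptom.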

Cause C: CA Data Aggregator is slow under high load

If the following WARN message appears in the Data Aggregator karaf log, the issue is Cause C.

At the same time, the following event appears in the CA Performance Center Event List:

The Threshold Monitoring Engine has transitioned to a degraded state.

You will also see a peak in the following charts at the time of the event:

The "Number of Event Rules Evaluated" and "Percentage of Poll Cycle of Complete Event Processing" charts in the Data Aggregator Pages

If this is the case, increase the PercentOfPollCycleThreshold value. Refer to the "Threshold Monitoring and Threshold Limiter Behavior" section of the DX NetOps Performance Management documentation for more information.

Additional Information

DcDebug is the built-in discovery and polling debug tool. You can access and use it as follows:

Point the browser to: http://<DA_HOST>:8581/dcdebug/searchdebug.html
Enable detailed poll logging and detailed SNMP logging for the IP address you need to monitor.
The data for each successive poll will then appear on screen.