
Best practice for resolving missing graph chart plot data in CA Performance Center


Article ID: 10743


Updated On:

Products

CA Infrastructure Management, CA Performance Management - Usage and Administration, DX NetOps

Issue/Introduction

Sometimes, the metric charts or reports for some items/devices in CA Performance Center display gaps, indicating interrupted polling or responses. The following are the main causes of missing data in CAPC graphs.

  • A: SNMP timeout (the device is not responding or is slow to respond to polling)
  • B: The device only supports SNMP 32-bit counters.
  • C: CA Data Aggregator is slow due to high load.

Environment

OS: RHEL 7.x and 6.x

Cause

To determine what the cause is, check the following criteria:

  1. Check whether any errors are generated during the time the issue occurs in the following Data Aggregator and Data Collector logs (a log-scan sketch follows this list): 

    /opt/IMDataAggregator/apache-karaf-*/data/log/*
    /opt/IMDataCollector/apache-karaf-*/data/log/*

  2. The "Number of Event Rules Evaluated" and "Percentage of Poll Cycle of Complete Event Processing" chart in the Data Aggregator Pages and the Data Aggregator health charts on the CA Performance Center System Health tab.

  3. Run DcDebug (see Additional Information) for the problem device.

  4. Confirm whether the monitored device has physically changed: 
  • Verify that the device Status is not set to "Management Lost" in the CA Performance Center Administration menu > Monitored Devices > select the device > Details tab 
  • Verify that the SNMP Poll Rate is not set to "true-null" in the CA Performance Center Administration menu > Monitored Devices > select the device > Polled Metric Families tab > select the Interface Metric Family line > see the Components list in the pane below 
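
As a convenience when working through step 1, the following is a minimal sketch that scans the Data Aggregator and Data Collector karaf logs for ERROR/WARN lines inside the time window of the chart gaps. The log paths are the defaults listed above; the timestamp pattern and the window values are assumptions that you should adjust for your environment.

    # Minimal sketch: flag ERROR/WARN lines in the DA/DC karaf logs that fall
    # inside the window when the chart gaps occurred. The timestamp pattern and
    # the window below are assumptions; adjust them to your log format.
    import glob
    import os
    import re
    from datetime import datetime

    LOG_GLOBS = [
        "/opt/IMDataAggregator/apache-karaf-*/data/log/*",
        "/opt/IMDataCollector/apache-karaf-*/data/log/*",
    ]
    WINDOW_START = datetime(2023, 4, 1, 12, 0)   # start of the gap (placeholder)
    WINDOW_END = datetime(2023, 4, 1, 13, 0)     # end of the gap (placeholder)
    TS_RE = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")

    for path in (p for g in LOG_GLOBS for p in glob.glob(g)):
        if not os.path.isfile(path):
            continue
        with open(path, errors="replace") as fh:
            for line in fh:
                if "ERROR" not in line and "WARN" not in line:
                    continue
                match = TS_RE.search(line)
                if not match:
                    continue
                ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
                if WINDOW_START <= ts <= WINDOW_END:
                    print(f"{path}: {line.rstrip()}")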

 

Resolution

Cause A: SNMP timeout (the device is not responding or is slow to respond to polling)

If the following error appears in the "Poll Errors by IP" log in DcDebug, the cause is Cause A.

        POLLING_ERROR: errors for cycle 1491007500000[REQUEST_TIMED_OUT]

By default, the maximum response time is set to 9 seconds. See Broadcom TechDocs: DX NetOps 20.2.x CAPM - Modify the Timeout and Retries Parameters.

Moreover, frequent SNMP timeouts generate a CA Data Aggregator polling stopped event. See Broadcom TechDocs: DX NetOps 20.2.x CAPM - Polling Stopped Event Message.

If these errors or issues are the cause, increase the Timeout and/or Retries parameters of the CA Performance Center SNMP Profile used for these items/devices.
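
To illustrate how the two parameters interact, below is a small sketch using the third-party pysnmp library's synchronous high-level API; this is not the poller CAPC uses, it only shows the trade-off. The worst-case wait is roughly timeout x (retries + 1), so timeout=3 with retries=2 tolerates about 9 seconds of silence before the request is reported as timed out. The host, community string, and OID are placeholders.

    # Sketch only: a pysnmp (third-party) GET with explicit timeout/retries, to show
    # how a slow device can be accommodated. CAPC's own poller is configured via
    # the SNMP Profile instead, as described above.
    from pysnmp.hlapi import (
        SnmpEngine, CommunityData, UdpTransportTarget, ContextData,
        ObjectType, ObjectIdentity, getCmd,
    )

    TIMEOUT_SECONDS = 3   # per-attempt wait
    RETRIES = 2           # extra attempts after the first one

    # Worst case before a REQUEST_TIMED_OUT: timeout * (retries + 1) = ~9 seconds.
    errorIndication, errorStatus, errorIndex, varBinds = next(
        getCmd(
            SnmpEngine(),
            CommunityData("public"),                           # placeholder community
            UdpTransportTarget(("192.0.2.10", 161),            # placeholder device IP
                               timeout=TIMEOUT_SECONDS, retries=RETRIES),
            ContextData(),
            ObjectType(ObjectIdentity("1.3.6.1.2.1.1.3.0")),   # sysUpTime.0
        )
    )
    print(errorIndication or varBinds)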

  

Cause B: The device only supports SNMP 32-bit counters

If the following WARN message appears in the Data Collector karaf log, the cause is Cause B:

com.ca.im.data-collection-manager.core.interfaces - | | Counter value rolled over, dropping response: previous=4285888934 / current=4049163 for IP IP address, OID polling OID, item ID id, in poll group gid
Further counter rollover messages for this IP will be suppressed unless DEBUG is enabled or the DC is restarted.


When an SNMP counter rollover occurs within one polling cycle, the polled data is missing because the counter the metric is based on is no longer valid, as described in Configure Counter Behavior.
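
Using the values from the example WARN message above, the arithmetic below shows why the sample is dropped: the raw delta is negative, and the collector cannot know whether the counter wrapped exactly once during the cycle, so even the wrap-adjusted delta is unreliable. The "wrapped exactly once" assumption is ours, for illustration only.

    # Why a 32-bit counter wrap invalidates the sample: values taken from the
    # example WARN message above. The single-wrap assumption is ours; the
    # collector cannot verify it, so it drops the response instead.
    MAX_32BIT = 2 ** 32           # counter wraps back to 0 after 4294967295

    previous = 4285888934         # value at the previous poll
    current = 4049163             # value at this poll (smaller, so it wrapped)

    naive_delta = current - previous                       # -4281839771, clearly invalid
    single_wrap_delta = (MAX_32BIT - previous) + current   # 13127525, valid only if it wrapped once

    print(naive_delta, single_wrap_delta)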

This may happen when the monitored device supports only 32-bit counters or only SNMPv1. The following error may appear in the "Discover Logging by IP" log in DcDebug when reading a 64-bit SNMP counter MIB for such a device.   

       Finished on demand read. Response = SnmpResponse [error=SNMP_PARTIAL_FAILURE, errorIndex=-1, queriedIP=Device IP]
       ? SnmpResponseVariable [oid=Polling OID, type=NULL, value={}, isDelta=false, isList=true, error=NO_SUCH_NAME, isDynamicIndex=false, indexList=[]]

A possible workaround is to shorten the poll interval for the device from 5 minutes to 1 minute: Poll Critical Interfaces Faster than Non-critical Interfaces
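
The back-of-the-envelope calculation below (our own arithmetic, not taken from the product documentation) shows why the poll interval matters: a 32-bit octet counter at full line rate wraps in roughly 344 seconds at 100 Mbps and 34 seconds at 1 Gbps, so a 5-minute (300 second) interval can easily miss a wrap, and a 1-minute interval mainly helps for lower-speed links.

    # Back-of-the-envelope arithmetic (ours, not from the product docs): seconds
    # until a 32-bit octet counter such as ifInOctets wraps at full line rate.
    MAX_32BIT = 2 ** 32

    def seconds_to_wrap(link_bps: float) -> float:
        # The counter counts bytes, while link speeds are rated in bits per second.
        return MAX_32BIT / (link_bps / 8)

    for mbps in (100, 1000, 10000):
        print(f"{mbps:>6} Mbps: wraps after ~{seconds_to_wrap(mbps * 1_000_000):.0f} s")
    # ~344 s at 100 Mbps, ~34 s at 1 Gbps, ~3 s at 10 Gbps: a 300-second (5-minute)
    # poll can miss a wrap on a busy link; a 1-minute poll mainly helps slower links.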

  

Cause C: CA Data Aggregator is slow due to high load

If the following WARN message appears in the Data Aggregator karaf log, the cause is Cause C.

WARN | tory-thread-id | date time | onitoringProcessLimitManagerImpl | onitoringProcessLimitManagerImpl 98 | .ca.im.aggregator.loader | | Threshold Monitoring processing took too long. The system will shut that feature down in 15 minutes if the threshold monitoring continues to exceed capcacity

 

At the same time, the following event occurs and is shown in the CA Performance Center Event List:

       The Threshold Monitoring Engine has transitioned to a degraded state.

 

You will also see a peak in the following graph chart around the time of the event.

  • The "Number of Event Rules Evaluated" and "Percentage of Poll Cycle of Complete Event Processing" chart in the Data Aggregator Pages

 

If the above is the case, increase the PercentOfPollCycleThreshold value: Threshold Monitoring and Threshold Limiter Behavior  
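
The sketch below is our own illustration of the check implied by that setting, not the Data Aggregator's actual code: when threshold-monitoring (event-rule) processing consumes more than the configured percentage of the poll cycle, the cycle counts toward the degraded state. The poll cycle length and threshold value shown are hypothetical.

    # Our own illustration of the "percentage of poll cycle" check, not the
    # Data Aggregator's actual implementation. Values are hypothetical.
    POLL_CYCLE_SECONDS = 300                 # 5-minute poll cycle
    PERCENT_OF_POLL_CYCLE_THRESHOLD = 50     # hypothetical configured limit

    def exceeds_limit(event_processing_seconds: float) -> bool:
        percent_used = 100.0 * event_processing_seconds / POLL_CYCLE_SECONDS
        return percent_used > PERCENT_OF_POLL_CYCLE_THRESHOLD

    print(exceeds_limit(120))   # False: 40% of the cycle
    print(exceeds_limit(200))   # True: 66% of the cycle, counts toward degradation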

Additional Information

DcDebug is the built-in discovery and polling debug tool.  You can access and use it as follows:

  1. Point the browser to: http://<DA_HOST>:8581/dcdebug/searchdebug.html (a reachability check sketch follows these steps)

  2. Enable detailed poll logging for the IP you need to monitor, and enable detailed SNMP logging.

  3. The data for each successive poll will then appear on screen.
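
As a convenience before step 1, the snippet below simply confirms that the DcDebug page answers on port 8581; it is not part of the product, and the hostname is a placeholder. Enabling the detailed logging itself is done interactively in the browser as described above.

    # Convenience sketch (not part of the product): confirm the DcDebug page
    # answers on port 8581 before working through the steps above.
    import urllib.request

    DA_HOST = "da.example.com"   # placeholder Data Aggregator host
    url = f"http://{DA_HOST}:8581/dcdebug/searchdebug.html"

    with urllib.request.urlopen(url, timeout=10) as resp:
        print(resp.status, resp.reason)   # 200 means the debug UI is reachable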