Data Collector Hang or Crash due to MVEL SNMP calls


Article ID: 37279


Updated On:

Products

CA Infrastructure Management, CA Performance Management - Usage and Administration, CA Performance Management - Data Polling

Issue/Introduction

Issue:

The Data Collector appears to hang or crash in r2.7 Early Access/GA (build 133), or devices may be missing polled data. The cause is too many threads and/or ports being created to service MVEL SNMP calls. This excessive resource consumption can result in loss of polled data for some or all devices.

Symptoms:

  • There are devices using a Vendor Certification that uses the new snmpGet and snmpGetTable MVEL functions. Out of the box, vendor certifications using snmpGetTable include CiscoQosClassMapCounter64Mib and CiscoQosClassMapMib.
  • Because the problem affects the Data Collector (DC) as a whole, it may manifest as a loss of polled data for any polled device, even those not polled using the advanced MVEL snmpGet and snmpGetTable functions.
  • In the Data Collector karaf.log, karaf.out, or Exception.log, you may see:

                java.lang.OutOfMemoryError: unable to create new native thread
                at java.lang.Thread.start0(Native Method)[:1.8.0_60]
                at java.lang.Thread.start(Thread.java:714)[:1.8.0_60]
                at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)[:1.8.0_60]
                at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)[:1.8.0_60]
                at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:415)[:1.8.0_60]
                at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:372)[:1.8.0_60]
                at java.lang.Thread.run(Thread.java:745)[:1.8.0_60]

  • Running the following command yields a large number of open UDP ports on the DC: netstat -apn | grep java | grep udp | wc -l. A number is large if it exceeds the number of devices using any Vendor Certification that leverages the advanced MVEL expressions, plus 1 for the main SNMP session.

  • Running the following query provides the number of devices using any Vendor Certification that leverages the advanced MVEL expressions:

    SELECT d.hostname, v6_ntoa(d.primary_ip_address) as "address", i.item_id
    FROM item i
      JOIN v_item_facet vi ON i.item_id = vi.item_id
      JOIN poll_item pi ON i.item_id = pi.item_id
      JOIN device d ON pi.device_item_id = d.item_id
    WHERE vi.facet_qname LIKE '%CiscoQosClassMap%';

  • Checking the karaf.out file after a stack dump (kill -3 <karaf pid>) shows a large number of 'Response Pool' threads. To count them: grep "Response Pool.0" karaf.out | wc -l
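
The two numeric checks in the symptoms above can be wrapped in small shell helpers. This is a minimal sketch, not part of the product: the function names count_response_pool_threads and udp_ports_excessive are hypothetical, and only the grep pattern and the "devices plus 1" ceiling come from this article.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not shipped with the product).

# Count 'Response Pool' thread entries in a file that contains a thread dump
# (for example karaf.out after kill -3 <karaf pid>).
# $1: path to the dump file
count_response_pool_threads() {
    # grep -c counts matching lines, equivalent to grep ... | wc -l
    grep -c "Response Pool.0" "$1"
}

# Succeed (exit 0) when the observed UDP port count exceeds the expected
# ceiling: devices using advanced MVEL certifications, plus 1 for the main
# SNMP session.
# $1: observed UDP port count, $2: number of affected devices
udp_ports_excessive() {
    [ "$1" -gt $(( $2 + 1 )) ]
}
```

As a usage example, assuming the query above reported 40 devices (an illustrative figure): udp_ports_excessive "$(netstat -apn | grep java | grep udp | wc -l)" 40 && echo "UDP port count is excessive"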

 

Environment:

2.7

Cause:  

This condition occurs only if a device is actively polled using advanced MVEL expressions that perform additional snmpgets / snmpbulkgets in variables or expressions. Out of the box, two QoS class map certifications use such expressions. Customers should confirm either that they have devices polled using the QoS vendor certifications or that they have a custom certification or extension that uses such expressions. See the "Symptoms" section above to determine whether you are affected by this condition.

 

Workaround:

Temporary workaround (prior to upgrading the Data Collector)

  1. cd /opt/IMDataCollector/apache-karaf-2.3.0/etc 
  2. Create a file called com.ca.im.dm.certs.snmp.SNMPGetHelper.cfg 
  3. Add the following entry into the file: sessionExpiry=32000000 
  4. Save the file 
  5. A Data Collector restart is not required for the setting to take effect, but it is recommended: existing resources (open ports and threads) are not cleaned up until the Data Collector is restarted.
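
Steps 1 through 4 can also be scripted. The following is a minimal sketch: the directory, file name, and sessionExpiry value are taken from the steps above, while the function name write_snmpget_helper_cfg and its optional directory argument are hypothetical additions for illustration.

```shell
#!/bin/sh
# Hypothetical helper (name and argument are illustrative).
# Writes the workaround config file and prints its path.
# $1 (optional): Karaf etc directory; defaults to the path used in step 1.
write_snmpget_helper_cfg() {
    etc_dir="${1:-/opt/IMDataCollector/apache-karaf-2.3.0/etc}"
    cfg="$etc_dir/com.ca.im.dm.certs.snmp.SNMPGetHelper.cfg"
    # Creates the file, or overwrites it if it already exists.
    printf 'sessionExpiry=32000000\n' > "$cfg"
    echo "$cfg"
}
```

Calling write_snmpget_helper_cfg with no argument on the Data Collector host writes the file to the default location; a restart is still needed to release ports and threads that are already open.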

Resolution:

An updated DC installer is being provided on the download portal for the CA PM r2.7 product. This installer MUST be used to upgrade all Data Collectors when installing r2.7 to prevent any data loss. 

The fix will also be available in the first monthly update kit for r2.7.

Environment

Release:
Component: IMPOLL