How do we determine if the Data Collector (DC) is collecting polled metric data when the Data Aggregator (DA) it connects to is down?
We'd like to know how to confirm that the DC is still collecting data even when it cannot connect to the other nodes.
How do we track polled metric data burn down rates, monitoring their submission to the DA for database insertion?
All supported DX NetOps Performance Management releases
The Data Collector (DC) will continue to collect data when it can't connect to the Data Aggregator (DA), as long as the DC services remain running.
If both the Data Collector and the Data Aggregator are down, no data will be cached. In that scenario the Data Aggregator will need to be started before the Data Collector can start.
We can also check the PollSummary.log file on the DC (default path shown): /opt/IMDataCollector/apache-karaf-2.4.3/data/log/PollSummary.log. It logs each poll cycle per Metric Family/poll group.
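To confirm polling is still happening while the DA is down, a minimal sketch like the following (assuming the default path above; the script itself is ours, not a supported tool) can tail PollSummary.log and print each new poll-cycle entry as it is written:

import time

# Default PollSummary.log location on the DC (adjust for your install path).
LOG = "/opt/IMDataCollector/apache-karaf-2.4.3/data/log/PollSummary.log"

with open(LOG, "r") as f:
    f.seek(0, 2)                 # start at the end of the file; show only new entries
    while True:
        line = f.readline()
        if line:
            print(line.rstrip())  # a new poll cycle was logged, so the DC is still polling
        else:
            time.sleep(5)         # wait for the next poll cycle to be written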
If you add the following configuration, you can monitor the cache burndown while the DC is sending data to the DA. No restart is needed for org.ops4j.pax.logging.cfg file changes; they are read in by the dcmd service "on the fly".
In releases that use the log4j 1.x configuration format, edit:
/opt/IMDataCollector/apache-karaf-2.4.3/etc/org.ops4j.pax.logging.cfg
and add:
log4j.logger.com.ca.im.core.jms.health.JmsBrokerHealthAnalyser=DEBUG,sift
log4j.additivity.com.ca.im.core.jms.health.JmsBrokerHealthAnalyser=false
In releases that use the log4j2 configuration format, edit:
/opt/IMDataCollector/apache-karaf/etc/org.ops4j.pax.logging.cfg
and uncomment the entries under the "# JMS Health logging" header so they read:
log4j2.logger.JMSHealth.name = com.ca.im.core.jms.health
log4j2.logger.JMSHealth.level = DEBUG
log4j2.logger.JMSHealth.appenderRef.sift.ref = sift
This will create a log file named com.ca.im.common.core.jms.log under:
/opt/IMDataCollector/apache-karaf-*/data/log
To disable the logging, comment those lines back out. The file should look like this before saving the changes:
#log4j2.logger.JMSHealth.name = com.ca.im.core.jms.health
#log4j2.logger.JMSHealth.level = DEBUG
#log4j2.logger.JMSHealth.appenderRef.sift.ref = sift
Here is an example entry showing cached messages pending delivery while the DA is unreachable:
2021-04-10 18:32:21,791 | DEBUG | pool-14-thread-1 | JmsBrokerHealthAnalyser | s.health.JmsBrokerHealthAnalyser 149 | 179 - com.ca.im.common.core.jms - 20.2.9.RELEASE-542 | | JMS Health Statistics => Memory: 1.43MB/10.00MB, Disk: 12.86MB/2.47GB, Pending: 14381 msgs, Enqueue: 0 msg/sec, Dequeue: 0 msg/sec, Delay: -100 secs, Dropped: 0 msgs
Based on the statistics received from the broker, the Data Collector drops messages from the broker to control disk usage.
The Data Collector establishes the disk limit as the minimum of several values. In the example above, the disk limit calculation resulted in 2.47GB (in this case, 50% of the Data Collector JVM max heap won).
When the Data Collector detects that disk usage is higher than the disk limit, it begins to drop cached messages.
Because the Data Collector and the broker could reside on different filesystems, the Data Collector also starts dropping cached messages when broker filesystem usage is greater than 85%. Broker filesystem usage information arrives as part of the regular statistics.
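For illustration only, the drop decision described above can be sketched as follows; the function names and the second candidate in the minimum are assumptions made for the sketch, not the product's actual code:

# Illustrative sketch of the behaviour described above, not product code.
def disk_limit_bytes(jvm_max_heap_bytes, other_candidate_bytes):
    # The limit is the minimum of several candidates; in the example above,
    # 50% of the DC JVM max heap (2.47GB) was the smallest value and "won".
    return min(0.5 * jvm_max_heap_bytes, other_candidate_bytes)

def should_drop(broker_disk_usage_bytes, limit_bytes, broker_fs_usage_pct):
    # Drop cached messages once broker disk usage exceeds the limit, or once the
    # broker's filesystem (which may differ from the DC's) is more than 85% full.
    return broker_disk_usage_bytes > limit_bytes or broker_fs_usage_pct > 85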
Once the Data Aggregator is back online, the cached messages are delivered and Pending drops to 0, as in the example below. A quick restart of the broker releases the disk space.
2021-04-10 18:39:51,814 | DEBUG | pool-14-thread-1 | JmsBrokerHealthAnalyser | s.health.JmsBrokerHealthAnalyser 149 | 179 - com.ca.im.common.core.jms - 20.2.9.RELEASE-542 | | JMS Health Statistics => Memory: 0/10.00MB, Disk: 12.96MB/2.47GB, Pending: 0 msgs, Enqueue: 0 msg/sec, Dequeue: 0 msg/sec, Delay: 0 secs, Dropped: 0 msgs
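To track the burndown rate asked about in the question, a minimal sketch (assuming the default log location and the "JMS Health Statistics" line format shown above) can read the Pending count from consecutive entries and report how quickly the backlog drains:

import glob
import re
from datetime import datetime

# Resolve the apache-karaf-* directory under the default install location.
log_files = glob.glob(
    "/opt/IMDataCollector/apache-karaf-*/data/log/com.ca.im.common.core.jms.log")

# Matches the timestamp and Pending count in the JMS Health Statistics lines shown above.
PATTERN = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ .*Pending: (\d+) msgs")

samples = []
with open(log_files[0]) as f:
    for line in f:
        m = PATTERN.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            samples.append((ts, int(m.group(2))))

# Burndown rate = cached messages drained per second between consecutive samples.
for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
    secs = (t1 - t0).total_seconds() or 1
    print(f"{t1}  pending={p1}  burndown={(p0 - p1) / secs:.1f} msg/sec")

A steadily falling Pending value (positive burndown) confirms the cache is draining to the DA; a flat or rising value while Dequeue stays at 0 msg/sec indicates the DA is still unreachable.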