Symptoms:
1) Spikes in OI Metric Publisher and Metadata
2) Jarvis kafka data size keeps growing quickly, consuming all disk space on the Elastic nodes
On the Elastic nodes, in the /dxi/jarvis/kafka folder:
du -h --max-depth=1 | sort -h
After 24 hours, the size of the files has increased by more than 100%
What is the purpose of this data? How can we reduce its size? Why is the data not processed? How can we fix the problem?
DX Operational Intelligence 20.x
DX Application Performance Management 20.x
DX AXA 20.x
apmservices-OIMetricPublisher writes to the kafka topic 1MinMetrics*.
This topic is not read by Jarvis but is processed by the dsp-integrator component for anomaly detection. The number of metrics exported to this topic is controlled by the oimetricpublisher regex configurations; if the regexes match too broadly, a very large volume of metric data is generated.
You can identify this condition from Cluster Management > Metrics View > apmservices | oimetricpublisher | 001 | OI Metric Publisher : Metrics Processed Per Interval
In this example, you can see that the number of metrics exported to the "1MinMetrics" topic is more than a million, causing the issue
1) Change the oimetricpublisher configuration (from Cluster Manager) to reduce the number of metrics exported
Go to Cluster Manager (login as masteradmin)
Go to Cluster Settings and locate the following properties:
apm.oimetricpublisher.profiles.tier1.regex.0.attribute
apm.oimetricpublisher.profiles.tier2.regex.0.attribute
apm.oimetricpublisher.profiles.tier3.regex.0.attribute
Update the value
Business Segment\|.*|By Business Service\|.*|Frontends\|Apps\|.*|By Frontend\|[^|]+\|Health:.*|CPU\|Processor.*:Utilization % \(aggregate\)|CPU:Utilization % \(process\)|GC Monitor.*|GC Heap.*|(.*)\|(Business Process|Business Service)\|(.*)\|Business Transactions\|(.*):(.*)|EJB\|(.*):Average Method Invocation Time \(ms\)|Backends(.*)|Frontends\|Messaging Services(.*)|JNDI(.*)|WebServices(.*)|Threads(.*)|Oracle Databases(.*)
with
CPU\|Processor.*:Utilization % \(aggregate\)|CPU:Utilization % \(process\)|GC Monitor.*|GC Heap.*
Make sure the new value does not end with a trailing "|": an empty alternative at the end of the regex matches every metric name, which would defeat the purpose of the change.
Apply the same change to all 3 properties above
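If you want to sanity-check the reduced regex before applying it, you can test sample metric paths against it locally with grep -E. The metric names below are illustrative examples, not values from your environment:

```shell
# Reduced regex from step 1 (only the metric families to keep)
regex='CPU\|Processor.*:Utilization % \(aggregate\)|CPU:Utilization % \(process\)|GC Monitor.*|GC Heap.*'

# A metric family that should still be exported: matches
echo 'GC Heap:Bytes In Use' | grep -qE "$regex" && echo kept

# A metric family that should no longer be exported: no match
echo 'Backends|DB:Average Response Time (ms)' | grep -qE "$regex" || echo dropped

# Pitfall: with a trailing "|" the regex gains an empty alternative and matches everything
echo 'Backends|DB:Average Response Time (ms)' | grep -qE "${regex}|" && echo over-matching
```

The same check works for any candidate regex you want to use in the three properties above.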
2) Reduce the kafka retention for the 1minMetrics topic from the default 24 hours to a smaller value (for example, 4 hours)
a) First, reduce the retention for the topic to a very small value so the existing kafka data is deleted:
/opt/ca/kafka/bin/kafka-topics.sh --zookeeper jarvis-zookeeper:2181 --alter --topic 1minMetrics --config retention.ms=1000
Verify the change has been applied successfully:
/opt/ca/kafka/bin/kafka-topics.sh --zookeeper jarvis-zookeeper:2181 --describe --topic 1minMetrics
Wait for 10 to 15 minutes, then check that the kafka size has been reduced using: du -h --max-depth=1 | sort -h
b) Finally, set the retention to 4 hours (the default is 24 hours):
/opt/ca/kafka/bin/kafka-topics.sh --zookeeper jarvis-zookeeper:2181 --alter --topic 1minMetrics --config retention.ms=14400000
Verify the change has been applied successfully using: /opt/ca/kafka/bin/kafka-topics.sh --zookeeper jarvis-zookeeper:2181 --describe --topic 1minMetrics
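For reference, retention.ms is expressed in milliseconds, so the values used above convert as follows (a quick arithmetic check):

```shell
# retention.ms is in milliseconds
echo $((4 * 60 * 60 * 1000))     # 4 hours  -> 14400000 (the value set above)
echo $((12 * 60 * 60 * 1000))    # 12 hours -> 43200000 (if you prefer a larger window)
echo $((24 * 60 * 60 * 1000))    # 24 hours -> 86400000 (the default)
```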