AIOps - kafka data consuming all disk space in Elastic nodes

Article ID: 222125


Updated On:

Products

DX Operational Intelligence
DX Application Performance Management
CA App Experience Analytics

Issue/Introduction

Symptoms:

1) Spikes in OI Metric Publisher and Metadata

 

2) Jarvis kafka size keeps growing very quickly, consuming all disk space on the Elastic nodes

From the Elastic nodes, in the /dxi/jarvis/kafka folder, run:

du -h --max-depth=1 | sort -h

After 24 hours, the size of the files has increased by more than 100%.
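To see which topic directories account for most of the growth, you can also sort the sizes of the individual kafka log directories. The layout assumed below (one folder per topic partition, for example 1minMetrics-0) is the usual kafka convention and may differ slightly in your deployment:

cd /dxi/jarvis/kafka
# Show the 20 largest kafka log directories (one folder per topic partition)
du -sh ./*/ | sort -h | tail -20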

What is the purpose of this data? How can we reduce its size? Why is the data not processed? How can the problem be fixed?

 

Environment

DX Operational Intelligence 20.x
DX Application Performance Management 20.x
DX AXA 20.x

Cause

apmservices-OIMetricPublisher writes to the kafka topic 1MinMetrics*.

This topic is not read by Jarvis but is processed by the dsp-integrator component for anomaly detection. The number of metrics exported to this topic is controlled through the oimetricpublisher regex configurations; if the regex matches too many metrics, an excessive amount of metric data is generated.

You can identify this condition from Cluster Management > Metrics View > apmservices | oimetricpublisher | 001 | OI Metric Publisher : Metrics Processed Per Interval

In this example, you can see that the number of metrics exported to the "1MinMetrics" topic is more than a million, which causes the issue.
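To confirm the message volume directly on kafka, one option (a sketch only; the broker address jarvis-kafka:9092 is an assumption and may differ in your environment) is to print the latest offset of each partition, whose sum is the total number of messages produced to the topic:

# Latest offset per partition of the 1minMetrics topic (assumed broker address)
/opt/ca/kafka/bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list jarvis-kafka:9092 --topic 1minMetrics --time -1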

 

Resolution

1) Change the oimetricpublisher configuration (from Cluster Manager) to reduce the number of metrics exported

Go to Cluster Manager (log in as masteradmin)

Go to Cluster Settings and locate the following properties:

apm.oimetricpublisher.profiles.tier1.regex.0.attribute 

apm.oimetricpublisher.profiles.tier2.regex.0.attribute 

apm.oimetricpublisher.profiles.tier3.regex.0.attribute 

Update the value

Business Segment\|.*|By Business Service\|.*|Frontends\|Apps\|.*|By Frontend\|[^|]+\|Health:.*|CPU\|Processor.*:Utilization % \(aggregate\)|CPU:Utilization % \(process\)|GC Monitor.*|GC Heap.*|(.*)\|(Business Process|Business Service)\|(.*)\|Business Transactions\|(.*):(.*)|EJB\|(.*):Average Method Invocation Time \(ms\)|Backends(.*)|Frontends\|Messaging Services(.*)|JNDI(.*)|WebServices(.*)|Threads(.*)|Oracle Databases(.*) 

With 

CPU\|Processor.*:Utilization % \(aggregate\)|CPU:Utilization % \(process\)|GC Monitor.*|GC Heap.*

 

Apply the same change to all three properties above.
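If you want a quick sanity check of the reduced regex before applying it, the sketch below tests it against two made-up metric paths with grep. Note that grep -E performs a substring match while the publisher may anchor the expression, so treat the result only as an approximation:

regex='CPU\|Processor.*:Utilization % \(aggregate\)|CPU:Utilization % \(process\)|GC Monitor.*|GC Heap.*'
echo 'GC Heap:Bytes In Use Post GC (mb)' | grep -E "$regex"                      # expected to match
echo 'Business Segment|MyApp|Health:Errors Per Interval' | grep -E "$regex" || echo 'filtered out'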


2) Reduce the kafka retention for the 1minMetrics topic from the default 24 hours to 12 hours or less

a) Connect to a kafka pod
 
If you are using OpenShift, go to the OpenShift console | Applications | Pods | <kafka pod> | Terminal
Otherwise, you can exec into any of the kafka pods:

kubectl get pods -n<dxi-namespace> | grep kafka
kubectl exec -ti <kafka-pod> sh -n<dxi-namespace>
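Optionally, confirm the exact topic name first (kafka topic names are case sensitive); this uses the same zookeeper endpoint as the commands below:

/opt/ca/kafka/bin/kafka-topics.sh --zookeeper jarvis-zookeeper:2181 --list | grep -i metrics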

b) First reduce retention for the topic to a very small number to delete the existing kafka data

/opt/ca/kafka/bin/kafka-topics.sh --zookeeper jarvis-zookeeper:2181 --alter --topic 1minMetrics --config retention.ms=1000

Verify the change has been applied successfully:

/opt/ca/kafka/bin/kafka-topics.sh --zookeeper jarvis-zookeeper:2181 --describe --topic 1minMetrics

Wait for 10 to 15 minutes, then check that the kafka size has been reduced using: du -h --max-depth=1 | sort -h


c) Finally, set the retention to 4 hours (the default is 24 hours)

/opt/ca/kafka/bin/kafka-topics.sh --zookeeper jarvis-zookeeper:2181 --alter --topic 1minMetrics --config retention.ms=14400000
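The retention.ms value is hours x 60 x 60 x 1000, so 4 hours = 14400000 ms; for 12 hours the value would be 43200000 ms.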

Verify the change has been applied successfully using: /opt/ca/kafka/bin/kafka-topics.sh --zookeeper jarvis-zookeeper:2181 --describe --topic 1minMetrics

Additional Information

DX Platform - Jarvis (kafka, zookeeper, elasticSearch) Troubleshooting