DX AIOps - How to check Kafka Health

Article ID: 272231

Updated On:

Products

DX Operational Intelligence DX Application Performance Management CA App Experience Analytics

Issue/Introduction

The following is a high-level list of techniques and suggestions for troubleshooting common Jarvis performance and configuration issues.

Environment

DX AIOps 2.x

Resolution

1. Checklist

 

1) Check the Kafka logs and search for ERROR or WARN entries
 
Kafka logs are available from:
 
a) <NFS>/jarvis/kafka-logs/kafka-<#>/*.log
b) If you are using OpenShift, go to the OpenShift console | Applications | Pods | <kafka pod> | Logs
 
Here is an example where Kafka cannot send its heartbeat on time, which affects connectivity between Kafka and ZooKeeper:
 
Client session timed out, have not heard from server in 20010ms for sessionid 0x1709fbc4e26000a, closing socket connection and attempting reconnect
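
The log search can be scripted when several log files have to be checked. The following is a minimal sketch, assuming the Kafka log directory is readable from the host where it runs; the function name count_kafka_log_issues is hypothetical and not part of the product.

```shell
# Hypothetical helper: print the number of ERROR/WARN lines in each Kafka log file.
# Usage: count_kafka_log_issues <NFS>/jarvis/kafka-logs/kafka-<#>
count_kafka_log_issues() {
  log_dir=$1
  for log_file in "$log_dir"/*.log; do
    # Skip the unexpanded pattern when the directory holds no .log files
    [ -f "$log_file" ] || continue
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so "|| true" keeps the script going in that case
    count=$(grep -c -E 'ERROR|WARN' "$log_file" || true)
    printf '%s: %s\n' "$log_file" "$count"
  done
}
```

Files with a non-zero count are the ones to open and inspect for messages such as the session timeout shown above.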
 
 
 
2) Check whether there is a lag in processing the data

a) If you are using OpenShift, go to the OpenShift console | Applications | Pods | <kafka pod> | Terminal
Otherwise, open a shell in any of the Kafka pods:

kubectl get pods -n <dxi-namespace> | grep kafka
kubectl exec -ti <kafka-pod> -n <dxi-namespace> -- sh
 
b) Run the commands below to determine whether there is a lag:
 
 
List all available topics (the --zookeeper option applies to older Kafka versions only; it was removed in Kafka 3.0, where --bootstrap-server is required):

/opt/ca/kafka/bin/kafka-topics.sh --zookeeper jarvis-zookeeper:2181 --list

/opt/ca/kafka/bin/kafka-topics.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --list

Describe all topics:

/opt/ca/kafka/bin/kafka-topics.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --describe

List all consumer groups:

/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --list

Check for a possible lag in Jarvis. Recommendation: verify that the LAG column is not always > 0.

/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --describe --group jarvis_indexer

/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --describe --group indexer

/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --describe --group verifier


If AXA data is impacted (i.e., data is displayed but arrives late), see https://knowledge.broadcom.com/external/article/238179

/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --describe --group axa.transformer
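
The LAG check above can be filtered automatically. The following is a minimal sketch, not part of the product: the function name show_lagging_partitions is hypothetical, and it assumes only that the --describe output contains a header row with a column named LAG.

```shell
# Hypothetical helper: read `kafka-consumer-groups.sh --describe` output on stdin
# and print every partition row whose LAG value is greater than zero.
# The LAG column is located by its header name, because the column order
# differs between Kafka versions.
show_lagging_partitions() {
  awk '
    # Until the header row is seen, look for the field named LAG
    lag_col == 0 {
      for (i = 1; i <= NF; i++) if ($i == "LAG") lag_col = i
      next
    }
    # Data rows: LAG is "-" when no offset is committed, so require a positive number
    $lag_col ~ /^[0-9]+$/ && $lag_col + 0 > 0 { print }
  '
}

# Example (run inside a Kafka pod):
# /opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092 \
#   --describe --group jarvis_indexer | show_lagging_partitions
```

A single non-empty result is not necessarily a problem. Run the check several times; the condition to look for is a LAG value that stays above 0 or keeps growing.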

 

Example illustrating a LAG condition: (screenshot not included)

Example illustrating the consumer disconnection condition: (screenshot not included)

Recommendation: Restart the Jarvis services as follows:

Scale down:
- jarvis-verifier                
- jarvis-lean-jarvis-indexer 
- jarvis-indexer

Scale up:
- jarvis-verifier                
- jarvis-lean-jarvis-indexer 
- jarvis-indexer

The corresponding kubectl commands are:

a) Scale down the following deployments:

kubectl scale --replicas=0 deployment jarvis-verifier -n<namespace>
kubectl scale --replicas=0 deployment jarvis-lean-jarvis-indexer -n<namespace>
kubectl scale --replicas=0 deployment jarvis-indexer -n<namespace>

b) Verify that all pods are down:

kubectl get pods -n<namespace> | grep -E "jarvis-verifier|jarvis-lean|jarvis-indexer"

c) Scale up the deployments:

kubectl scale --replicas=1 deployment jarvis-verifier -n<namespace>
kubectl scale --replicas=1 deployment jarvis-lean-jarvis-indexer -n<namespace>
kubectl scale --replicas=1 deployment jarvis-indexer -n<namespace>

d) Verify that all pods are up and running:

kubectl get pods -n<namespace> | grep -E "jarvis-verifier|jarvis-lean|jarvis-indexer"

e) Verify that alarms and ServiceNow incidents are reported as expected
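
Steps a) through d) can be combined into one function. This is a sketch under stated assumptions rather than a supported procedure: the function name restart_jarvis_consumers, the 5-second polling interval, the 60-attempt limit, and the 300-second rollout timeout are all illustrative choices that do not come from the product documentation.

```shell
# Deployment names taken from the restart steps in this article
JARVIS_DEPLOYMENTS="jarvis-verifier jarvis-lean-jarvis-indexer jarvis-indexer"

# Hypothetical helper: scale the deployments to 0, wait for their pods to
# disappear, scale back to 1, then wait for each rollout to complete.
# Usage: restart_jarvis_consumers <namespace>
restart_jarvis_consumers() {
  ns=$1
  for d in $JARVIS_DEPLOYMENTS; do
    kubectl scale --replicas=0 deployment "$d" -n "$ns" || return 1
  done
  # Poll until no matching pod is listed (same filter as the manual check)
  tries=0
  while kubectl get pods -n "$ns" --no-headers 2>/dev/null |
        grep -E -q 'jarvis-verifier|jarvis-lean|jarvis-indexer'; do
    tries=$((tries + 1))
    # Assumed limit: give up after 60 polls of 5 seconds each
    [ "$tries" -lt 60 ] || { echo "pods still present, giving up" >&2; return 1; }
    sleep 5
  done
  for d in $JARVIS_DEPLOYMENTS; do
    kubectl scale --replicas=1 deployment "$d" -n "$ns" || return 1
  done
  for d in $JARVIS_DEPLOYMENTS; do
    # Blocks until the deployment is rolled out or the assumed timeout expires
    kubectl rollout status deployment "$d" -n "$ns" --timeout=300s || return 1
  done
}
```

Step e) still has to be checked manually, since a successful rollout only shows that the pods are running, not that alarms and incidents are flowing again.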

2. What to collect if the problem persists?

If the problem persists after applying the above checks and recommendations, collect the logs below and contact Broadcom Support:

 
<NFS>/jarvis/apis/logs/<jarvis-apis-pod>/*.log
<NFS>/jarvis/indexer/logs/<jarvis-indexer-pod>/*.log
<NFS>/jarvis/kafka-logs/kafka-<#>/*.log
<NFS>/jarvis/esutils/logs/<jarvis-esutils-pod>/*.log
<NFS>/jarvis/zookeeper-logs/zookeeper-<#>/*.log
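
The log locations above can be bundled into a single archive before opening a support case. The following is a minimal sketch, assuming the NFS share is mounted on the host where it runs; the function name collect_jarvis_logs and the archive name in the usage comment are hypothetical.

```shell
# Hypothetical helper: pack the .log files from the locations listed above
# into one compressed archive.
# Usage: collect_jarvis_logs <NFS> /absolute/path/jarvis-logs.tar.gz
# The archive path must be absolute because the function changes directory.
collect_jarvis_logs() (
  nfs_root=$1
  archive=$2
  cd "$nfs_root" || exit 1
  # Build the file list in the positional parameters of this subshell
  set --
  for f in jarvis/apis/logs/*/*.log \
           jarvis/indexer/logs/*/*.log \
           jarvis/kafka-logs/kafka-*/*.log \
           jarvis/esutils/logs/*/*.log \
           jarvis/zookeeper-logs/zookeeper-*/*.log; do
    # Patterns that match nothing stay unexpanded, so keep real files only
    if [ -f "$f" ]; then
      set -- "$@" "$f"
    fi
  done
  [ "$#" -gt 0 ] || { echo "no log files found under $nfs_root" >&2; exit 1; }
  tar -czf "$archive" "$@"
)
```

Paths inside the archive are stored relative to the NFS root, so the directory layout is preserved when the archive is unpacked.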

Additional Information

https://knowledge.broadcom.com/external/article/190815/aiops-troubleshooting-common-issues-and.html