Products

DX Operational Intelligence
DX Application Performance Management
CA App Experience Analytics

Issue/Introduction

The following is a high-level list of techniques and suggestions to use when troubleshooting common Jarvis performance and configuration issues.

Environment

DX AIOps 2.x

Resolution

1. Checklist

1) Check the Kafka logs and search for ERROR or WARN entries

Kafka logs are available from:

a) <NFS>/jarvis/kafka-logs/kafka-<#>/*.log
b) If you are using OpenShift, go to the OpenShift console | Applications | Pods | <kafka pod> | Logs

Here is an example where Kafka cannot send its heartbeat on time, which affects the Kafka-to-ZooKeeper connectivity:

Client session timed out, have not heard from server in 20010ms for sessionid 0x1709fbc4e26000a, closing socket connection and attempting reconnect
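The log check above can be sketched as a pair of small shell helpers. This is a minimal sketch, not part of the product: the function names scan_kafka_logs and find_zk_timeouts are hypothetical, and the directory argument is assumed to be one kafka-<#> log directory on the NFS share.

```shell
# Count ERROR and WARN lines in each *.log file of one Kafka log directory.
# Usage: scan_kafka_logs <log-dir>
scan_kafka_logs() {
    log_dir=$1
    for f in "$log_dir"/*.log; do
        [ -f "$f" ] || continue                      # glob matched nothing
        errors=$(grep -c 'ERROR' "$f" || true)       # grep -c prints 0 on no match
        warns=$(grep -c 'WARN' "$f" || true)
        printf '%s ERROR=%s WARN=%s\n' "$f" "$errors" "$warns"
    done
}

# Print the ZooKeeper session timeout messages shown in the example above.
# Usage: find_zk_timeouts <log-dir>
find_zk_timeouts() {
    grep -h 'Client session timed out' "$1"/*.log 2>/dev/null
}
```

Call scan_kafka_logs with the path of a kafka-<#> directory under <NFS>/jarvis/kafka-logs to get one summary line per log file. Files with non-zero counts are the ones to read first.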

2) Check whether there is a LAG in processing the data

a) If you are using OpenShift, go to the OpenShift console | Applications | Pods | <kafka pod> | Terminal

Otherwise, open a shell in any of the Kafka pods:

kubectl get pods -n<dxi-namespace> | grep kafka

kubectl exec -ti <kafka-pod> -n<dxi-namespace> -- sh

b) Run the commands below to check whether there is a LAG:

List all available topics (the --zookeeper form applies to older Kafka releases; newer releases use --bootstrap-server):

/opt/ca/kafka/bin/kafka-topics.sh --zookeeper jarvis-zookeeper:2181 --list

/opt/ca/kafka/bin/kafka-topics.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --list

/opt/ca/kafka/bin/kafka-topics.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --describe

List all consumer groups:

/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --list

Check for a possible LAG in Jarvis.

Recommendation: Verify that the LAG column does not stay above 0 across repeated runs.

/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --describe --group jarvis_indexer

/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --describe --group indexer

/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --describe --group verifier

If AXA data is impacted (i.e., data is displayed but late), see https://knowledge.broadcom.com/external/article/238179

/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --describe --group axa.transformer
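Reading the LAG column by eye is error-prone when a group has many partitions. The helper below is a rough sketch that filters the --describe output down to the partitions with a LAG above 0 and prints a total. The function name report_lag is hypothetical. It assumes the output has a header row with TOPIC, PARTITION and LAG columns, and it finds those columns by name because their position can differ between Kafka versions.

```shell
# Read `kafka-consumer-groups.sh --describe` output on stdin, print each
# partition whose LAG is greater than 0, then print the summed LAG.
# Usage (inside a Kafka pod):
#   /opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092 \
#       --describe --group jarvis_indexer | report_lag
report_lag() {
    awk '
        # Until the header row is seen, look for the column names.
        lag_col == 0 {
            for (i = 1; i <= NF; i++) {
                if ($i == "LAG")       lag_col = i
                if ($i == "TOPIC")     topic_col = i
                if ($i == "PARTITION") part_col = i
            }
            next
        }
        # Data rows: ignore non-numeric LAG values such as "-".
        $lag_col ~ /^[0-9]+$/ && $lag_col > 0 {
            printf "%s partition %s lag %s\n", $topic_col, $part_col, $lag_col
            total += $lag_col
        }
        END { printf "TOTAL_LAG %d\n", total }
    '
}
```

A single non-zero total is not a problem in itself. Run the check several times a few minutes apart; a total that never returns to 0, or keeps growing, matches the condition described above.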

[Image: example illustrating a LAG condition]

[Image: example illustrating a consumer disconnection condition]

Recommendation: Restart the Jarvis services as follows:

Scale down:
- jarvis-verifier
- jarvis-lean-jarvis-indexer
- jarvis-indexer

Scale up:
- jarvis-verifier
- jarvis-lean-jarvis-indexer
- jarvis-indexer

Below is the list of kubectl commands:

a) Scale down the following deployments:

kubectl scale --replicas=0 deployment jarvis-verifier -n<namespace>
kubectl scale --replicas=0 deployment jarvis-lean-jarvis-indexer -n<namespace>
kubectl scale --replicas=0 deployment jarvis-indexer -n<namespace>

b) Verify that all pods are down:

kubectl get pods -n<namespace> | grep -E "jarvis-verifier|jarvis-lean|jarvis-indexer"

c) Scale up the deployments:

kubectl scale --replicas=1 deployment jarvis-verifier -n<namespace>
kubectl scale --replicas=1 deployment jarvis-lean-jarvis-indexer -n<namespace>
kubectl scale --replicas=1 deployment jarvis-indexer -n<namespace>

d) Verify that all pods are up and running:

kubectl get pods -n<namespace> | grep -E "jarvis-verifier|jarvis-lean|jarvis-indexer"

e) Verify that alarms and ServiceNow incidents are reported as expected
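To avoid retyping the namespace in every command, the restart sequence in steps a) to d) can be generated from a single argument. The sketch below only prints the commands and executes nothing. The function name jarvis_restart_cmds is hypothetical, and the replica count of 1 is taken from the steps above, so adjust it if your deployments normally run more replicas.

```shell
# Print the restart sequence for the Jarvis consumer deployments.
# Nothing is executed here: review the output and run it step by step.
# Usage: jarvis_restart_cmds <namespace>
jarvis_restart_cmds() {
    ns=$1
    deployments="jarvis-verifier jarvis-lean-jarvis-indexer jarvis-indexer"
    pods="jarvis-verifier|jarvis-lean|jarvis-indexer"

    # a) Scale down
    for d in $deployments; do
        echo "kubectl scale --replicas=0 deployment $d -n $ns"
    done
    # b) Verify that all pods are down
    echo "kubectl get pods -n $ns | grep -E '$pods'"
    # c) Scale up
    for d in $deployments; do
        echo "kubectl scale --replicas=1 deployment $d -n $ns"
    done
    # d) Verify that all pods are up and running
    echo "kubectl get pods -n $ns | grep -E '$pods'"
}
```

The printed commands are meant to be run one step at a time. Wait until the pod check in step b) returns no pods before moving on to the scale-up commands.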

2. What to collect if the problem persists

If the problem persists after applying the above checks and recommendations, collect the logs below and contact Broadcom Support:

<NFS>/jarvis/apis/logs/<jarvis-apis-pod>/*.log
<NFS>/jarvis/indexer/logs/<jarvis-indexer-pod>/*.log
<NFS>/jarvis/kafka-logs/kafka-<#>/*.log
<NFS>/jarvis/esutils/logs/<jarvis-esutils-pod>/*.log
<NFS>/jarvis/zookeeper-logs/zookeeper-<#>/*.log
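The five log locations above can be bundled into one archive before opening a support case. This is a sketch under stated assumptions: the function name collect_jarvis_logs is hypothetical, the first argument is the NFS root that contains the jarvis directory, and the installed tar accepts a file list on stdin through -T - (GNU tar and bsdtar both do).

```shell
# Bundle the Jarvis *.log files listed above into one compressed archive.
# Usage: collect_jarvis_logs <nfs-root> <output.tar.gz>
collect_jarvis_logs() {
    nfs_root=$1
    # Make the archive path absolute, because the subshell below changes directory.
    case $2 in
        /*) archive=$2 ;;
        *)  archive=$PWD/$2 ;;
    esac
    (
        cd "$nfs_root" || exit 1
        # Same patterns as the list above; patterns that match nothing are skipped.
        for f in jarvis/apis/logs/*/*.log \
                 jarvis/indexer/logs/*/*.log \
                 jarvis/kafka-logs/kafka-*/*.log \
                 jarvis/esutils/logs/*/*.log \
                 jarvis/zookeeper-logs/zookeeper-*/*.log; do
            if [ -f "$f" ]; then printf '%s\n' "$f"; fi
        done | tar -czf "$archive" -T -
    )
}
```

Paths inside the archive are relative to the NFS root, so the pod and broker directory names are preserved and each log's origin stays clear.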

Additional Information

https://knowledge.broadcom.com/external/article/190815/aiops-troubleshooting-common-issues-and.html

DX AIOps - How to check Kafka Health

Article ID: 272231
