IMPORTANT: Kafka nodes/brokers should always be connected to zookeeper
1) Check if kafka brokers are connected to zookeeper
If you are using Openshift, go to the Openshift console | Applications | Pods | <zookeeper pod> | Terminal
Otherwise, you can open a shell in the zookeeper pod:
kubectl get pods -n<dxi-namespace> | grep zookeeper
kubectl exec -ti <zookeeper-pod> -n<dxi-namespace> -- sh
cd /opt/ca/zookeeper/bin
./zkCli.sh ls /brokers/ids
Expected results: The command lists the IDs of the kafka brokers connected to zookeeper. If you have a medium elastic deployment, the result should be [0, 1, 2], as in the example below:
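For example, with all three brokers registered the last line of the zkCli output is the broker ID list (the connection messages printed above it vary by zookeeper version):
./zkCli.sh ls /brokers/ids
...
[0, 1, 2]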
If you have a medium elastic deployment and you see only 1 or 2 brokers listed, then some kafka brokers are having issues (they are down or have disconnected from zookeeper).
Recommendations:
a) Check that all kafka pods are up and running; if you have 3 elastic nodes, you should have 3 kafka pods.
kubectl get pods -n<dxi-namespace> | grep kafka
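For illustration only (the pod names below are placeholders; the actual names depend on your deployment), a healthy 3-node setup shows three kafka pods in Running state:
kafka-0    1/1    Running    0    2d
kafka-1    1/1    Running    0    2d
kafka-2    1/1    Running    0    2d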
b) Restart the problematic kafka pods:
- Find out which kafka pods are the problematic ones to restart:
(In Openshift) Go to each of the Kafka pods > Environment tab and check the BROKER_ID variable; it tells you which kafka pod corresponds to which broker (for example, broker #2).
(In Kubernetes): kubectl describe po <kafka pod> -n<namespace> and check the BROKER_ID environment variable (see the command sketch after this list).
- Once you have identified the problematic pods:
(In Openshift) click Actions > "Delete". (In Kubernetes): kubectl delete po <kafka pod> -n<namespace>
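A minimal sketch for Kubernetes, assuming BROKER_ID is exposed as a pod environment variable as described above (replace <dxi-namespace> and the pod name in the delete command with your own values). It prints the BROKER_ID of each kafka pod so you can match brokers to pods, then restarts the problematic pod by deleting it:
# List each kafka pod together with its BROKER_ID environment variable
for p in $(kubectl get pods -n<dxi-namespace> -o name | grep kafka); do
  echo "$p: $(kubectl describe $p -n<dxi-namespace> | grep BROKER_ID)"
done
# Delete the pod that maps to the problematic broker so it is recreated
kubectl delete po <problematic-kafka-pod> -n<dxi-namespace>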
2) Check the zookeeper logs and search for ERROR or WARN entries
Zookeeper logs are available from:
a) <NFS>/jarvis/zookeeper-logs/zookeeper-<#>/*.log
b) If you are using Openshift, go to the Openshift console | Applications | Pods | <zookeeper-pod> | Logs
c) You can use oc or kubectl as below:
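For example (a sketch only; substitute your own pod name and namespace, and use oc logs instead of kubectl logs on Openshift):
kubectl logs <zookeeper-pod> -n<dxi-namespace> | grep -E "ERROR|WARN"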
Here is an example of the warning logged when the ZooKeeper disk write (fsync) duration exceeds 1s:
WARN [SyncThread:3:FileTxnLog@338] - fsync-ing the write ahead log in SyncThread:3 took 16313ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
2. What to collect if the problem persists?
If, after applying the above checks and recommendations, the problem persists, collect the following logs and contact Broadcom Support: