DX AIOps - How to check the Zookeeper to Kafka connectivity
Article ID: 272229


Products

DX Operational Intelligence, DX Application Performance Management, CA App Experience Analytics

Issue/Introduction

The following is a high-level list of techniques and suggestions to employ when troubleshooting common Jarvis performance and configuration issues.

Environment

DX Platform 2.x

Resolution

1. Checklist

 

IMPORTANT: Kafka nodes/brokers should always be connected to ZooKeeper.
 
1) Check if the Kafka brokers are connected to ZooKeeper
 
If you are using OpenShift, go to the OpenShift console | Applications | Pods | <zookeeper-pod> | Terminal
Otherwise, you can exec into the zookeeper pod:

kubectl get pods -n <dxi-namespace> | grep zookeeper
kubectl exec -ti <zookeeper-pod> -n <dxi-namespace> -- sh
 
cd /opt/ca/zookeeper/bin
./zkCli.sh
Then, at the zkCli prompt:
ls /brokers/ids
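Alternatively, recent ZooKeeper releases let you pass the command to zkCli.sh on the command line so it runs once and exits; a minimal sketch, assuming the client can reach the local ZooKeeper server on port 2181:
 
./zkCli.sh -server localhost:2181 ls /brokers/ids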
 
Expected results: the command lists the IDs of the Kafka brokers currently registered with ZooKeeper.
If you have a medium elastic deployment, the result should be: [0, 1, 2]
 
If you have a medium elastic deployment and you see only one or two brokers listed, then some Kafka brokers are having issues (they are down or have disconnected from ZooKeeper).

Recommendations:

a) Check that all Kafka pods are up and running; if you have 3 elastic nodes, you should have 3 Kafka pods.
 
kubectl get pods -n<dxi-namespace> | grep kafka
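For reference, a healthy deployment with 3 brokers would show something like the following (pod names are illustrative and will differ in your environment):
 
jarvis-kafka-0    1/1    Running    0    5d
jarvis-kafka-1    1/1    Running    0    5d
jarvis-kafka-2    1/1    Running    0    5d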
 

b) Restart the problematic Kafka pods:

- Find out which Kafka pods need a restart:
(In OpenShift) Go to each of the Kafka pods > Environment tab and check the BROKER_ID variable; it tells you which Kafka pod corresponds to which broker ID (for example, broker 2).

(In Kubernetes): kubectl describe po <kafka-pod> -n <namespace>
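As a shortcut, you can filter the describe output for the broker ID variable; a minimal sketch, assuming BROKER_ID is listed in the pod's environment section:
 
kubectl describe po <kafka-pod> -n <namespace> | grep BROKER_ID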
 
 
- Once you have identified the problematic pods, delete them so that Kubernetes recreates them:
 
(In OpenShift) click Actions > Delete
(In Kubernetes): kubectl delete po <kafka-pod> -n <namespace>
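Kubernetes recreates a deleted pod automatically; you can watch the replacement come back to Running with, for example:
 
kubectl get pods -n <dxi-namespace> -w | grep kafka
 
Once all Kafka pods are Running again, repeat the ls /brokers/ids check above to confirm that every broker has re-registered with ZooKeeper.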
 
 
 
2) Check the ZooKeeper logs and search for ERROR or WARN entries
 
Zookeeper logs are available from:
 
a) <NFS>/jarvis/zookeeper-logs/zookeeper-<#>/*.log
b) If you are using OpenShift, go to the OpenShift console | Applications | Pods | <zookeeper-pod> | Logs
c) You can use oc or kubectl as below:
kubectl get pods -n <dxi-namespace> | grep zookeeper
kubectl logs <zookeeper-pod> -n <dxi-namespace>
OR
kubectl logs --tail=200 <zookeeper-pod> -n <dxi-namespace>
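To narrow the output down to the relevant entries, you can pipe the logs through grep, for example:
 
kubectl logs <zookeeper-pod> -n <dxi-namespace> | grep -E "ERROR|WARN"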
 
 
Here is an example of the warning logged when the ZooKeeper disk write duration exceeds 1s:

WARN  [SyncThread:3:FileTxnLog@338] - fsync-ing the write ahead log in SyncThread:3 took 16313ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
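This warning typically points at slow storage backing the ZooKeeper data directory (often the NFS volume). To check whether a zookeeper pod is affected, you can search its logs for fsync warnings, for example:
 
kubectl logs <zookeeper-pod> -n <dxi-namespace> | grep fsync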
 
 

2. What to collect if the problem persists?

If the problem persists after applying the above checks and recommendations, collect the logs below and contact Broadcom Support (one way to bundle them is sketched after the list):

 
<NFS>/jarvis/apis/logs/<jarvis-apis-pod>/*.log
<NFS>/jarvis/indexer/logs/<jarvis-indexer-pod>/*.log
<NFS>/jarvis/kafka-logs/kafka-<#>/*.log
<NFS>/jarvis/esutils/logs/<jarvis-esutils-pod>/*.log
<NFS>/jarvis/zookeeper-logs/zookeeper-<#>/*.log
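To bundle these directories into a single archive for the support case, a minimal sketch, assuming <NFS> is the path where your NFS share is mounted:
 
tar czf jarvis-logs.tar.gz <NFS>/jarvis/apis/logs <NFS>/jarvis/indexer/logs <NFS>/jarvis/kafka-logs <NFS>/jarvis/esutils/logs <NFS>/jarvis/zookeeper-logs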
 

Additional Information

https://knowledge.broadcom.com/external/article/190815/aiops-troubleshooting-common-issues-and.html