IMPORTANT: Kafka nodes/brokers should always be connected to zookeeper
1) Check if kafka brokers are connected to zookeeper
If you are using Openshift, go to the Openshift console | Applications | Pods | <zookeeper pod> | Terminal
Otherwise, you can open a shell in the zookeeper pod:
kubectl get pods -n<dxi-namespace> | grep zookeeper
kubectl exec -ti <zookeeper-pod> -n<dxi-namespace> -- sh
cd /opt/ca/zookeeper/bin
./zkCli.sh ls /brokers/ids
Expected results: The command lists the IDs of the kafka brokers connected to zookeeper. If you have a medium elastic deployment, the result should be [0, 1, 2], as in the example below:
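For example, with all three brokers registered the last line of the zkCli output is the broker ID list (the connection messages printed above it vary by zookeeper version):
./zkCli.sh ls /brokers/ids
...
[0, 1, 2]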
If you have a medium elastic deployment and you see only 1 or 2 brokers listed, then some kafka brokers are having issues (they are down or have disconnected from zookeeper).
Recommendations:
a) Check that all kafka pods are up and running; if you have 3 elastic nodes, you should have 3 kafka pods.
kubectl get pods -n<dxi-namespace> | grep kafka
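For illustration only (the pod names below are placeholders; the actual names depend on your deployment), a healthy 3-node setup shows three kafka pods in Running state:
kafka-0    1/1    Running    0    2d
kafka-1    1/1    Running    0    2d
kafka-2    1/1    Running    0    2d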
b) Restart the problematic kafka pods:
- Find out which kafka pods are the problematic ones to restart:
(In Openshift) Go to each of the Kafka pods > Environment tab and check the BROKER_ID variable; it tells you which kafka pod corresponds to which broker (for example, broker #2).
(In Kubernetes): kubectl describe po <kafka pod> -n<namespace> and check the BROKER_ID environment variable (see the command sketch after this list).
- Once you have identified the problematic pods:
(In Openshift) click Actions > "Delete". (In Kubernetes): kubectl delete po <kafka pod> -n<namespace>
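A minimal sketch for Kubernetes, assuming BROKER_ID is exposed as a pod environment variable as described above (replace <dxi-namespace> and the pod name in the delete command with your own values). It prints the BROKER_ID of each kafka pod so you can match brokers to pods, then restarts the problematic pod by deleting it:
# List each kafka pod together with its BROKER_ID environment variable
for p in $(kubectl get pods -n<dxi-namespace> -o name | grep kafka); do
  echo "$p: $(kubectl describe $p -n<dxi-namespace> | grep BROKER_ID)"
done
# Delete the pod that maps to the problematic broker so it is recreated
kubectl delete po <problematic-kafka-pod> -n<dxi-namespace>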
2) Check the zookeeper logs and search for ERROR or WARN entries
Zookeeper logs are available from:
a) <NFS>/jarvis/zookeeper-logs/zookeeper-<#>/*.log
b) If you are using Openshift, go to the Openshift console | Applications | Pods | <zookeeper-pod> | Logs
c) You can use oc or kubectl as below:
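For example (a sketch only; substitute your own pod name and namespace, and use oc logs instead of kubectl logs on Openshift):
kubectl logs <zookeeper-pod> -n<dxi-namespace> | grep -E "ERROR|WARN"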
Here is an example of the warning logged when the ZooKeeper disk write (fsync) duration exceeds 1s:
WARN [SyncThread:3:FileTxnLog@338] - fsync-ing the write ahead log in SyncThread:3 took 16313ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
2. What to collect if the problem persists?
If, after applying the above checks and recommendations, the problem persists, collect the following logs and contact Broadcom Support: