DX Platform - Jarvis (Kafka, ZooKeeper, ElasticSearch) Troubleshooting


Article ID: 189119


Products

DX Operational Intelligence
DX Application Performance Management
CA App Experience Analytics

Issue/Introduction

You might see one or more of the following symptoms:

- New alarms don't appear in the OI Console
- ServiceNow ticket creation is not working
- Alarms are not being updated
- New tenants don't appear in Elastic queries; ERROR 400 and 500 in OI Service and Performance Analytics

The following is a high-level list of techniques and suggestions you can employ when troubleshooting performance, display, and configuration issues related to Jarvis, Kafka, and ElasticSearch in OI, AXA, and APM.

Environment

DX Operational Intelligence 1.3.x, 20.x
DX Application Performance Management 11.x, 20.x

Resolution

 

PREREQUISITES

Find out the "APIS_URL" (Jarvis API) and "ES_URL" (ElasticSearch) endpoints:

If DX OI 20.x:

If Kubernetes:  kubectl get ingress -n<dxi-namespace> | grep jarvis

for example: kubectl get ingress -ndxi | grep jarvis
jarvis-apis                           <none>   apis.10.109.32.88.nip.io                         10.109.32.88   80      19d
jarvis-es                             <none>   es.10.109.32.88.nip.io                           10.109.32.88   80      19d

If Openshift:     oc get routes -n<dxi-namespace> | grep jarvis

for example: oc get routes -ndxi | grep jarvis
jarvis-apis-jwkr5                           apis.munqa001493.bpc.broadcom.net                         /                  jarvis-apis                   8080                                          None
jarvis-es-7krrv                             es.munqa001493.bpc.broadcom.net                           /                  jarvis-elasticsearch-lb       9200                                          None

If DX OI 1.3.2:

oc get routes -n<dxi-namespace> | grep apis
oc get routes -n<dxi-namespace> | grep elastic

for example:
oc get routes -ndoi132 | grep apis

jarvis-route-8080              jarvis.lvntest010772.bpc.broadcom.net                                apis                  8080                    None

oc get routes -ndoi132 | grep elastic

es-route-9200                  es.lvntest010772.bpc.broadcom.net                                    elasticsearch         9200                    None
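Once you have both endpoints, it is worth confirming that they respond before going further. A quick check, assuming curl is available where you run it (both should return HTTP 200 if the routes are healthy):

curl -s -o /dev/null -w "%{http_code}\n" http://<APIS_URL>/
curl -s -o /dev/null -w "%{http_code}\n" http://<ES_URL>/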

STEP #1 : Check Jarvis Services Health


1) Make sure all Jarvis services are in green status

Go to http(s)://<APIS_URL>/#/All/get_health
Click "Try it out", then "Execute"
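If no browser can reach the route, the same check can be done with curl. NOTE: the /health path below is inferred from the Swagger operation name (get_health) and may differ in your build:

curl -s http://<APIS_URL>/health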


 
2) If some services are in yellow or red, then check the respective logs:

If OI 20.2:
<NFS>/jarvis/apis/logs/<jarvis-apis-pod>/*.log
<NFS>/jarvis/indexer/logs/<jarvis-indexer-pod>/*.log
<NFS>/jarvis/kafka-logs/kafka-<#>/*.log
<NFS>/jarvis/esutils/logs/<jarvis-esutils-pod>/*.log
<NFS>/jarvis/zookeeper-logs/zookeeper-<#>/*.log
 
If OI 1.3.2:
<NFS>/jarvis/apis-logs/*.log
<NFS>/jarvis/indexer-logs/*.log
<NFS>/jarvis/elasticsearch-logs/*.log
<NFS>/jarvis/kafka-logs/*.log
<NFS>/jarvis/utils-logs/*.log
<NFS>/jarvis/zookeeper-logs/*.log
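To scan all of these logs for problems in one pass from the NFS server, a recursive grep works well (adjust the path to your release's layout):

grep -riE "ERROR|FATAL" --include="*.log" <NFS>/jarvis/ | tail -50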
 
3) On the NFS and ElasticSearch nodes, check whether the problem is related to disk space
 
Execute: df -h
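If the filesystem is nearly full, a per-directory breakdown of the Jarvis share helps identify the biggest consumer:

du -sh <NFS>/jarvis/*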

See also:
DX OI - How to reduce Elastic data retention - NFS or Elastic Drive Full
OI 20.x  : https://knowledge.broadcom.com/external/article/207161
OI 1.3.x : https://knowledge.broadcom.com/external/article/188786
 
 

STEP #2 : Check ElasticSearch Health

 
1) Available queries:
 
Check Elastic status (make sure "status" : "green"):
http(s)://<ES_URL>/_cluster/health?pretty&human
for example: http://es.munqa001493.bpc.broadcom.net/_cluster/health?pretty&human

Display the nodes in the cluster (check memory, CPU, load):
http(s)://<ES_URL>/_cat/nodes?v
for example: http://es.munqa001493.bpc.broadcom.net/_cat/nodes?v

Check for possible errors during allocation, to get an explanation of cluster issues:
http(s)://<ES_URL>/_cluster/allocation/explain?pretty
for example: http://es.munqa001493.bpc.broadcom.net/_cluster/allocation/explain?pretty
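The same queries can be run with curl from any host that can reach the route, which is convenient when no browser is available:

curl -s "http://<ES_URL>/_cluster/health?pretty&human"
curl -s "http://<ES_URL>/_cat/nodes?v"
curl -s "http://<ES_URL>/_cluster/allocation/explain?pretty"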
 
For more troubleshooting query options, see:
 
 
 
2) Check ElasticSearch logs, search for WARN or ERRORs
 
3) Check if the problem is related to Memory
 
Here are some examples of messages indicating frequent garbage collections that affect ES node performance:
 
[2020-08-17T05:53:52,234][WARN ][o.e.m.j.JvmGcMonitorService] [oFOLnGK] [gc][476888] overhead, spent [1.3s] collecting in the last [1.4s]
..
[2020-08-17T05:53:52,230][WARN ][o.e.m.j.JvmGcMonitorService] [oFOLnGK] [gc][young][476888][10243] duration [1.3s], collections [1]/[1.4s], total [1.3s]/[17m], memory [9.5gb]->[3.6gb]/[10gb], all_pools {[young] [5.5gb]->[4mb]/[0b]}{[survivor] [40mb]->[172mb]/[0b]}{[old] [3.9gb]->[3.5gb]/[10gb]}
..
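You can confirm heap pressure without reading the logs by asking ElasticSearch for per-node heap and load figures (standard _cat/nodes columns):

curl -s "http://<ES_URL>/_cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m"

Nodes that stay above roughly 85% heap are candidates for the memory increase described below.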

Recommendations: you have 2 options:

a) As a workaround, restart the ElasticSearch pods

- Go to Openshift > Application > pods
- Locate the ElasticSearch pods
- Delete each pod -- new ones will be created.

b) Increase memory on each of the ElasticSearch deployments

- Go to Openshift > Application > deployments
- Locate the ElasticSearch deployments
- Click Actions > Edit Resource Limits

- Increase "Limit" by half (NOTE: make sure you have enough memory available on the Elastic server; you can check with: free -m)
- Click Save -- a new pod will be created

4) Check if the problem is related to Disk space

Here are some examples of messages indicating a disk space problem: ElasticSearch has switched to read-only mode because of the disk capacity issue.

[16]: index [jarvis_jmetrics_1.0_1], type [_doc], id [3f1f4151-4b4a-4a04-bde1-4ae610358e81], message [ElasticsearchException[Elasticsearch exception [type=cluster_block_exception, reason=index [jarvis_jmetrics_1.0_1] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];]]]

Recommendations: you have 2 options:

a) Increase disk size in ElasticSearch nodes

b) Reduce data retention, delete Elastic backups, or manually delete unnecessary Elastic indices; see:

DX AIOps - NFS or Elastic Nodes disk full - How to reduce data retention
OI 20.x  : https://knowledge.broadcom.com/external/article/207161
OI 1.3.x : https://knowledge.broadcom.com/external/article/188786
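After freeing disk space, note that older ElasticSearch versions do not automatically lift the read-only block on the affected indices; you can remove it from all indices with the standard settings API:

curl -XPUT "http://<ES_URL>/_all/_settings" -H 'Content-Type: application/json' -d '{"index.blocks.read_only_allow_delete": null}'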

 

STEP #3 : Check Zookeeper to Kafka connectivity

 
IMPORTANT: Kafka nodes/brokers should always be connected to zookeeper
 
1) Check if kafka brokers are connected to zookeeper

If you are using Openshift, go to the Openshift console | Applications | Pods | <zookeeper pod> | Terminal
Otherwise, you can exec into the zookeeper pod:

kubectl get pods -n<dxi-namespace> | grep zookeeper
kubectl exec -ti <zookeeper-pod> -n<dxi-namespace> -- sh
 
sh /opt/ca/zookeeper/bin/zkCli.sh
ls /brokers/ids
 
Expected results: you should see all your kafka brokers connected to zookeeper.
If you have 3 kafka brokers, the result should be: [1, 2, 3]
 
 
If the result is [1], then kafka brokers 2 and 3 are down or have disconnected from zookeeper.
If the result is [1, 2], then kafka broker 3 is the problematic node.
If the result is [2], then brokers 1 and 3 are the problematic ones, etc.
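The same check can be scripted without an interactive session, assuming your zkCli build accepts a command as an argument:

kubectl exec <zookeeper-pod> -n<dxi-namespace> -- /opt/ca/zookeeper/bin/zkCli.sh -server localhost:2181 ls /brokers/ids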
 
Recommendations:

a) Check that all kafka pods are up and running; if you have 3 elastic nodes, you should have 3 kafka pods.
 
kubectl get pods -n<dxi-namespace> | grep kafka
 
b) Restart the problematic kafka pods:

- Find out which kafka pods are the problematic ones to restart:
Go to each of the Kafka pods > Environment tab and check the BROKER_ID variable.

Below is an example illustrating which kafka pod corresponds to broker #2
 
- Once you have identified the problematic pods, click Actions > "Delete".
 
c) If you are using DOI 1.3.x:
 
Add the below 2 properties to minimize kafka-to-zookeeper connectivity issues.
NOTE: these properties are already part of OI 20.2

- Go to Openshift console | Applications | Deployments | <kafka deployment> | Environment tab
 
- Add:
kafkaserverprops_unclean_leader_election_enable=true   (this makes sure every partition is assigned a leader after a controller change)
kafkaserverprops_zookeeper_session_timeout_ms=30000   (this makes Kafka try to connect to zookeeper for at least 30s rather than the default 6s)
 
 
- Save the changes, new pods will be created
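If you prefer the command line to the Openshift console, the same variables can be set with oc (the kafka deployment name below is an example; check yours first, and use dc/<name> instead if kafka runs as a DeploymentConfig):

oc set env deployment/<kafka-deployment> -n<dxi-namespace> kafkaserverprops_unclean_leader_election_enable=true kafkaserverprops_zookeeper_session_timeout_ms=30000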
 
2) Check the zookeeper logs,  search for: ERROR or WARN
 
Zookeeper logs are available from:
 
a) <NFS>/jarvis/zookeeper-logs/zookeeper-<#>/*.log
b) If you are using Openshift, go to the Openshift console | Applications | Pods | <zookeeper-pod> | Logs
c) You can use oc or kubectl as below:
kubectl get pods -n<dxi-namespace> | grep zookeeper
kubectl logs <zookeeper-pod> -n<dxi-namespace>
OR
kubectl logs --tail=200 <zookeeper-pod> -n<dxi-namespace> 
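For example, to pull only the suspicious lines:

kubectl logs <zookeeper-pod> -n<dxi-namespace> | grep -E "ERROR|WARN"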
 
 
Here is an example when the ZooKeeper disk write duration exceeds 1s:

WARN  [SyncThread:3:[email protected]] - fsync-ing the write ahead log in SyncThread:3 took 16313ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
 
 

STEP #4 : Check Jarvis Kafka

 
1) Check the kafka logs,  search for: ERROR or WARN
 
Kafka logs are available from:
 
a) <NFS>/jarvis/kafka-logs/kafka-<#>/*.log
b) If you are using Openshift, go to the Openshift console | Applications | Pods | <kafka pod> | Logs
c) You can use oc or kubectl as below:
kubectl get pods -n<dxi-namespace> | grep kafka
kubectl logs <jarvis-kafka-pod> -n<dxi-namespace>
OR
kubectl logs --tail=200 <jarvis-kafka-pod> -n<dxi-namespace> 
 
Here is an example when kafka is not able to send the heartbeat on time, affecting the kafka-to-zookeeper connectivity:
 
INFO  [main-SendThread(zookeeper.doigivaudan.svc.cluster.local:2181):[email protected]] - Client session timed out, have not heard from server in 20010ms for sessionid 0x1709fbc4e26000a, closing socket connection and attempting reconnect
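Session timeouts like the one above can be spotted quickly with a filtered log read:

kubectl logs --tail=500 <jarvis-kafka-pod> -n<dxi-namespace> | grep -iE "timed out|zookeeper"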
 
2) Find out which product or feature is causing the high data ingestion:
 
If you are using Openshift, go to the Openshift console | Applications | Pods | <kafka pod> | Terminal
Otherwise, you can exec into the kafka pod:

kubectl get pods -n<dxi-namespace> | grep kafka
kubectl exec -ti <kafka-pod> -n<dxi-namespace> -- sh
 
cd /opt/ca/kafka/data
du -m . | sort -n -r -k1 | sed 's/\s/{/g' | sed "s/-[0-9]*$//g" | awk -F '{' '{s[$2]+=$1 } END{for (i in s) print s[i], i}' | sort -n -k1 -r
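This one-liner sums the per-partition directory sizes (in MB) by topic: the sed expressions strip the trailing -<partition> suffix, and awk totals the sizes per topic name, so the topics at the top of the output point to the product or feature ingesting the most data.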
 
 
3) Check if there is a LAG processing the messages
 
You can use the commands below to troubleshoot issues related to topics and consumer groups.
Update "<group_name>" according to the issue you are troubleshooting.
 
DX OI 20.2+:
 
List all available topics:
/opt/ca/kafka/bin/kafka-topics.sh --zookeeper jarvis-zookeeper:2181 --list

List all consumer groups:
/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --list

Describe a consumer group and list its offset lag (number of messages not yet processed). Recommendation: verify that the LAG column is not always > 0:
/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --describe --group <group_name>

for example:
/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --describe --group jarvis_indexer
/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --describe --group verifier

DX  OI 1.3.x:

List all available topics:
/opt/ca/kafka/bin/kafka-topics.sh --zookeeper jarvis-zookeeper:2181 --list

List all consumer groups:
/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server kafka:9092,kafka-2:9092,kafka-3:9092 --list

Describe a consumer group and list its offset lag (number of messages not yet processed). Recommendation: verify that the LAG column is not always > 0:
/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server kafka:9092,kafka-2:9092,kafka-3:9092 --describe --group <group_name>

for example:
/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server kafka:9092,kafka-2:9092,kafka-3:9092 --describe --group jarvis_indexer

Here is an example illustrating a LAG issue affecting jarvis_indexer:
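Lag is only a concern when it keeps growing. If watch is available in the container, you can monitor a group over time (20.2 syntax shown, using the jarvis_indexer group from above; otherwise re-run the describe command a few times and compare the LAG column):

watch -n 10 "/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --describe --group jarvis_indexer"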

 

Examples:


Use-Case #1:

DX OI - New alarms don't appear in OI Console
https://knowledge.broadcom.com/external/article/189463

Use-Case #2:

DX OI - ServiceNow Automatic incident creation/closure is not working
https://knowledge.broadcom.com/external/article/198666

Additional Information


DX AIOps - Troubleshooting, Common Issues and Best Practices
https://knowledge.broadcom.com/external/article/190815/dx-oi-troubleshooting-common-issues-and.html
