DX Platform - Jarvis (Kafka, ZooKeeper, Elasticsearch) Troubleshooting


Article ID: 189119


Products

DX Operational Intelligence
DX Application Performance Management
CA App Experience Analytics

Issue/Introduction

Symptoms:

- New alarms don't appear in the OI Console
- ServiceNow ticket creation is not working
- Alarms are not getting updated
- New tenants don't appear in Elastic queries; ERROR 400 and 500 in OI Service and Performance Analytics

The following is a high-level list of techniques and suggestions to employ when troubleshooting common Jarvis (Kafka, ZooKeeper, Elasticsearch) performance and configuration issues.

A) Checklist 
B) What diagnostic files should I gather for CA Support?

 

Environment

DX Operational Intelligence 1.3.x, 20.x
DX Application Performance Management 11.x, 20.x

Resolution

 

A) Checklist

 

OVERVIEW

 

PREREQUISITES

Find out the  "APIS_URL" (Jarvis API) and "ES_URL" (ElasticSearch) endpoints

If DX OI 20.x:

If Kubernetes:  kubectl get ingress -n<dxi-namespace> | grep jarvis

for example: kubectl get ingress -ndxi | grep jarvis
jarvis-apis                           <none>   apis.10.109.32.88.nip.io                        10.109.32.88   80      19d
jarvis-es                             <none>   es.10.109.32.88.nip.io                           10.109.32.88   80      19d

If Openshift:     oc get routes -n<dxi-namespace> | grep jarvis

for example: oc get routes -ndxi | grep jarvis
jarvis-apis-jwkr5                           apis.munqa001493.bpc.broadcom.net                         /                  jarvis-apis                   8080                                          None
jarvis-es-7krrv                             es.munqa001493.bpc.broadcom.net                          /                  jarvis-elasticsearch-lb       9200                                          None

If DX OI 1.3.2:

oc get routes -n<dxi-namespace> | grep apis
oc get routes -n<dxi-namespace> | grep elastic

for example:
oc get routes -ndoi132 | grep apis

jarvis-route-8080             jarvis.lvntest010772.bpc.broadcom.net                                apis                  8080                    None

oc get routes -ndoi132 | grep elastic

es-route-9200                  es.lvntest010772.bpc.broadcom.net                                    elasticsearch         9200                    None
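
Before going through the checks, you can quickly confirm that both endpoints respond (a curl sketch; replace the placeholders with the hostnames found above):

curl -sI "http://<APIS_URL>/" | head -1
curl -s "http://<ES_URL>/"

The first command should return an HTTP status line from the Jarvis APIs ingress/route, and the second should return the Elasticsearch version banner in JSON.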

CHECK #1 : Jarvis Health


1) Make sure all Jarvis services are in green status

Go to http(s)://<APIS_URL>/#/All/get_health
Click, "Try it out", "Execute"
 
Expected Result: All jarvis services should report status = green


 
 
If "elasticsearch" report status = red (as below), then check "Elastic Health", see next section:
 
 
 

CHECK #2 : ElasticSearch Health

 
1) Execute the queries below:
 
Check Elastic status (make sure "status" : "green")
  Syntax:  http(s)://<ES_URL>/_cluster/health?pretty&human
  Example: http://es.munqa001493.bpc.broadcom.net/_cluster/health?pretty&human

Check affected indices due to unassigned shards
  Syntax:  http(s)://<ES_URL>/_cat/shards?v&h=n,i,s,dc,pr,cds,iiti,st
  Example: http://es.munqa001493.bpc.broadcom.net/_cat/shards?v&h=n,i,s,dc,pr,cds,iiti,st

Display the nodes in the cluster (check memory, cpu, load)
  Syntax:  http(s)://<ES_URL>/_cat/nodes?v
  Example: http://es.munqa001493.bpc.broadcom.net/_cat/nodes?v

Check for possible errors during allocation, to get an explanation of cluster issues
  Syntax:  http(s)://<ES_URL>/_cluster/allocation/explain?pretty
  Example: http://es.munqa001493.bpc.broadcom.net/_cluster/allocation/explain?pretty
 
For more Elastic query options see DX AIOps - ElasticSearch Queries
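
These are standard Elasticsearch REST endpoints, so they can also be run with curl from any host that can reach <ES_URL> (a sketch; add authentication or https options if your environment requires them):

curl -s "http://<ES_URL>/_cluster/health?pretty&human"
curl -s "http://<ES_URL>/_cat/shards?v&h=n,i,s,dc,pr,cds,iiti,st"
curl -s "http://<ES_URL>/_cat/nodes?v"
curl -s "http://<ES_URL>/_cluster/allocation/explain?pretty"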
 
Recommended check:
 
a) Check whether there are unassigned_shards (as below); run: http(s)://<ES_URL>/_cluster/health?pretty&human
 
 
b) If unassigned_shards is > 0, run the 2 queries below:
 
http(s)://<ES_URL>/_cat/shards?v&h=n,i,s,dc,pr,cds,iiti,st
 
This will give you more details about the affected indices:
 
 
http(s)://<ES_URL>/_cluster/allocation/explain?pretty
 
This will explain why the allocation failed:
 
 
Solution:

a) Open Postman

b) Run the POST REST call below to retry allocation of the failed shards; leave the body empty

http(s)://<ES_URL>/_cluster/reroute?retry_failed=true
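
The same call can be made with curl instead of Postman (a sketch; no request body is needed):

curl -s -XPOST "http://<ES_URL>/_cluster/reroute?retry_failed=true"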

 
 
2) Check the Elasticsearch logs and search for WARN or ERROR entries; a common issue is lack of memory:
 
The messages below indicate frequent garbage collections, which degrade Elasticsearch performance:
 
[2020-08-17T05:53:52,234][WARN ][o.e.m.j.JvmGcMonitorService] [oFOLnGK] [gc][476888] overhead, spent [1.3s] collecting in the last [1.4s]
..
[2020-08-17T05:53:52,230][WARN ][o.e.m.j.JvmGcMonitorService] [oFOLnGK] [gc][young][476888][10243] duration [1.3s], collections [1]/[1.4s], total [1.3s]/[17m], memory [9.5gb]->[3.6gb]/[10gb], all_pools {[young] [5.5gb]->[4mb]/[0b]}{[survivor] [40mb]->[172mb]/[0b]}{[old] [3.9gb]->[3.5gb]/[10gb]}
..

Recommendations: you have 2 options:

a) Restart the ElasticSearch pods

- Go to Openshift > Application > pods
- Locate the ElasticSearch pods
- Delete each pod -- new ones will be created.
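
If you prefer the command line, the equivalent with kubectl is (a sketch; use the pod names returned by the first command):

kubectl get pods -n<dxi-namespace> | grep elasticsearch
kubectl delete pod <elasticsearch-pod> -n<dxi-namespace>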

b) Increase memory on each of the ElasticSearch deployments

- Go to Openshift > Application > deployments
- Locate the ElasticSearch deployments
- Click Actions > Edit Resource Limits

- Increase "Limit" by half  (NOTE: make sure you have enough memory available in the elastic server, you can use: free -h
- Click Save -- a new pod will be created
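
The same change can also be sketched with kubectl; the deployment name and the 16Gi value below are placeholders, so keep the new limit consistent with what is actually available on the node (check with free -h):

kubectl get deployments -n<dxi-namespace> | grep elasticsearch
kubectl set resources deployment <elasticsearch-deployment> --limits=memory=16Gi -n<dxi-namespace>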

3) Check for disk space issues in Elastic nodes and NFS

Here is an example of a message indicating a problem with disk space; Elasticsearch has put the index into read-only mode because of it:

[16]: index [jarvis_jmetrics_1.0_1], type [_doc], id [3f1f4151-4b4a-4a04-bde1-4ae610358e81], message [ElasticsearchException[Elasticsearch exception [type=cluster_block_exception, reason=index [jarvis_jmetrics_1.0_1] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];]]]

Recommendations: you have 2 options:

a) Increase disk size in ElasticSearch nodes

b) Reduce data retention, delete Elastic backups, or manually delete some unnecessary Elastic indices, see:

DX AIOps - NFS or Elastic Nodes disk full - How to reduce data retention
OI 20.x  : https://knowledge.broadcom.com/external/article/207161
OI 1.3.x : https://knowledge.broadcom.com/external/article/188786
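
NOTE: once disk space has been freed, indices may remain read-only until the block is cleared (recent Elasticsearch versions clear it automatically when disk usage drops below the flood-stage watermark). If needed, the block can be removed with the standard Elasticsearch settings API, for example:

curl -s -XPUT "http://<ES_URL>/_all/_settings" -H 'Content-Type: application/json' -d '{"index.blocks.read_only_allow_delete": null}'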

 

CHECK #3 : Zookeeper to Kafka connectivity

 
IMPORTANT: Kafka nodes/brokers should always be connected to zookeeper
 
1) Check if kafka brokers are connected to zookeeper
 
If you are using Openshift, go to the Openshift console | Applications | Pods | <zookeeper pod> | Terminal
Otherwise, you can open a shell in the zookeeper pod:

kubectl get pods -n<dxi-namespace> | grep zookeeper
kubectl exec -ti <zookeeper-pod> -n<dxi-namespace> -- sh
 
sh /opt/ca/zookeeper/bin/zkCli.sh
ls /brokers/ids
 
Expected results: You should see all your kafka nodes connected to zookeeper 
If you have 3 kafka brokers (for example, in a 3-node deployment), the result should be [1, 2, 3], as below:
 
 
If the result is [1], then it means that kafka brokers 2 and 3 are down or have disconnected from zookeeper.
If the result is [1, 2], then kafka 3 is the problematic node.
If the result is [2], then it means that brokers 1 and 3 are the problematic ones, etc.
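 
The same check can also be run non-interactively from outside the pod (a sketch; zkCli.sh accepts a single command as arguments and connects to localhost:2181 by default):
 
kubectl exec <zookeeper-pod> -n<dxi-namespace> -- sh /opt/ca/zookeeper/bin/zkCli.sh ls /brokers/ids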
 
Recommendations:

a) Check that all kafka pods are up and running; if you run 3 kafka brokers, you should have 3 kafka pods.
 
kubectl get pods -n<dxi-namespace> | grep kafka
 
b) Restart the problematic kafka pods:

-Find out which are the problematic kafka pods to restart:
Go to each of the Kafka pods > Environment Tab, check the BROKER_ID variable. 

Below is an example illustrating which kafka pod corresponds to broker #2
 
- Once you have identified the problematic pods, click Actions > "Delete".
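 
The BROKER_ID of each kafka pod can also be read from the command line (a sketch; it assumes the container image provides the env command):
 
kubectl get pods -n<dxi-namespace> | grep kafka
kubectl exec <kafka-pod> -n<dxi-namespace> -- env | grep BROKER_ID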
 
c) If you are using DOI 1.3.x:
 
Add the 2 properties below to minimize kafka-to-zookeeper connectivity issues.
NOTE: these properties are already part of OI 20.2 

- Go to Openshift console | Applications | Deployments | <kafka deployment> | Environment tab
 
- Add:
kafkaserverprops_unclean_leader_election_enable=true   (this ensures every partition is assigned a leader after the controller changes)
kafkaserverprops_zookeeper_session_timeout_ms=30000   (this makes Kafka try to connect to ZooKeeper for at least 30s instead of the default 6 seconds)
 
 
- Save the changes, new pods will be created
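 
If you prefer the command line, the same two variables can be set with oc (a sketch; <kafka-deployment> is a placeholder, and in 1.3.x the kafka workload may be a DeploymentConfig, in which case use dc/<name> instead):
 
oc set env deployment/<kafka-deployment> kafkaserverprops_unclean_leader_election_enable=true kafkaserverprops_zookeeper_session_timeout_ms=30000 -n<dxi-namespace>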
 
2) Check the zookeeper logs,  search for: ERROR or WARN
 
Zookeeper logs are available from:
 
a) <NFS>/jarvis/zookeeper-logs/zookeeper-<#>/*.log
b) If you are using Openshift, go to the Openshift console | Applications | Pods | <zookeeper-pod> | Logs
c) You can use oc or kubectl as below:
kubectl get pods -n<dxi-namespace> | grep zookeeper
kubectl logs <zookeeper-pod> -n<dxi-namespace>
OR
kubectl logs --tail=200 <zookeeper-pod> -n<dxi-namespace> 
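 
To filter directly for problem entries (a one-line sketch):
 
kubectl logs --tail=2000 <zookeeper-pod> -n<dxi-namespace> | egrep "ERROR|WARN"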
 
 
Here is an example where the ZooKeeper disk write duration exceeds 1 second:

WARN  [SyncThread:3:[email protected]] - fsync-ing the write ahead log in SyncThread:3 took 16313ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
 
 

CHECK #4 : Kafka

 
1) Check the kafka logs,  search for: ERROR or WARN
 
Kafka logs are available from:
 
a) <NFS>/jarvis/kafka-logs/kafka-<#>/*.log
b) If you are using Openshift, go to the Openshift console | Applications | Pods | <kafka pod> | Logs
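c) Or with kubectl, mirroring the zookeeper commands above (a sketch):
 
kubectl get pods -n<dxi-namespace> | grep kafka
kubectl logs --tail=2000 <kafka-pod> -n<dxi-namespace> | egrep "ERROR|WARN"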
 
Here is an example where kafka is not able to send the heartbeat on time, affecting kafka-to-zookeeper connectivity:
 
INFO  [main-SendThread(zookeeper.doigivaudan.svc.cluster.local:2181):[email protected]] - Client session timed out, have not heard from server in 20010ms for sessionid 0x1709fbc4e26000a, closing socket connection and attempting reconnect
 
 
 
2) Check whether there is a LAG processing the messages
 
 
a) If you are using Openshift, go to the Openshift console | Applications | Pods | <kafka pod> | Terminal
Otherwise, you can open a shell in any of the kafka pods:

kubectl get pods -n<dxi-namespace> | grep kafka
kubectl exec -ti <kafka-pod> -n<dxi-namespace> -- sh
 
b) Execute the commands below to identify whether there is a LAG:
 
DX OI 20.2+:
 
List all available topics:

/opt/ca/kafka/bin/kafka-topics.sh --zookeeper jarvis-zookeeper:2181 --list

/opt/ca/kafka/bin/kafka-topics.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --list

/opt/ca/kafka/bin/kafka-topics.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --describe 

List all consumer groups:

/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --list

Check for a possible LAG in jarvis (Recommendation: verify that the LAG column is not always > 0):

/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --describe --group jarvis_indexer

/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --describe --group indexer

/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --describe --group verifier

 

DX  OI 1.3.x:

List all available topics:
/opt/ca/kafka/bin/kafka-topics.sh --zookeeper jarvis-zookeeper:2181 --list

List all consumer groups:
/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server kafka:9092,kafka-2:9092,kafka-3:9092 --list

Check for a possible LAG in jarvis (Recommendation: verify that the LAG column is not always > 0):
/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server kafka:9092,kafka-2:9092,kafka-3:9092 --describe --group jarvis_indexer
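
To make the LAG easier to read across many partitions, the --describe output can be summarized with a small shell sketch (the awk part is not shipped with the product; it finds the LAG column from the header line and prints the total; shown with the 1.3.x broker names, use the jarvis-kafka names on 20.x):

/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server kafka:9092,kafka-2:9092,kafka-3:9092 --describe --group jarvis_indexer \
  | awk '/LAG/ {for (i=1; i<=NF; i++) if ($i=="LAG") c=i; next} c && $c ~ /^[0-9]+$/ {total+=$c} END {print "total LAG:", total+0}'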

 

Here is an example illustrating a LAG condition:

Here is an example illustrating the consumers disconnection condition:

Recommendation:  Restart jarvis services as below:

Scale down:
- jarvis-verifier                
- jarvis-lean-jarvis-indexer 
- jarvis-indexer

Scale up:
- jarvis-verifier                
- jarvis-lean-jarvis-indexer 
- jarvis-indexer

Below is the list of kubectl commands:

a) Scale down the following deployments:

kubectl scale --replicas=0 deployment jarvis-verifier -n<namespace>
kubectl scale --replicas=0 deployment jarvis-lean-jarvis-indexer  -n<namespace>
kubectl scale --replicas=0 deployment jarvis-indexer -n<namespace>

b) Verify that all pods are down:

kubectl get pods -n<namespace> | egrep "jarvis-verifier|jarvis-lean|jarvis-indexer"

c) Scale up deployments 

kubectl scale --replicas=1 deployment jarvis-verifier -n<namespace>
kubectl scale --replicas=1 deployment jarvis-lean-jarvis-indexer -n<namespace>
kubectl scale --replicas=1 deployment jarvis-indexer -n<namespace>

d) Verify that all pods are up and running:

kubectl get pods -n<namespace> | egrep "jarvis-verifier|jarvis-lean|jarvis-indexer"

e) Verify that alarms and ServiceNow incidents are reported as expected

 

B) What to collect if the problem persists?

If the problem persists after applying the above recommendations, collect the logs below and contact Broadcom Support:

 
If OI 20.2:
<NFS>/jarvis/apis/logs/<jarvis-apis-pod>/*.log
<NFS>/jarvis/indexer/logs/<jarvis-indexer-pod>/*.log
<NFS>/jarvis/kafka-logs/kafka-<#>/*.log
<NFS>/jarvis/esutils/logs/<jarvis-esutils-pod>/*.log
<NFS>/jarvis/zookeeper-logs/zookeeper-<#>/*.log
 
If OI 1.3.2:
<NFS>/jarvis/apis-logs/*.log
<NFS>/jarvis/indexer-logs/*.log
<NFS>/jarvis/elasticsearch-logs/*.log
<NFS>/jarvis/kafka-logs/*.log
<NFS>/jarvis/utils-logs/*.log
<NFS>/jarvis/zookeeper-logs/*.log
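
A sketch for packaging these logs before opening a case (OI 20.2 layout shown; replace <NFS> with the actual mount point and adjust the directories for 1.3.2):

tar -czf /tmp/jarvis-logs-$(date +%Y%m%d).tar.gz \
  <NFS>/jarvis/apis/logs \
  <NFS>/jarvis/indexer/logs \
  <NFS>/jarvis/kafka-logs \
  <NFS>/jarvis/esutils/logs \
  <NFS>/jarvis/zookeeper-logs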

Additional Information


DX AIOps - Troubleshooting, Common Issues and Best Practices
https://knowledge.broadcom.com/external/article/190815/dx-oi-troubleshooting-common-issues-and.html
