
AIOps - Jarvis (kafka, zookeeper, elasticSearch) Troubleshooting


Article ID: 189119


Products

DX Operational Intelligence
DX Application Performance Management
CA App Experience Analytics

Issue/Introduction

Symptoms:

- New alarms don't appear in the OI Console
- ServiceNow ticket creation is not working
- Alarms are not getting updated
- New tenants don't appear in Elastic queries; ERROR 400 and 500 in OI Service and Performance Analytics

The following is a high-level list of techniques and suggestions to employ when troubleshooting common Jarvis (Kafka, ZooKeeper, Elasticsearch) performance and configuration issues.

A) Checklist 
B) What diagnostic files should I gather for Broadcom Support?

 

Environment

DX Operational Intelligence 20.x, 21.x
DX Application Performance Management 20.x, 21.x

Resolution

 

A) Checklist

 

OVERVIEW

 

CHECK #1 : Jarvis Health


1) Make sure ALL Jarvis services are in green status
 
 
If 21.3.x
 
Option 1:
 

-Locate the doi-nginx endpoint:


kubectl get ingress -ndxi | grep nginx


doi-nginx-ingress                     doi-nginx.lvntest010772.bpc.broadcom.net                            80        89d

 

-Connect to : http(s)://doi-nginx.<endpoint>/health
 
For example: https://doi-nginx.10.109.32.88.nip.io/health
 
 
Verify that all jarvis services are in green status
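 
If you prefer the command line, the same endpoint can be queried with curl (a sketch; the hostname is the example endpoint shown above, and -k skips certificate validation on self-signed setups):
 
curl -sk https://doi-nginx.10.109.32.88.nip.io/health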
 
 
Option 2
 
- Connect to any kafka pod
 
   kubectl exec -ti <jarvis-kafka-pod> sh -n<namespace>
 
- Query the health page:
 
   curl -XGET 'http://jarvis-apis:8080/health'
 
 
NOTE: Jarvis APIS and ElasticSearch route/ingress endpoints are no longer available; however, you can re-create them as explained in https://knowledge.broadcom.com/external/article/226870
 
 
If 20.2.x:
 

1) Find out the  Jarvis API endpoint:

If Kubernetes:  kubectl get ingress -n<dxi-namespace> | grep jarvis

for example:

kubectl get ingress -ndxi | grep jarvis-apis
jarvis-apis                           <none>   apis.10.109.32.88.nip.io                        10.109.32.88   80      19d

If Openshift:     oc get routes -n<dxi-namespace> | grep jarvis

for example:

oc get routes -ndxi | grep jarvis-apis
jarvis-apis-jwkr5                           apis.munqa001493.bpc.broadcom.net                         /                  jarvis-apis                   8080                                          None


2) Go to http(s)://<APIS_URL>/#/All/get_health
 
Click, "Try it out", "Execute"
 
All jarvis services should report status = green
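 
Alternatively, the same health endpoint can be queried from a terminal with curl (a sketch, using the example APIS endpoint from step 1):
 
curl -XGET 'http://apis.10.109.32.88.nip.io/health'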


 
 
 

CHECK #2 : ElasticSearch Health

 
If 21.3.x
 
- Connect to any kafka pod
 
   kubectl exec -ti <jarvis-kafka-pod> sh -n<namespace>
 
- Query elastic, for example:
 
   curl -XGET 'http://jarvis-elasticsearch-lb:9200/_cluster/health?pretty&human' |sort
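 
   In the response, the two fields to watch are "status" (it should be "green") and "unassigned_shards" (it should be 0). A quick way to pull just those fields (a sketch):
 
   curl -s 'http://jarvis-elasticsearch-lb:9200/_cluster/health?pretty' | egrep 'status|unassigned_shards'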
 
 
NOTE: Jarvis APIS and ElasticSearch route/ingress endpoints are no longer available; however, you can re-create them as explained in https://knowledge.broadcom.com/external/article/226870
 
 
If 20.2.x
 

Find out the ElasticSearch endpoint:

If Kubernetes:  kubectl get ingress -n<dxi-namespace> | grep jarvis-es

for example:

kubectl get ingress -ndxi | grep jarvis
jarvis-es                             <none>   es.10.109.32.88.nip.io             10.109.32.88   80      19d

If Openshift:     oc get routes -n<dxi-namespace> | grep jarvis-es

for example:

oc get routes -ndxi | grep jarvis
jarvis-es-7krrv                             es.munqa001493.bpc.broadcom.net                /                  jarvis-elasticsearch-lb       9200                                          None

 

Run the below health queries:
 
Check Elastic status (make sure "status" : "green")
Syntax:  http(s)://<ES_URL>/_cluster/health?pretty&human
Example: http://es.munqa001493.bpc.broadcom.net/_cluster/health?pretty&human
 
Check affected indices due to unassigned shards
Syntax:  http(s)://<ES_URL>/_cat/shards?v&h=n,i,s,dc,pr,cds,iiti,st
Example: http://es.munqa001493.bpc.broadcom.net/_cat/shards?v&h=n,i,s,dc,pr,cds,iiti,st
 
Display the nodes in the cluster (check memory, cpu, load)
Syntax:  http(s)://<ES_URL>/_cat/nodes?v
Example: http://es.munqa001493.bpc.broadcom.net/_cat/nodes?v
 
Check for possible errors during allocation, to get an explanation of cluster issues
Syntax:  http(s)://<ES_URL>/_cluster/allocation/explain?pretty
Example: http://es.munqa001493.bpc.broadcom.net/_cluster/allocation/explain?pretty
 
For more Elastic query options see DX AIOps - ElasticSearch Queries
 
Recommended check:
 
a) Check if there are unassigned_shards: run http(s)://<ES_URL>/_cluster/health?pretty&human and look at the "unassigned_shards" value in the response
 
 
b) If unassigned_shards is > 0, run below 2 queries:
 
http(s)://<ES_URL>/_cat/shards?v&h=n,i,s,dc,pr,cds,iiti,st
 
it will give you more details about the affected indices:
 
 
http(s)://<ES_URL>/_cluster/allocation/explain?pretty
 
it will give you more details on why the allocation failed
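 
These queries can also be run with curl; for example, to list only the problem shards (a sketch; UNASSIGNED is the shard state reported by the _cat/shards API):
 
curl -s 'http://<ES_URL>/_cat/shards?v' | grep UNASSIGNED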
 
 
Solution:

a) Open Postman

b) Run the below POST REST call to reassign failed shards; leave the body empty

http(s)://<ES_URL>/_cluster/reroute?retry_failed=true
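 
If Postman is not available, the same call can be made with curl from any host that can reach Elasticsearch (a sketch; the body is intentionally left empty):
 
curl -XPOST 'http://<ES_URL>/_cluster/reroute?retry_failed=true'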

 
 
2) Check the ElasticSearch logs and search for WARN or ERROR entries; a common issue is lack of memory:
 
The messages below indicate frequent garbage collections being executed, which affect ES performance:
 
[2020-08-17T05:53:52,234][WARN ][o.e.m.j.JvmGcMonitorService] [oFOLnGK] [gc][476888] overhead, spent [1.3s] collecting in the last [1.4s]
..
[2020-08-17T05:53:52,230][WARN ][o.e.m.j.JvmGcMonitorService] [oFOLnGK] [gc][young][476888][10243] duration [1.3s], collections [1]/[1.4s], total [1.3s]/[17m], memory [9.5gb]->[3.6gb]/[10gb], all_pools {[young] [5.5gb]->[4mb]/[0b]}{[survivor] [40mb]->[172mb]/[0b]}{[old] [3.9gb]->[3.5gb]/[10gb]}
..

Recommendations: you have 2 options:

a) Restart the ElasticSearch pods

- Go to Openshift > Application > pods
- Locate the ElasticSearch pods
- Delete each pod -- new ones will be created.

b) Increase memory on each of the ElasticSearch deployments

- Go to Openshift > Application > deployments
- Locate the ElasticSearch deployments
- Click Actions > Edit Resource Limits

- Increase "Limit" by half  (NOTE: make sure you have enough memory available in the elastic server, you can use: free -h
- Click Save -- a new pod will be created
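 
If you manage the cluster with kubectl rather than the OpenShift console, the same change can be made on the deployment directly (a sketch; the deployment name and the 16Gi value are placeholders, adjust them to your sizing):
 
kubectl set resources deployment <jarvis-elasticsearch-deployment> --limits=memory=16Gi -n<namespace>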

3) Check for disk space issues in Elastic nodes and NFS

Here are some examples of messages indicating a problem with disk space; ElasticSearch goes into read-only mode because of the disk issue:

[16]: index [jarvis_jmetrics_1.0_1], type [_doc], id [3f1f4151-4b4a-4a04-bde1-4ae610358e81], message [ElasticsearchException[Elasticsearch exception [type=cluster_block_exception, reason=index [jarvis_jmetrics_1.0_1] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];]]]

Recommendations: you have 2 options:

a) Increase disk size in ElasticSearch nodes

b) Reduce data retention, delete Elastic backups, or manually delete some unnecessary Elastic indices, see: AIOps - NFS or Elastic Nodes disk full - How to reduce data retention

 

CHECK #3 : Zookeeper to Kafka connectivity

 
IMPORTANT: Kafka nodes/brokers should always be connected to zookeeper
 
1) Check if kafka brokers are connected to zookeeper
 
If you are using Openshift, go to the Openshift console | Applications | Pods | <zookeeper pod> | Terminal
Otherwise, you can exec into the zookeeper pod:

kubectl get pods -n<dxi-namespace> | grep zookeeper
kubectl exec -ti <zookeeper-pod> sh -n<dxi-namespace>
 
cd /opt/ca/zookeeper/bin
./zkCli.sh
ls /brokers/ids
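 
For a quick non-interactive check you can also pass the command directly to zkCli.sh (a sketch; assumes the default ZooKeeper client port 2181):
 
/opt/ca/zookeeper/bin/zkCli.sh -server localhost:2181 ls /brokers/ids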
 
Expected results: it displays the IDs of the kafka brokers connected to zookeeper.
If you have a medium elastic deployment, the result should be: [0, 1, 2]
 
 
If you have a medium elastic deployment and you see only 1 or 2 brokers listed, it means that some kafka brokers are having issues (they are down or have disconnected from zookeeper)

Recommendations:

a) Check that all kafka pods are up and running; if you have 3 elastic nodes, you should have 3 kafka pods.
 
kubectl get pods -n<dxi-namespace> | grep kafka
 

b) Restart the problematic kafka pods:

- Find out which kafka pods are the problematic ones to restart:
(In Openshift) Go to each of the Kafka pods > Environment tab and check the BROKER_ID variable; this tells you which kafka pod corresponds to which broker ID (for example, broker #2)

(in kubernetes) : kubectl describe po <kafka pod> -n<namespace>
 
 
- Once you have identified the problematic pods:
 
(In Openshift) click Actions > "Delete"
(In Kubernetes) : kubectl delete po <kafka pod> -n<namespace>
 
 
 
2) Check the zookeeper logs,  search for: ERROR or WARN
 
Zookeeper logs are available from:
 
a) <NFS>/jarvis/zookeeper-logs/zookeeper-<#>/*.log
b) If you are using Openshift, go to the Openshift console | Applications | Pods | <zookeeper-pod> | Logs
c) You can use oc or kubectl as below:
kubectl get pods -n<dxi-namespace> | grep zookeeper
kubectl logs <zookeeper-pod> -n<dxi-namespace>
OR
kubectl logs --tail=200 <zookeeper-pod> -n<dxi-namespace> 
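 
To go straight to the interesting lines, the log output can be filtered (a sketch):
 
kubectl logs <zookeeper-pod> -n<dxi-namespace> | egrep "ERROR|WARN"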
 
 
Here is an example when the ZooKeeper disk write duration exceeds 1s:

WARN  [SyncThread:3:[email protected]] - fsync-ing the write ahead log in SyncThread:3 took 16313ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
 
 

CHECK #4 : Kafka

 
1) Check the kafka logs,  search for: ERROR or WARN
 
Kafka logs are available from:
 
a) <NFS>/jarvis/kafka-logs/kafka-<#>/*.log
b) If you are using Openshift, go to the Openshift console | Applications | Pods | <kafka pod> | Logs
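 
If the NFS share is mounted on the host you are working from, a quick way to find which log files contain problems is (a sketch; -l prints only the names of matching files):
 
grep -El "ERROR|WARN" <NFS>/jarvis/kafka-logs/kafka-*/*.log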
 
Here is an example when kafka is not able to send the heartbeat on time, affecting the kafka to zookeeper connectivity:
 
INFO  [main-SendThread(zookeeper.doigivaudan.svc.cluster.local:2181):[email protected]] - Client session timed out, have not heard from server in 20010ms for sessionid 0x1709fbc4e26000a, closing socket connection and attempting reconnect
 
 
 
2)  Check if there is a LAG processing the data
 
 
a) If you are using Openshift, go to the Openshift console | Applications | Pods | <kafka pod> | Terminal
Otherwise, you can exec into any of the kafka pods:

kubectl get pods -n<dxi-namespace> | grep kafka
kubectl exec -ti <kafka-pod> sh -n<dxi-namespace>
 
b) Execute the below commands to identify if there is a LAG: 
 
 
List all available topics:

/opt/ca/kafka/bin/kafka-topics.sh --zookeeper jarvis-zookeeper:2181 --list

/opt/ca/kafka/bin/kafka-topics.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --list

/opt/ca/kafka/bin/kafka-topics.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --describe

List all consumer groups:

/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --list

Check for a possible LAG in jarvis (Recommendation: verify that the LAG column is not always > 0):

/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --describe --group jarvis_indexer

/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --describe --group indexer

/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --describe --group verifier


If AXA data is impacted (i.e., data is displayed but late), see https://knowledge.broadcom.com/external/article/238179

/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --describe --group axa.transformer

 

Here is an example illustrating a LAG condition:
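 
(If the screenshot does not render: in the --describe output the columns to watch are CURRENT-OFFSET, LOG-END-OFFSET and LAG. The row below is a made-up, illustrative example of a LAG condition; a LAG value that stays well above 0 between runs means the consumer is falling behind. Column names may vary slightly by Kafka version.)
 
GROUP            TOPIC      PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG     CONSUMER-ID     HOST         CLIENT-ID
jarvis_indexer   <topic>    0          1520344         1583210         62866   <consumer-id>   /<pod-ip>    <client-id>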

Here is an example illustrating a consumer disconnection condition:

Recommendation:  Restart jarvis services as below:

Scale down:
- jarvis-verifier                
- jarvis-lean-jarvis-indexer 
- jarvis-indexer

Scale up:
- jarvis-verifier                
- jarvis-lean-jarvis-indexer 
- jarvis-indexer

Below is the list of kubectl commands :

a) Scale down the following deployments:

kubectl scale --replicas=0 deployment jarvis-verifier -n<namespace>
kubectl scale --replicas=0 deployment jarvis-lean-jarvis-indexer  -n<namespace>
kubectl scale --replicas=0 deployment jarvis-indexer -n<namespace>

b) Verify that all pods are down:

kubectl get pods -n<namespace> | egrep "jarvis-verifier|jarvis-lean|jarvis-indexer"

c) Scale up deployments 

kubectl scale --replicas=1 deployment jarvis-verifier -n<namespace>
kubectl scale --replicas=1 deployment jarvis-lean-jarvis-indexer -n<namespace>
kubectl scale --replicas=1 deployment jarvis-indexer -n<namespace>

d) Verify that all pods are up and running:

kubectl get pods -n<namespace> | egrep "jarvis-verifier|jarvis-lean|jarvis-indexer"

e) Verify that alarms and ServiceNow incidents are reported as expected

 

B) What to collect if the problem persists?

If the problem persists after applying the above recommendations, collect the below logs and contact Broadcom Support:

 
<NFS>/jarvis/apis/logs/<jarvis-apis-pod>/*.log
<NFS>/jarvis/indexer/logs/<jarvis-indexer-pod>/*.log
<NFS>/jarvis/kafka-logs/kafka-<#>/*.log
<NFS>/jarvis/esutils/logs/<jarvis-esutils-pod>/*.log
<NFS>/jarvis/zookeeper-logs/zookeeper-<#>/*.log
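 
The directories above can be bundled into a single archive to attach to the support case (a sketch; the archive name is only an example):
 
tar czf jarvis-logs-$(date +%Y%m%d).tar.gz <NFS>/jarvis/apis/logs <NFS>/jarvis/indexer/logs <NFS>/jarvis/kafka-logs <NFS>/jarvis/esutils/logs <NFS>/jarvis/zookeeper-logs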
 
 

Additional Information


DX AIOps - Troubleshooting, Common Issues and Best Practices
https://knowledge.broadcom.com/external/article/190815/dx-oi-troubleshooting-common-issues-and.html
