Products

DX Operational Intelligence, DX Application Performance Management, CA App Experience Analytics

Issue/Introduction

The following is a high-level list of techniques and suggestions to employ when troubleshooting common Jarvis performance and configuration issues.

Environment

DX AIOps 2x

Resolution

1. Checklist

1. Check for possible disk space issues in the NFS and Elastic Nodes

Important: Disk space in use should not exceed 80% on the ES nodes.

Here is an example of a message indicating a disk space problem. ElasticSearch has put the index into read-only mode because of the disk issue:

blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];]]]
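The 80% rule above can be checked quickly on each ElasticSearch or NFS host. The following is a minimal sketch, not an official tool: the function name over_threshold is hypothetical, and it assumes the standard `df -P` column layout (capacity in column 5, mount point in column 6).

```shell
#!/bin/sh
# Print every filesystem whose usage is above a limit (default 80%).
# Reads `df -P` formatted text on stdin, so output captured on another
# host can be checked as well.
over_threshold() {
    limit="${1:-80}"
    awk -v limit="$limit" 'NR > 1 {
        use = $5
        sub(/%/, "", use)                 # "85%" -> "85"
        if (use + 0 > limit + 0) print $6, $5
    }'
}

# Usage on a node:
#   df -P | over_threshold 80
```

An empty result means no filesystem is above the limit. Mount points containing spaces are not handled by this sketch.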

Recommendations: you have the following options:

a) Increase the disk size on the ElasticSearch servers

b) Some elastic indices might be too big for the Jarvis Kron process to delete them, see https://knowledge.broadcom.com/external/article/272232

c) Reduce data retention or manually delete some unnecessary elastic indices, see:

AIOps - ElasticSearch disk Full - How to reduce Elastic data retention?
https://knowledge.broadcom.com/external/article/207161

AIOps - kafka data consuming all disk space in Elastic nodes
https://knowledge.broadcom.com/external/article/222125

2. Check for possible physical memory issues in the server(s):

a) Go to each server in the cluster and run: free -h

b) Run: kubectl describe nodes

Look for events such as:

"Warning System OOM encountered, victim process"
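On clusters with many nodes, the output of kubectl describe nodes is long. The sketch below filters it down to the OOM events and prefixes each with the node it belongs to. The function name oom_by_node is hypothetical, and the sketch assumes each node section starts with a "Name:" line at the beginning of a line.

```shell
#!/bin/sh
# Reads `kubectl describe nodes` output on stdin and prints one line per
# OOM event, prefixed with the name of the node that reported it.
oom_by_node() {
    awk '
        /^Name:/ { node = $2 }            # remember the current node
        /System OOM encountered/ {
            line = $0
            sub(/^[ \t]+/, "", line)      # drop the event indentation
            print node ": " line
        }
    '
}

# Usage:
#   kubectl describe nodes | oom_by_node
```

No output means no OOM events are currently recorded for any node.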

3. Run the below common queries:

Description	Syntax
Check Elastic Status (make sure "status" : "green")	http(s)://<ES_URL>/_cluster/health?pretty&human
Check affected indices due to unassigned shards	http(s)://<ES_URL>/_cat/shards?v&h=n,i,s,dc,pr,cds,iiti,st
Displays nodes in cluster (check memory, cpu, load)	http(s)://<ES_URL>/_cat/nodes?v
Check for possible errors during allocation, to get explanation on cluster issues	http(s)://<ES_URL>/_cluster/allocation/explain?pretty

IMPORTANT:

Starting from DX Platform 21.x, the Jarvis API and Elastic external routes are disabled by default but you can create them as documented here: https://knowledge.broadcom.com/external/article/226870

If creating the endpoints is not possible, you can still query the internal elastic endpoint "http://jarvis-elasticsearch-lb:9200" as below:

- Connect to any kafka pod:

kubectl exec -ti <jarvis-kafka-pod> -n <namespace> -- sh

- Query elastic:

curl -XGET 'http://jarvis-elasticsearch-lb:9200/_cluster/health?pretty&human'
curl -XGET 'http://jarvis-elasticsearch-lb:9200/_cat/shards?v&h=n,i,s,dc,pr,cds,iiti,st'
curl -XGET 'http://jarvis-elasticsearch-lb:9200/_cat/nodes?v'
curl -XGET 'http://jarvis-elasticsearch-lb:9200/_cluster/allocation/explain?pretty'
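The four queries above can be run in one pass. The following is a sketch under stated assumptions: the helper names es_url and es_health_report are hypothetical, and ES_URL defaults to the internal endpoint named in this article (set it to your external route if one exists).

```shell
#!/bin/sh
# Internal endpoint from this article; override ES_URL if a route exists.
ES_URL="${ES_URL:-http://jarvis-elasticsearch-lb:9200}"

# Join the base URL and a query path without doubling the slash.
es_url() {
    printf '%s/%s\n' "${ES_URL%/}" "${1#/}"
}

# Run the four checks in order and label each section of the output.
es_health_report() {
    for path in \
        '_cluster/health?pretty&human' \
        '_cat/shards?v&h=n,i,s,dc,pr,cds,iiti,st' \
        '_cat/nodes?v' \
        '_cluster/allocation/explain?pretty'
    do
        printf '== %s\n' "$path"
        curl -sS -XGET "$(es_url "$path")"
        printf '\n'
    done
}

# Usage (inside the pod, or anywhere ES_URL is reachable):
#   es_health_report > es-health.txt
```

When the cluster has no unassigned shards, the allocation explain call returns an error response rather than an explanation; that is expected.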

Verifications:

a) Check whether there are unassigned shards (unassigned_shards), run: http(s)://<ES_URL>/_cluster/health?pretty&human

b) If unassigned_shards is > 0, run the following 2 queries:

http(s)://<ES_URL>/_cat/shards?v&h=n,i,s,dc,pr,cds,iiti,st

This gives more details about the affected indices.

http(s)://<ES_URL>/_cluster/allocation/explain?pretty

This gives more details about why the allocation failed.
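The two verification steps can be scripted when curl is available. This is a minimal sketch: the function names unassigned_count and only_unassigned are hypothetical, and the second one assumes the shard state (st) is the last column, as in the query shown above.

```shell
#!/bin/sh
# Extract the unassigned_shards value from _cluster/health JSON on stdin.
# The leading quote in the pattern keeps "delayed_unassigned_shards"
# from matching.
unassigned_count() {
    sed -n 's/.*"unassigned_shards"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' |
        head -n 1
}

# Keep the header line and every row whose last column is UNASSIGNED.
only_unassigned() {
    awk 'NR == 1 || $NF == "UNASSIGNED"'
}

# Usage:
#   curl -s 'http://jarvis-elasticsearch-lb:9200/_cluster/health?pretty&human' | unassigned_count
#   curl -s 'http://jarvis-elasticsearch-lb:9200/_cat/shards?v&h=n,i,s,dc,pr,cds,iiti,st' | only_unassigned
```

Rows are printed unchanged because unassigned shards have empty node and document columns, so positional parsing of the other fields would be unreliable.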

Solution:

a) Open Postman

b) Run the following POST REST call to retry the allocation of the failed shards; leave the body empty:

http(s)://<ES_URL>/_cluster/reroute?retry_failed=true
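If Postman is not available, for example when working from inside a pod, the same call can be made with curl. The sketch below is one way to do it: the function name retry_failed_shards and the DRY_RUN switch are hypothetical conveniences, and ES_URL defaults to the internal endpoint named in this article.

```shell
#!/bin/sh
ES_URL="${ES_URL:-http://jarvis-elasticsearch-lb:9200}"

# POST to _cluster/reroute with an empty body so that ElasticSearch
# retries shards whose allocation previously failed.
# With DRY_RUN=1 the command is printed instead of executed.
retry_failed_shards() {
    url="${ES_URL%/}/_cluster/reroute?retry_failed=true"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "curl -sS -XPOST '$url'"
    else
        curl -sS -XPOST "$url"
    fi
}

# Usage:
#   DRY_RUN=1 retry_failed_shards   # show what would be sent
#   retry_failed_shards             # send the request
```

After the call, re-run the cluster health query to confirm that unassigned_shards is going down.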

4. Check the ElasticSearch logs and search for WARN or ERROR messages

- The messages below indicate frequent garbage collections, which affect ES performance:

[2020-08-17T05:53:52,234][WARN ][o.e.m.j.JvmGcMonitorService] [oFOLnGK] [gc][476888] overhead, spent [1.3s] collecting in the last [1.4s]
..
[2020-08-17T05:53:52,230][WARN ][o.e.m.j.JvmGcMonitorService] [oFOLnGK] [gc][young][476888][10243] duration [1.3s], collections [1]/[1.4s], total [1.3s]/[17m], memory [9.5gb]->[3.6gb]/[10gb], all_pools {[young] [5.5gb]->[4mb]/[0b]}{[survivor] [40mb]->[172mb]/[0b]}{[old] [3.9gb]->[3.5gb]/[10gb]}
..
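To judge how frequent these warnings are, they can be counted in a log. A minimal sketch, assuming the log text is supplied on stdin; the function name gc_overhead_count is hypothetical.

```shell
#!/bin/sh
# Count JvmGcMonitorService "overhead" warnings in log text read from stdin.
# grep -c exits non-zero when the count is 0, hence the `|| true`.
gc_overhead_count() {
    grep -c 'JvmGcMonitorService.*overhead, spent' || true
}

# Usage (pod name and namespace are placeholders):
#   kubectl logs <jarvis-elasticsearch-pod> -n <namespace> | gc_overhead_count
```

A count that keeps growing over a short period points to sustained memory pressure rather than an isolated collection.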

Recommendations:

Increase the memory of each ElasticSearch deployment to 64 GB. The minimum for ES is 32 GB of memory.

2. What to collect if the problem persists?

If the problem persists after applying the above checks and recommendations, collect the logs below and contact Broadcom Support:

<NFS>/jarvis/apis/logs/<jarvis-apis-pod>/*.log
<NFS>/jarvis/indexer/logs/<jarvis-indexer-pod>/*.log
<NFS>/jarvis/kafka-logs/kafka-<#>/*.log
<NFS>/jarvis/esutils/logs/<jarvis-esutils-pod>/*.log
<NFS>/jarvis/zookeeper-logs/zookeeper-<#>/*.log
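The log paths above can be gathered with a small helper before opening a case. This is a sketch only: the function name list_jarvis_logs is hypothetical, the NFS mount point is passed as an argument, and it relies on the -mindepth and -maxdepth options found in GNU and BSD find.

```shell
#!/bin/sh
# Print the *.log files that match the paths listed above.
# $1 is the NFS root, i.e. the directory that contains "jarvis".
list_jarvis_logs() {
    nfs_root="${1%/}"
    for dir in apis/logs indexer/logs kafka-logs esutils/logs zookeeper-logs; do
        # Skip components that are not present on this share.
        [ -d "$nfs_root/jarvis/$dir" ] || continue
        # One directory level per pod or broker, then the log files.
        find "$nfs_root/jarvis/$dir" -mindepth 2 -maxdepth 2 -type f -name '*.log'
    done
}

# Usage (replace <NFS> with the actual mount point):
#   list_jarvis_logs <NFS> | tar -czf jarvis-logs.tar.gz -T -
```

Review the file list before archiving; on busy systems the logs can be large.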

Additional Information

https://knowledge.broadcom.com/external/article/190815/aiops-troubleshooting-common-issues-and.html

DX AIOps - How to check Elastic Health

Article ID: 272228

Updated On: 10-04-2023