DX AIOps - How to check Elastic Health


Article ID: 272228


Updated On:


DX Operational Intelligence, DX Application Performance Management, CA App Experience Analytics


The following is a high-level list of techniques and suggestions to use when troubleshooting common Jarvis performance and configuration issues.


DX AIOps 2x



1. Check for possible disk space issues in the NFS and Elastic nodes

Important: disk space in use should not exceed 80% on ES nodes.

Here is an example of a message indicating a disk space problem. ElasticSearch is in read-only mode because of the disk issue:

blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];]]]
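
A minimal sketch for spotting filesystems above the 80% threshold, assuming a POSIX `df -P` is available on each NFS and ES node. The function name `over_threshold` is illustrative and not part of any DX AIOps tooling.

```shell
#!/bin/sh
# Print every filesystem whose usage exceeds a threshold (default 80%).
# Reads `df -P` output on stdin; the POSIX format puts the capacity in
# column 5 and the mount point in column 6.
over_threshold() {
    limit="${1:-80}"
    awk -v limit="$limit" 'NR > 1 {
        use = $5
        sub(/%/, "", use)              # strip the percent sign
        if (use + 0 > limit + 0) print $6 " " $5
    }'
}
```

For example, `df -P | over_threshold 80` prints the mount point and usage of each filesystem above 80%. Mount points containing spaces are not handled by this sketch.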

Recommendations: you have 3 options:
a) Increase the disk size on the ElasticSearch servers.
b) Some Elastic indices might be too big for the Jarvis Kron process to delete; see https://knowledge.broadcom.com/external/article/272232
c) Reduce data retention or manually delete unnecessary Elastic indices; see:

AIOps - ElasticSearch disk Full - How to reduce Elastic data retention?

AIOps - kafka data consuming all disk space in Elastic nodes

2. Check for possible physical memory issues on the server(s):

a) Go to each server in the cluster and run: free -h
b) Run: kubectl describe nodes
Look for events such as: "Warning System OOM encountered, victim process"
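
The event check in step b) can be scripted. The sketch below assumes the kubelet reports OOM kills with the message text quoted above; `count_oom_events` is a hypothetical helper name.

```shell
#!/bin/sh
# Count node events that report a system OOM kill.
# Reads `kubectl describe nodes` output on stdin and prints the number
# of matching lines (0 when there are none).
count_oom_events() {
    # grep -c exits non-zero when nothing matches; keep the exit status clean.
    grep -c 'System OOM encountered' || true
}
```

A possible invocation is `kubectl describe nodes | count_oom_events`; any value above 0 suggests that at least one node ran out of memory.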
3. Run the common queries below:

Description | Syntax
Check Elastic status (make sure "status" : "green") | http(s)://<ES_URL>/_cluster/health?pretty&human
Check indices affected by unassigned shards | http(s)://<ES_URL>/_cat/shards?v&h=n,i,s,dc,pr,cds,iiti,st
Display the nodes in the cluster (check memory, CPU, load) | http(s)://<ES_URL>/_cat/nodes?v
Check for possible errors during allocation and get an explanation of cluster issues | http(s)://<ES_URL>/_cluster/allocation/explain?pretty
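
As a rough sketch, the health check in the first row can be automated by reading the `status` and `unassigned_shards` fields from the `?pretty` response. The helper names `health_field` and `check_health` are illustrative, and the parsing assumes the one-field-per-line layout that `?pretty` produces; a JSON tool such as jq would be more robust where it is available.

```shell
#!/bin/sh
# Extract one top-level field from pretty-printed JSON on stdin
# (one "key" : value pair per line, as returned with ?pretty).
health_field() {
    sed -n 's/^[[:space:]]*"'"$1"'"[[:space:]]*:[[:space:]]*"\{0,1\}\([^",]*\)"\{0,1\},\{0,1\}[[:space:]]*$/\1/p'
}

# Read a _cluster/health response on stdin and report whether the
# cluster is green with no unassigned shards. Returns 1 otherwise.
check_health() {
    json=$(cat)
    status=$(printf '%s\n' "$json" | health_field status)
    unassigned=$(printf '%s\n' "$json" | health_field unassigned_shards)
    if [ "$status" = "green" ] && [ "${unassigned:-0}" -eq 0 ]; then
        echo "OK status=$status unassigned_shards=$unassigned"
    else
        echo "ATTENTION status=$status unassigned_shards=$unassigned"
        return 1
    fi
}
```

It could be used as `curl -s 'http(s)://<ES_URL>/_cluster/health?pretty&human' | check_health`, with `<ES_URL>` replaced by the actual endpoint.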


Starting from DX Platform 21.x, the Jarvis API and Elastic external routes are disabled by default, but you can create them as documented here: https://knowledge.broadcom.com/external/article/226870

If creating the endpoints is not possible, you can still query the internal Elastic endpoint "http://jarvis-elasticsearch-lb:9200" as shown below:

- Connect to any Kafka pod:

kubectl exec -ti <jarvis-kafka-pod> -n <namespace> -- sh

- Query Elastic:

curl -XGET 'http://jarvis-elasticsearch-lb:9200/_cluster/health?pretty&human'
curl -XGET 'http://jarvis-elasticsearch-lb:9200/_cat/shards?v&h=n,i,s,dc,pr,cds,iiti,st'
curl -XGET 'http://jarvis-elasticsearch-lb:9200/_cat/nodes?v'
curl -XGET 'http://jarvis-elasticsearch-lb:9200/_cluster/allocation/explain?pretty'
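
If an interactive shell is not needed, the same queries can be run in one pass from outside the pod. This sketch assumes curl is present in the Kafka pod image, as the commands above imply; the function names and the `ES_INTERNAL` variable are illustrative.

```shell
#!/bin/sh
# Internal Elastic endpoint named in this article.
ES_INTERNAL="http://jarvis-elasticsearch-lb:9200"

# The four health queries listed above, one path per line.
es_health_paths() {
    cat <<'EOF'
/_cluster/health?pretty&human
/_cat/shards?v&h=n,i,s,dc,pr,cds,iiti,st
/_cat/nodes?v
/_cluster/allocation/explain?pretty
EOF
}

# Run every query from inside a pod without opening an interactive shell.
# $1 = pod name, $2 = namespace (both supplied by the caller).
run_es_health_queries() {
    pod="$1"
    ns="$2"
    es_health_paths | while IFS= read -r path; do
        echo "== $path"
        # </dev/null keeps kubectl from consuming the loop's input.
        kubectl exec "$pod" -n "$ns" -- curl -s -XGET "${ES_INTERNAL}${path}" </dev/null
        echo
    done
}
```

A call such as `run_es_health_queries <jarvis-kafka-pod> <namespace>` prints each query path followed by its response.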

a) Check whether there are unassigned shards by running: http(s)://<ES_URL>/_cluster/health?pretty&human
b) If unassigned_shards is > 0, run the 2 queries below:
http(s)://<ES_URL>/_cat/shards?v&h=n,i,s,dc,pr,cds,iiti,st
This gives more details about the affected indices.
http(s)://<ES_URL>/_cluster/allocation/explain?pretty
This gives more details about why the allocation failed.
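
To narrow the shard listing down to the unassigned entries, the output can be filtered. The sketch below is an assumption-laden variant: instead of the column set used in this article, it expects the `_cat/shards` columns `index,shard,prirep,state,unassigned.reason` without a header row, and `unassigned_shards` is a hypothetical helper name.

```shell
#!/bin/sh
# List unassigned shards with the reason reported by Elasticsearch.
# Expects on stdin the output of:
#   _cat/shards?h=index,shard,prirep,state,unassigned.reason
# (no header row, whitespace-separated columns).
unassigned_shards() {
    # Column 4 is the shard state; print index, shard, prirep and reason.
    awk '$4 == "UNASSIGNED" { print $1, $2, $3, $5 }'
}
```

For instance: `curl -s 'http://jarvis-elasticsearch-lb:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | unassigned_shards`.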

To reassign failed shards:

a) Open Postman.

b) Run the POST REST call below, leaving the body empty.
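
The POST call itself is not shown in this article. A likely candidate, stated here as an assumption rather than the article's confirmed call, is the standard Elasticsearch `_cluster/reroute?retry_failed=true` endpoint, which retries allocation of shards whose earlier allocation attempts failed. A curl equivalent of the Postman request might look like the following; both function names are illustrative.

```shell
#!/bin/sh
# Build the reroute URL for a given Elasticsearch base URL.
# ASSUMPTION: _cluster/reroute?retry_failed=true is the intended call.
reroute_retry_url() {
    # ${1%/} drops a single trailing slash from the base URL, if present.
    printf '%s/_cluster/reroute?retry_failed=true\n' "${1%/}"
}

# Send the POST with an empty body, as the Postman step describes.
retry_failed_shards() {
    curl -s -XPOST "$(reroute_retry_url "$1")"
}
```

From inside a pod, `retry_failed_shards http://jarvis-elasticsearch-lb:9200` would issue the request; re-run the cluster health query afterwards to confirm whether unassigned_shards dropped.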


4. Check the ElasticSearch logs and search for WARN or ERROR entries
- The messages below indicate frequent garbage collections, which affect ES performance:
[2020-08-17T05:53:52,234][WARN ][o.e.m.j.JvmGcMonitorService] [oFOLnGK] [gc][476888] overhead, spent [1.3s] collecting in the last [1.4s]
[2020-08-17T05:53:52,230][WARN ][o.e.m.j.JvmGcMonitorService] [oFOLnGK] [gc][young][476888][10243] duration [1.3s], collections [1]/[1.4s], total [1.3s]/[17m], memory [9.5gb]->[3.6gb]/[10gb], all_pools {[young] [5.5gb]->[4mb]/[0b]}{[survivor] [40mb]->[172mb]/[0b]}{[old] [3.9gb]->[3.5gb]/[10gb]}
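
To gauge how often these warnings occur, the overhead lines can be extracted from a log. A minimal sketch, assuming the log line layout shown above; `gc_overhead` is a hypothetical helper name.

```shell
#!/bin/sh
# Summarise JvmGcMonitorService "overhead" warnings from an Elasticsearch log.
# Reads log lines on stdin and prints "<timestamp> spent=<x> window=<y>".
gc_overhead() {
    # Split each line on '[' and ']': field 2 is the timestamp, and the
    # last two bracketed values are the GC time and the sampling window.
    awk -F'[][]' '/JvmGcMonitorService/ && /overhead, spent/ {
        print $2 " spent=" $(NF-3) " window=" $(NF-1)
    }'
}
```

One way to use it is `kubectl logs <es-pod> -n <namespace> | gc_overhead`. In the first sample message above, 1.3s of a 1.4s window was spent collecting, which means the JVM did almost nothing but garbage collection during that interval.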


Recommendation: increase the memory resources of each ElasticSearch deployment to 64 GB; the minimum for ES is 32 GB of memory.

2. What to collect if the problem persists?

If the problem persists after applying the above checks and recommendations, collect the logs below and contact Broadcom Support:


Additional Information