DX AIOps - How to check Elastic Health

Article ID: 272228

Products

DX Operational Intelligence DX Application Performance Management CA App Experience Analytics

Issue/Introduction

The following is a high-level list of techniques and suggestions to use when troubleshooting common Jarvis performance and configuration issues.

Environment

DX AIOps 2x

Resolution

1. Checklist

1. Check for possible disk space issues in the NFS and Elastic Nodes

Important: Disk space in use should not exceed 80% on the ES nodes.

Here is an example of a message indicating a disk space problem. ElasticSearch has switched to read-only mode because of the disk issue:

blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];]]]
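
The 80% threshold above can be checked on each ES node with a small helper. This is a minimal sketch, not part of the product: the function name disks_over_threshold is hypothetical, and it assumes POSIX-format df output (df -P) with mount points that contain no spaces.

```shell
# Print "<mount point> <usage>" for every filesystem above the threshold.
# Reads `df -P` output on stdin; the threshold defaults to 80 (percent).
disks_over_threshold() {
  threshold="${1:-80}"
  awk -v limit="$threshold" '
    NR > 1 {                      # skip the df header line
      use = $5
      sub(/%/, "", use)           # "85%" -> "85"
      if (use + 0 > limit + 0) print $6, $5
    }'
}

# Usage on an ES node (hypothetical invocation):
#   df -P | disks_over_threshold 80
```

No output means that no filesystem is above the threshold.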

Recommendations: you have 3 options:

a) Increase the disk size on the ElasticSearch servers.
b) Some Elastic indices might be too big for the Jarvis Kron process to delete; see https://knowledge.broadcom.com/external/article/272232
c) Reduce data retention or manually delete unnecessary Elastic indices; see:

AIOps - ElasticSearch disk Full - How to reduce Elastic data retention?
https://knowledge.broadcom.com/external/article/207161

AIOps - kafka data consuming all disk space in Elastic nodes
https://knowledge.broadcom.com/external/article/222125
 

2. Check for possible physical memory issues on the server(s):

a) Go to each server in the cluster and run: free -h

b) Run: kubectl describe nodes

In the output, look for events such as:

"Warning System OOM encountered, victim process"
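
The check for the OOM warning can be scripted by filtering the node description. This is a rough sketch: the function name count_oom_events is hypothetical, and it only counts lines containing the message text quoted above.

```shell
# Count lines on stdin that contain the OOM warning text.
# Prints 0 when there is no match (grep -c still prints the count).
count_oom_events() {
  grep -cF 'System OOM encountered' || true
}

# Usage (hypothetical invocation):
#   kubectl describe nodes | count_oom_events
```

A result greater than 0 suggests that at least one node has killed processes because of memory pressure.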
 
 
 
3. Run the following common queries:

Description: Check Elastic status (make sure "status" : "green")
Syntax: http(s)://<ES_URL>/_cluster/health?pretty&human

Description: Check the indices affected by unassigned shards
Syntax: http(s)://<ES_URL>/_cat/shards?v&h=n,i,s,dc,pr,cds,iiti,st

Description: Display the nodes in the cluster (check memory, CPU, load)
Syntax: http(s)://<ES_URL>/_cat/nodes?v

Description: Check for possible errors during allocation, to get an explanation of cluster issues
Syntax: http(s)://<ES_URL>/_cluster/allocation/explain?pretty

IMPORTANT:

Starting from DX Platform 21.x, the Jarvis API and Elastic external routes are disabled by default, but you can create them as documented here: https://knowledge.broadcom.com/external/article/226870

If creating the endpoints is not possible, you can still query the internal Elastic endpoint "http://jarvis-elasticsearch-lb:9200" as shown below:

- Connect to any Kafka pod:

kubectl exec -ti <jarvis-kafka-pod> -n <namespace> -- sh

- Query elastic:

curl -XGET 'http://jarvis-elasticsearch-lb:9200/_cluster/health?pretty&human'
curl -XGET 'http://jarvis-elasticsearch-lb:9200/_cat/shards?v&h=n,i,s,dc,pr,cds,iiti,st'
curl -XGET 'http://jarvis-elasticsearch-lb:9200/_cat/nodes?v'
curl -XGET 'http://jarvis-elasticsearch-lb:9200/_cluster/allocation/explain?pretty'
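
The four queries above can also be run in one pass. The following is a sketch only: the function names es_query_urls and run_es_checks are hypothetical, and it assumes curl is available in the shell where it runs and that ES_URL defaults to the internal endpoint mentioned above.

```shell
# Base URL of Elastic; defaults to the internal endpoint from this article.
ES_URL="${ES_URL:-http://jarvis-elasticsearch-lb:9200}"

# Print the four health-check URLs for the given base URL, one per line.
es_query_urls() {
  base="${1%/}"   # drop a trailing slash, if any
  for path in \
    '_cluster/health?pretty&human' \
    '_cat/shards?v&h=n,i,s,dc,pr,cds,iiti,st' \
    '_cat/nodes?v' \
    '_cluster/allocation/explain?pretty'
  do
    printf '%s/%s\n' "$base" "$path"
  done
}

# Run every query in turn and print each response under a header line.
run_es_checks() {
  es_query_urls "$1" | while IFS= read -r url; do
    printf '=== %s\n' "$url"
    curl -sS -XGET "$url"
    printf '\n'
  done
}

# Usage (hypothetical invocation):
#   run_es_checks "$ES_URL"
```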

 
Verifications:

a) Check whether there are unassigned shards (see unassigned_shards in the response). Run:

http(s)://<ES_URL>/_cluster/health?pretty&human

b) If unassigned_shards is > 0, run the following 2 queries:

http(s)://<ES_URL>/_cat/shards?v&h=n,i,s,dc,pr,cds,iiti,st

This gives more details on the affected indices.

http(s)://<ES_URL>/_cluster/allocation/explain?pretty

This gives more details on why the allocation failed.
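
The verification in step a) can be reduced to a one-line summary of the health response. This is a minimal sketch that parses the JSON with sed instead of a JSON parser; the function name es_health_summary is hypothetical, and it relies only on the "status" and "unassigned_shards" fields of the cluster health response.

```shell
# Read a _cluster/health JSON response on stdin and print
# "status=<color> unassigned_shards=<count>".
es_health_summary() {
  json=$(cat)
  status=$(printf '%s\n' "$json" |
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p')
  # The leading quote keeps "delayed_unassigned_shards" from matching.
  unassigned=$(printf '%s\n' "$json" |
    sed -n 's/.*"unassigned_shards"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p')
  printf 'status=%s unassigned_shards=%s\n' "$status" "$unassigned"
}

# Usage (hypothetical invocation):
#   curl -sS 'http://jarvis-elasticsearch-lb:9200/_cluster/health?pretty' | es_health_summary
```

If the summary shows a status other than green or a non-zero unassigned_shards value, continue with the two queries in step b).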
 
 
Solution:

a) Open Postman (or any other REST client).

b) Run the following POST REST call to reassign the failed shards, leaving the body empty:

http(s)://<ES_URL>/_cluster/reroute?retry_failed=true
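
If Postman is not available, the same POST call can be sent with curl, for example from the Kafka pod. The sketch below is an assumption-laden example: the function name retry_failed_shards and the DRY_RUN variable are hypothetical, and the base URL must be replaced with your own ES_URL.

```shell
# Send the reroute call with an empty body.
# With DRY_RUN=1 the command is printed instead of executed.
retry_failed_shards() {
  url="${1%/}/_cluster/reroute?retry_failed=true"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf 'curl -sS -XPOST %s\n' "$url"
  else
    curl -sS -XPOST "$url"
  fi
}

# Usage (hypothetical invocation):
#   DRY_RUN=1 retry_failed_shards 'http://jarvis-elasticsearch-lb:9200'
#   retry_failed_shards 'http://jarvis-elasticsearch-lb:9200'
```

After the call, rerun the cluster health query to confirm that unassigned_shards has dropped.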

 
 
4. Check the ElasticSearch logs and search for WARN or ERROR entries
 
- The messages below indicate frequent garbage collections, which affect ES performance:
 
[2020-08-17T05:53:52,234][WARN ][o.e.m.j.JvmGcMonitorService] [oFOLnGK] [gc][476888] overhead, spent [1.3s] collecting in the last [1.4s]
..
[2020-08-17T05:53:52,230][WARN ][o.e.m.j.JvmGcMonitorService] [oFOLnGK] [gc][young][476888][10243] duration [1.3s], collections [1]/[1.4s], total [1.3s]/[17m], memory [9.5gb]->[3.6gb]/[10gb], all_pools {[young] [5.5gb]->[4mb]/[0b]}{[survivor] [40mb]->[172mb]/[0b]}{[old] [3.9gb]->[3.5gb]/[10gb]}
..

Recommendations

Increase the memory of each ElasticSearch deployment to 64GB. The minimum for ES is 32GB.
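
To see how often the GC overhead warning shown above occurs, the matching lines can be extracted from an ElasticSearch log. This is a sketch under the assumption that the log lines follow the format of the examples in this section; the function name gc_overhead_summary is hypothetical.

```shell
# Read ElasticSearch log lines on stdin and print, for each GC overhead
# warning, the timestamp, the time spent collecting, and the time window.
gc_overhead_summary() {
  sed -n 's/^\[\([^]]*\)\].*JvmGcMonitorService.*overhead, spent \[\([^]]*\)\] collecting in the last \[\([^]]*\)\].*/\1 spent=\2 window=\3/p'
}

# Usage (hypothetical invocation, the log file name is a placeholder):
#   gc_overhead_summary < <elasticsearch-log-file>
```

Many lines where the time spent is close to the window suggest that the ES node spends most of its time in garbage collection.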

2. What to collect if the problem persists?

If the problem persists after applying the above checks and recommendations, collect the logs below and contact Broadcom Support:

 
<NFS>/jarvis/apis/logs/<jarvis-apis-pod>/*.log
<NFS>/jarvis/indexer/logs/<jarvis-indexer-pod>/*.log
<NFS>/jarvis/kafka-logs/kafka-<#>/*.log
<NFS>/jarvis/esutils/logs/<jarvis-esutils-pod>/*.log
<NFS>/jarvis/zookeeper-logs/zookeeper-<#>/*.log
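
The log files listed above can be bundled into a single archive before contacting support. This is a rough sketch, not an official collection tool: the function name collect_jarvis_logs is hypothetical, it assumes a tar that supports reading the file list from stdin (-T -), and it picks up every *.log file under the five directories, including any nested deeper than the paths listed above.

```shell
# Archive all *.log files from the Jarvis log directories under the NFS root.
#   $1 = NFS root directory (the <NFS> placeholder above)
#   $2 = path of the .tar.gz archive to create (use an absolute path)
collect_jarvis_logs() {
  nfs_root="$1"
  archive="$2"
  (
    cd "$nfs_root" || exit 1
    # Directories that do not exist are skipped silently.
    find jarvis/apis/logs jarvis/indexer/logs jarvis/kafka-logs \
         jarvis/esutils/logs jarvis/zookeeper-logs \
         -type f -name '*.log' 2>/dev/null
  ) | tar -czf "$archive" -C "$nfs_root" -T -
}

# Usage (hypothetical invocation, replace <NFS> with your NFS root):
#   collect_jarvis_logs <NFS> /tmp/jarvis-logs.tar.gz
```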
 

Additional Information

https://knowledge.broadcom.com/external/article/190815/aiops-troubleshooting-common-issues-and.html