1. Checklist
1. Check for possible disk space issues in the NFS and Elastic Nodes
Important: disk space in use should not exceed 80% on the ES nodes.
Here is an example of a message indicating a disk space problem; ElasticSearch has switched indices to read-only because of the disk issue:
blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];]
Recommendations: you have 2 options:
a) Increase the disk size on the ElasticSearch servers
b) Reduce data retention or manually delete unnecessary Elastic indices, see
AIOps - ElasticSearch disk Full - How to reduce Elastic data retention?
https://knowledge.broadcom.com/external/article/207161
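For a quick check of the current disk usage, the commands below can be used as a minimal sketch (the internal Elastic endpoint is the one described in step 3; data paths and watermark settings may differ in your environment):
# On each NFS/ES node, check overall filesystem usage
df -h
# Ask Elasticsearch for per-node disk usage and shard counts
curl -XGET 'http://jarvis-elasticsearch-lb:9200/_cat/allocation?v'
# Once disk usage is back below the watermarks, the read-only block shown above can be cleared (verify in your environment before running):
curl -XPUT 'http://jarvis-elasticsearch-lb:9200/_all/_settings' -H 'Content-Type: application/json' -d '{"index.blocks.read_only_allow_delete": null}'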
2. Check for possible physical memory issues in the server(s):
a) Go to each server in the cluster and run: free -h
b) Run: kubectl describe nodes and check the events for OOM messages such as:
"Warning System OOM encountered, victim process"
3. Run the below common queries:
Description | Syntax
Check Elastic status (make sure "status" : "green") | http(s)://<ES_URL>/_cluster/health?pretty&human
Check affected indices due to unassigned shards | http(s)://<ES_URL>/_cat/shards?v&h=n,i,s,dc,pr,cds,iiti,st
Display nodes in the cluster (check memory, CPU, load) | http(s)://<ES_URL>/_cat/nodes?v
Check for possible errors during allocation, to get an explanation of cluster issues | http(s)://<ES_URL>/_cluster/allocation/explain?pretty
IMPORTANT:
Starting from DX Platform 21.x, the Jarvis API and Elastic external routes are disabled by default, but you can create them as documented here: https://knowledge.broadcom.com/external/article/226870
If creating the endpoints is not possible, you can still query the internal Elastic endpoint "http://jarvis-elasticsearch-lb:9200" as shown below:
- Connect to any kafka pod:
kubectl exec -ti <jarvis-kafka-pod> -n <namespace> -- sh
- Query elastic:
curl -XGET 'http://jarvis-elasticsearch-lb:9200/_cluster/health?pretty&human'
curl -XGET 'http://jarvis-elasticsearch-lb:9200/_cat/shards?v&h=n,i,s,dc,pr,cds,iiti,st'
curl -XGET 'http://jarvis-elasticsearch-lb:9200/_cat/nodes?v'
curl -XGET 'http://jarvis-elasticsearch-lb:9200/_cluster/allocation/explain?pretty'
Verifications:
a) Check whether there are unassigned_shards; run: http(s)://<ES_URL>/_cluster/health?pretty&human
b) If unassigned_shards is > 0, run the below 2 queries (a combined curl sketch follows):
http(s)://<ES_URL>/_cat/shards?v&h=n,i,s,dc,pr,cds,iiti,st
This will give you more details about the affected indices.
http(s)://<ES_URL>/_cluster/allocation/explain?pretty
This will give you more details about why allocation failed.
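The same verification can be scripted from inside a kafka pod using the internal endpoint (a sketch; the grep filters are only a convenience and can be omitted):
# Check the unassigned_shards counter in the cluster health output
curl -s -XGET 'http://jarvis-elasticsearch-lb:9200/_cluster/health?pretty' | grep unassigned_shards
# If the value is > 0, list the shards that are not started and ask why allocation failed
curl -s -XGET 'http://jarvis-elasticsearch-lb:9200/_cat/shards?v&h=n,i,s,dc,pr,cds,iiti,st' | grep -v STARTED
curl -s -XGET 'http://jarvis-elasticsearch-lb:9200/_cluster/allocation/explain?pretty'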
Solution:
a) Open Postman
b) Run the below POST REST call to reassign failed shards, leaving the body empty:
http(s)://<ES_URL>/_cluster/reroute?retry_failed=true
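If Postman is not available, the same call can be issued with curl from inside a kafka pod against the internal endpoint (an empty body is expected for this request):
curl -XPOST 'http://jarvis-elasticsearch-lb:9200/_cluster/reroute?retry_failed=true'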
4. Check the ElasticSearch logs and search for WARN or ERROR messages.
- The messages below indicate frequent garbage collections being executed, which affect ES performance:
[2020-08-17T05:53:52,234][WARN ][o.e.m.j.JvmGcMonitorService] [oFOLnGK] [gc][476888] overhead, spent [1.3s] collecting in the last [1.4s]
..
[2020-08-17T05:53:52,230][WARN ][o.e.m.j.JvmGcMonitorService] [oFOLnGK] [gc][young][476888][10243] duration [1.3s], collections [1]/[1.4s], total [1.3s]/[17m], memory [9.5gb]->[3.6gb]/[10gb], all_pools {[young] [5.5gb]->[4mb]/[0b]}{[survivor] [40mb]->[172mb]/[0b]}{[old] [3.9gb]->[3.5gb]/[10gb]}
..
Recommendations:
Increase the memory resource on each of the ElasticSearch deployments to 64GB; the minimum for ES is 32GB of memory.
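To confirm how often these GC warnings occur and what memory is currently assigned, the ElasticSearch pods can be inspected directly (a sketch; pod and namespace names are placeholders and may differ in your deployment):
# Count GC overhead warnings reported by Elasticsearch
kubectl logs <jarvis-elasticsearch-pod> -n <namespace> | grep -c "JvmGcMonitorService"
# Review the memory requests/limits currently set on the ES pod
kubectl describe pod <jarvis-elasticsearch-pod> -n <namespace> | grep -A 3 -i "Limits:"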
2. What to collect if the problem persists?
If the problem persists after applying the above checks and recommendations, collect the below logs and contact Broadcom Support:
<NFS>/jarvis/apis/logs/<jarvis-apis-pod>/*.log
<NFS>/jarvis/indexer/logs/<jarvis-indexer-pod>/*.log
<NFS>/jarvis/kafka-logs/kafka-<#>/*.log
<NFS>/jarvis/esutils/logs/<jarvis-esutils-pod>/*.log
<NFS>/jarvis/zookeeper-logs/zookeeper-<#>/*.log
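To make the upload easier, the above logs can be bundled into a single archive (a sketch; run it on the server where the NFS share is mounted and replace <NFS> with the actual mount point):
tar czf jarvis-support-logs.tar.gz \
  <NFS>/jarvis/apis/logs \
  <NFS>/jarvis/indexer/logs \
  <NFS>/jarvis/kafka-logs \
  <NFS>/jarvis/esutils/logs \
  <NFS>/jarvis/zookeeper-logs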