1. Checklist
1. Check for possible disk space issues in the NFS server and in the Elastic nodes if they are LOCAL.
- Disk space used should not exceed 80% on the ES nodes.
- Here is an example of a message indicating a disk space problem; ElasticSearch goes into read-only mode because of the disk space issue (see the sketch after the recommendations below for how to verify disk usage and clear this block):
blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
Recommendations:
a) Increase disk size on the ElasticSearch servers
b) Reduce data retention, see:
DX O2 ElasticSearch disk Full - How to reduce Elastic data retention?
DX O2 kafka data consuming all disk space in Elastic nodes
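A minimal verification sketch, assuming the commands are run inside an elastic pod against the local 9200 endpoint; note that on older ElasticSearch versions the read-only block is NOT lifted automatically after space is freed:
# Per-node disk usage as ElasticSearch sees it
curl -XGET 'http://localhost:9200/_cat/allocation?v&h=node,disk.percent,disk.used,disk.avail'
# Disk usage at the OS level
df -h
# After freeing space, remove the read-only block from all indices
# (ElasticSearch 7.4+ removes it automatically, older versions do not)
curl -XPUT 'http://localhost:9200/_all/_settings' -H 'Content-Type: application/json' \
  -d '{"index.blocks.read_only_allow_delete": null}'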
2. Check for possible physical memory issues in the server(s):
a) check memory availability in all nodes: free -h
b) check the cluster health: kubectl describe nodes
Here is an example of an OOM situation: "Warning System OOM encountered, victim process"
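The kubelet reports these as SystemOOM events, so they can be spotted cluster-wide without describing each node; a minimal sketch, assuming kubectl access to all namespaces:
# List OOM kills reported by the kubelet on any node
kubectl get events -A --field-selector reason=SystemOOM
# Check the MemoryPressure condition on every node
kubectl describe nodes | grep -i MemoryPressure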
3. Check ElasticSearch health
- Connect to any elastic pod:
kubectl exec -ti <elastic-pod> -n <namespace> -- bash
example:
kubectl exec -ti elasticsearch-master-0 -n dxi -- bash
- Query elastic:
curl -XGET 'http://localhost:9200/_cluster/health?pretty&human'
curl -XGET 'http://localhost:9200/_cat/shards?v&h=n,i,s,dc,pr,cds,iiti,st'
curl -XGET 'http://localhost:9200/_cat/nodes?v'
curl -XGET 'http://localhost:9200/_cluster/allocation/explain?pretty'
Verification:
a) Check if there are unassigned_shards (an illustrative response follows this list), run: curl -XGET 'http://localhost:9200/_cluster/health?pretty&human'
b) If unassigned_shards is > 0, run the two queries below:
curl -XGET 'http://localhost:9200/_cat/shards?v&h=n,i,s,dc,pr,cds,iiti,st'
This gives you more details about the affected indices.
curl -XGET 'http://localhost:9200/_cluster/allocation/explain?pretty'
This gives you more details about why the allocation failed.
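For reference, a trimmed and purely illustrative _cluster/health response with allocation problems (the field names are real, the values are made up):
curl -XGET 'http://localhost:9200/_cluster/health?pretty&human'
{
  "cluster_name" : "elasticsearch",
  "status" : "red",
  "number_of_nodes" : 3,
  "active_shards" : 42,
  "unassigned_shards" : 12
}
A yellow status means only replica shards are unassigned; red means at least one primary shard is unassigned.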

Solution:
Run:
curl -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true'
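To confirm the retry worked, poll the cluster health until unassigned_shards drops back to 0; a minimal sketch, the 10-second interval is arbitrary:
# Poll status and unassigned_shards every 10 seconds (Ctrl-C to stop)
while true; do
  curl -s 'http://localhost:9200/_cluster/health?pretty' | grep -E '"status"|unassigned_shards'
  sleep 10
done
retry_failed only retries shards whose allocation has already failed the maximum number of times; if shards stay unassigned, re-check the allocation/explain output above.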
4. Check the ElasticSearch logs and search for WARN or ERROR entries (see the sketch after the recommendations below).
Below is an example of a memory issue affecting ElasticSearch:
[WARN ][o.e.m.j.JvmGcMonitorService] [oFOLnGK] [gc][476888] overhead, spent [1.3s] collecting in the last [1.4s]
..
[WARN ][o.e.m.j.JvmGcMonitorService] [oFOLnGK] [gc][young][476888][10243] duration [1.3s], collections [1]/[1.4s], total [1.3s]/[17m], memory [9.5gb]->[3.6gb]/[10gb], all_pools {[young] [5.5gb]->[4mb]/[0b]}{[survivor] [40mb]->[172mb]/[0b]}{[old] [3.9gb]->[3.5gb]/[10gb]}
..
Recommendations:
Double the memory on each of the ElasticSearch deployments.
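A quick way to filter the pod logs and to check how close the JVM heap is to its limit; a minimal sketch reusing the pod name and namespace from the example in step 3:
# Filter the ElasticSearch pod logs for warnings and errors
kubectl logs elasticsearch-master-0 -n dxi | grep -E 'WARN|ERROR'
# Per-node JVM heap usage; a heap.percent that stays high (e.g. above ~85%)
# supports the recommendation to increase memory
curl -XGET 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max'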
2. What to collect if the problem persists?
If the problem persists after applying the above checks and recommendations, collect the logs below and contact Broadcom Support:
<NFS>/jarvis/api/logs/<jarvis-apis-pod>/*.log
<NFS>/jarvis/indexer/<jarvis-indexer-pod>/*.log
<NFS>/jarvis/kafka-logs/kafka-<#>/*.log
<NFS>/jarvis/esutils/*.log
<NFS>/jarvis/zookeeper-logs/zookeeper-<#>/*.log
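A minimal sketch for bundling those logs into one archive before opening the case; the NFS mount point /nfs is an assumption, substitute your own:
# Bundle the O2 logs for Broadcom Support (adjust NFS to your mount point)
NFS=/nfs
tar czf o2-elastic-logs-$(date +%Y%m%d).tgz \
  $NFS/jarvis/api/logs \
  $NFS/jarvis/indexer \
  $NFS/jarvis/kafka-logs \
  $NFS/jarvis/esutils \
  $NFS/jarvis/zookeeper-logs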