Checking DX O2 ElasticSearch Health

Article ID: 272228

Products

DX Operational Observability

Issue/Introduction

The following is a high-level list of techniques and suggestions to employ when troubleshooting common Jarvis performance and configuration issues.

Environment

DX O2 OnPremise

Resolution

1. Checklist

1. Check for possible disk space issues on the NFS and, if the Elastic nodes use LOCAL storage, on those nodes as well.

- Disk space used should not exceed 80% on the ES nodes.

- Here is an example of a message indicating a disk space problem; ElasticSearch switches to read-only mode when the disk fills up:

blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];]
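One quick way to quantify the disk usage ElasticSearch itself sees is the _cat/allocation API (a minimal check, run from any Elastic pod as described in step 3 below; the column selection is an example):

curl -XGET 'http://localhost:9200/_cat/allocation?v&h=node,disk.used,disk.avail,disk.percent'

On the nodes themselves, df -h on the Elastic data mount shows the same picture at the filesystem level.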

Recommendations
 
a) Increase the disk size on the ElasticSearch servers
b) Reduce the data retention; see:

DX O2 ElasticSearch disk Full - How to reduce Elastic data retention?

DX O2 kafka data consuming all disk space in Elastic nodes
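Note that on older ElasticSearch versions the read-only block is not removed automatically once disk space has been freed; it has to be cleared manually (a minimal sketch, run from any Elastic pod):

curl -XPUT 'http://localhost:9200/_all/_settings' -H 'Content-Type: application/json' -d '{"index.blocks.read_only_allow_delete": null}'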

2. Check for possible physical memory issues in the server(s):

a) check memory availability on all nodes:  free -h
 
b) check the cluster health: kubectl describe nodes 
Here is an example of an OOM situation: "Warning System OOM encountered, victim process"
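The kubectl output can be filtered for OOM events directly (a minimal sketch using grep; -B2 keeps two lines of context):

kubectl describe nodes | grep -i -B2 "oom"

A MemoryPressure=True condition in the node description is another sign that a node is running out of physical memory.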
 
 
 
3. Check ElasticSearch health

- Connect to any elastic pod:

kubectl exec -ti <elastic-pod> -n <namespace> -- bash

example:

kubectl exec -ti elasticsearch-master-0 -n dxi -- bash

- Query elastic:

curl -XGET 'http://localhost:9200/_cluster/health?pretty&human'
curl -XGET 'http://localhost:9200/_cat/shards?v&h=n,i,s,dc,pr,cds,iiti,st'
curl -XGET 'http://localhost:9200/_cat/nodes?v'
curl -XGET 'http://localhost:9200/_cluster/allocation/explain?pretty'
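On a healthy cluster, the health query returns output like the following (illustrative; values will differ per environment):

{
  "cluster_name" : "elasticsearch",
  "status" : "green",
  "number_of_nodes" : 3,
  "active_shards" : 120,
  "unassigned_shards" : 0,
  ...
}

A "yellow" status means replica shards are unassigned; "red" means primary shards are unassigned and some data is currently unavailable.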

 
Verification:
 
a) Check whether there are unassigned_shards; run: curl -XGET 'http://localhost:9200/_cluster/health?pretty&human'
 
 
b) If unassigned_shards is > 0, run the two queries below:

curl -XGET 'http://localhost:9200/_cat/shards?v&h=n,i,s,dc,pr,cds,iiti,st'

This will give you more details about the affected indices.

curl -XGET 'http://localhost:9200/_cluster/allocation/explain?pretty'

This will give you more details about why the allocation failed.
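For an unassigned shard, the allocation explain output typically includes the unassignment reason and an explanation (illustrative output; the index name and reason shown are hypothetical):

{
  "index" : "jarvis_example_index",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "NODE_LEFT"
  },
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes"
}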
 
 
Solution:

Run (note that _cluster/reroute requires a POST request):

curl -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true'
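Afterwards, re-check the cluster health until unassigned_shards drops back to 0 (a minimal watch loop; if watch is not available in the pod, rerun the curl manually):

watch -n 10 "curl -s 'http://localhost:9200/_cluster/health?pretty' | grep -E 'status|unassigned_shards'"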

 
4. Check the ElasticSearch logs and search for WARN or ERROR entries
 
Below is an example of a memory issue affecting ElasticSearch:
 
[WARN ][o.e.m.j.JvmGcMonitorService] [oFOLnGK] [gc][476888] overhead, spent [1.3s] collecting in the last [1.4s]
..
[WARN ][o.e.m.j.JvmGcMonitorService] [oFOLnGK] [gc][young][476888][10243] duration [1.3s], collections [1]/[1.4s], total [1.3s]/[17m], memory [9.5gb]->[3.6gb]/[10gb], all_pools {[young] [5.5gb]->[4mb]/[0b]}{[survivor] [40mb]->[172mb]/[0b]}{[old] [3.9gb]->[3.5gb]/[10gb]}
..
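Long GC pauses like the above usually mean the JVM heap is nearly full. Heap pressure can be confirmed per node with the _cat/nodes API (a minimal check; the column selection is an example):

curl -XGET 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max,ram.percent'

A heap.percent that stays above roughly 85% indicates the node needs a larger heap.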

Recommendations

Double the memory on each of the ElasticSearch deployments.
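How the memory is increased depends on how the cluster was deployed. As a hypothetical sketch, assuming the heap is set through the ES_JAVA_OPTS environment variable on the elasticsearch-master StatefulSet from the example above (the 16g values are placeholders):

kubectl set env statefulset/elasticsearch-master ES_JAVA_OPTS="-Xms16g -Xmx16g" -n dxi

Raise the pod memory requests/limits together with the heap, and keep the heap at no more than about half of the container memory.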

2. What to collect if the problem persists

If the problem persists after applying the above checks and recommendations, collect the logs below and contact Broadcom Support:

 
<NFS>/jarvis/api/logs/<jarvis-apis-pod>/*.log
<NFS>/jarvis/indexer/<jarvis-indexer-pod>/*.log
<NFS>/jarvis/kafka-logs/kafka-<#>/*.log
<NFS>/jarvis/esutils/*.log
<NFS>/jarvis/zookeeper-logs/zookeeper-<#>/*.log
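To bundle everything for the support case (a minimal sketch; replace the placeholders with the actual NFS mount point and pod names):

tar czf jarvis-logs.tar.gz <NFS>/jarvis/api/logs <NFS>/jarvis/indexer <NFS>/jarvis/kafka-logs <NFS>/jarvis/esutils <NFS>/jarvis/zookeeper-logs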
 

Additional Information

https://knowledge.broadcom.com/external/article/190815/aiops-troubleshooting-common-issues-and.html