Symptoms:
- New alarms don't appear in OI Console
- ServiceNow ticket creation is not working
- Alarms are not getting updated
- New tenants don't appear in Elastic queries, ERROR 400 and 500 in OI Service and Performance Analytics
The following is a high-level list of techniques and suggestions to use when troubleshooting common Jarvis (Kafka, ZooKeeper, Elasticsearch) performance and configuration issues.
A) Checklist
B) What diagnostic files should I gather for CA Support?
DX Operational Intelligence 20.x, 21.x
DX Application Performance Management 20.x,21.x
- Locate the doi-nginx endpoint:
kubectl get ingress -ndxi | grep nginx
doi-nginx-ingress doi-nginx.lvntest010772.bpc.broadcom.net 80 89d
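To quickly verify that the OI Console endpoint responds, a plain HTTP request is usually enough; a minimal sketch, using the example hostname above (substitute your own):
# an HTTP status line (e.g. 200 or a redirect to the login page) confirms the ingress is reachable
curl -I http://doi-nginx.lvntest010772.bpc.broadcom.net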
1) Find out the Jarvis API endpoint:
If Kubernetes: kubectl get ingress -n<dxi-namespace> | grep jarvis
for example:
kubectl get ingress -ndxi | grep jarvis-apis
jarvis-apis <none> apis.10.109.32.88.nip.io 10.109.32.88 80 19d
If Openshift: oc get routes -n<dxi-namespace> | grep jarvis
for example:
oc get routes -ndxi | grep jarvis-apis
jarvis-apis-jwkr5 apis.munqa001493.bpc.broadcom.net / jarvis-apis 8080 None
2) Find out the ElasticSearch endpoint:
If Kubernetes: kubectl get ingress -n<dxi-namespace> | grep jarvis-es
for example:
kubectl get ingress -ndxi | grep jarvis-es
jarvis-es <none> es.10.109.32.88.nip.io 10.109.32.88 80 19d
If Openshift: oc get routes -n<dxi-namespace> | grep jarvis-es
for example:
oc get routes -ndxi | grep jarvis-es
jarvis-es-7krrv es.munqa001493.bpc.broadcom.net / jarvis-elasticsearch-lb 9200 None
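Once the Elasticsearch endpoint is known, a quick sanity check is the standard cluster health API; a minimal sketch, replace <ES_URL> with the endpoint found above:
# "status" should be green; yellow or red points to unassigned or failed shards
curl "http://<ES_URL>/_cluster/health?pretty"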
a) Open Postman
b) Run the POST REST call below to reassign failed shards; leave the request body empty (a curl equivalent is shown after this step)
http(s)://<ES_URL>/_cluster/reroute?retry_failed=true
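If Postman is not available, the same call can be issued with curl; a minimal sketch, assuming the Elasticsearch endpoint located above:
# retries allocation of shards that previously failed to allocate; the request body is intentionally empty
curl -X POST "http://<ES_URL>/_cluster/reroute?retry_failed=true"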
Recommendation: you have two options (CLI equivalents are sketched after this list):
a) Restart the ElasticSearch pods
- Go to Openshift > Application > pods
- Locate the ElasticSearch pods
- Delete each pod -- new ones will be created.
b) Increase memory on each of the ElasticSearch deployments
- Go to Openshift > Application > deployments
- Locate the ElasticSearch deployments
- Click Actions > Edit Resource Limits
- Increase "Limit" by half (NOTE: make sure you have enough memory available in the elastic server, you can use: free -h)
- Click Save -- a new pod will be created
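Both options can also be performed from the command line; a minimal sketch, assuming the Elasticsearch deployment/pods are named jarvis-elasticsearch (check the actual names in your environment first):
# option a) list and delete the Elasticsearch pods -- replacements are created automatically
oc get pods -n<namespace> | grep elasticsearch
oc delete pod <elasticsearch-pod-name> -n<namespace>
# option b) raise the memory limit on the deployment -- this rolls out a new pod
oc set resources deployment jarvis-elasticsearch -n<namespace> --limits=memory=<new-limit>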
3) Check for disk space issues in Elastic nodes and NFS
For example, the Jarvis/Elasticsearch logs may show errors such as:
[16]: index [jarvis_jmetrics_1.0_1], type [_doc], id [3f1f4151-4b4a-4a04-bde1-4ae610358e81], message [ElasticsearchException[Elasticsearch exception [type=cluster_block_exception, reason=index [jarvis_jmetrics_1.0_1] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];]]]
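To confirm whether disk space is the cause, check per-node disk usage from Elasticsearch and on the NFS/Elasticsearch hosts; a minimal sketch, assuming the Elasticsearch endpoint located above:
# disk.percent at or above the flood-stage watermark (95% by default) triggers the read-only block shown above
curl "http://<ES_URL>/_cat/allocation?v"
# on the Elasticsearch/NFS host, check free space on the data mount
df -h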
a) Increase disk size in ElasticSearch nodes
b) Reduce data retention, delete Elasticsearch backups, or manually delete unnecessary Elasticsearch indices; see: AIOps - NFS or Elastic Nodes disk full - How to reduce data retention
List all available topics:
/opt/ca/kafka/bin/kafka-topics.sh --zookeeper jarvis-zookeeper:2181 --list
/opt/ca/kafka/bin/kafka-topics.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --list
/opt/ca/kafka/bin/kafka-topics.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --describe
List all consumer groups:
/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --list
Check for a possible LAG in Jarvis (Recommendation: verify that the LAG column is not consistently > 0):
/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --describe --group jarvis_indexer
/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --describe --group indexer
/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --describe --group verifier
/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --describe --group axa.transformer
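The Kafka commands above are executed inside a Kafka broker pod; from a workstation they can be wrapped with kubectl exec, for example (a sketch; the pod name is an assumption, list the pods first):
kubectl get pods -n<namespace> | grep kafka
kubectl exec -it <kafka-pod-name> -n<namespace> -- /opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka:9092,jarvis-kafka-2:9092,jarvis-kafka-3:9092 --describe --group jarvis_indexer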
Here is an example illustrating a LAG condition:
Here is an example illustrating the consumers disconnection condition:
Recommendation: Restart the Jarvis services as described below:
Scale down:
- jarvis-verifier
- jarvis-lean-jarvis-indexer
- jarvis-indexer
Scale up:
- jarvis-verifier
- jarvis-lean-jarvis-indexer
- jarvis-indexer
Below is the list of kubectl commands:
a) Scale down the following deployments:
kubectl scale --replicas=0 deployment jarvis-verifier -n<namespace>
kubectl scale --replicas=0 deployment jarvis-lean-jarvis-indexer -n<namespace>
kubectl scale --replicas=0 deployment jarvis-indexer -n<namespace>
b) Verify that all pods are down:
kubectl get pods -n<namespace> | egrep "jarvis-verifier|jarvis-lean|jarvis-indexer"
c) Scale up deployments
kubectl scale --replicas=1 deployment jarvis-verifier -n<namespace>
kubectl scale --replicas=1 deployment jarvis-lean-jarvis-indexer -n<namespace>
kubectl scale --replicas=1 deployment jarvis-indexer -n<namespace>
d) Verify that all pods are up and running:
kubectl get pods -n<namespace> | egrep "jarvis-verifier|jarvis-lean|jarvis-indexer"
e) Verify that alarms and ServiceNow incidents are reported as expected
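On Kubernetes 1.15 or later, the scale-down/scale-up sequence above can also be replaced by a rolling restart of each deployment; a minimal sketch, using the same deployment names as above:
kubectl rollout restart deployment jarvis-verifier -n<namespace>
kubectl rollout restart deployment jarvis-lean-jarvis-indexer -n<namespace>
kubectl rollout restart deployment jarvis-indexer -n<namespace>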
If the problem persists after applying the above recommendations, collect the diagnostic files described in the following article and contact Broadcom Support:
DX AIOps - Troubleshooting, Common Issues and Best Practices
https://knowledge.broadcom.com/external/article/190815/dx-oi-troubleshooting-common-issues-and.html