Symptoms:
1/ The jarvis-kron pod is being restarted automatically:
jarvis-kafka-3-7668886457-2mlb8 1/1 Running 0 131m
jarvis-kafka-57ff4f48c9-5rs2x 1/1 Running 0 131m
jarvis-kron-6fbc789555-chz7l 1/1 Running 5 131m
jarvis-lean-jarvis-indexer-565bfdbc97-nrbx4 1/1 Running 0 131m
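To see why the pod keeps restarting, you can inspect the last container state and the logs of the previous container instance (the pod name below is simply the one from the listing above; replace the pod name and namespace with your own):
kubectl describe pod jarvis-kron-6fbc789555-chz7l -n<namespace> | grep -A5 'Last State'
kubectl logs jarvis-kron-6fbc789555-chz7l -n<namespace> --previous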
2/ The kron health check times out:
wget http://<elastic-ip>:8080/kron/health
--2022-03-11 16:45:25-- http://<elastic-ip>:8080/kron/health
Connecting to <elastic-ip>:8080... connected.
HTTP request sent, awaiting response... Read error (Connection timed out) in headers.
Retrying.
--2022-03-11 17:00:26-- (try: 2) http://<elastic-ip>:8080/kron/health
Connecting to <elastic-ip>:8080... connected.
HTTP request sent, awaiting response... No data received.
Retrying.
--2022-03-11 17:00:46-- (try: 3) http://<elastic-ip>:8080/kron/health
Connecting to <elastic-ip>:8080... failed: Connection refused.
3/ The elasticsearch health check reports yellow status: "https://<elastic-ip>/_cluster/health?level=indices". The indices can be in yellow status because of unassigned shards: "unassigned_shards":4
{"cluster_name":"jarvis-docker","status":"yellow","timed_out":false,"number_of_nodes":12,"number_of_data_nodes":12,"active_primary_shards":137,"active_shards":270,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":4
4/ Recovery of the unassigned shards fails with an ElasticsearchTimeoutException. If you apply the steps described in the section "If unassigned_shards is > 0, run below 2 queries:" of the following article, the recovery fails with an ElasticsearchTimeoutException, as in the example below.
AIOps - Jarvis (Kafka, Zookeeper, elastic search) Troubleshooting
POST http(s)://<ES_URL>/_cluster/reroute?retry_failed=true
Recovery failed from {jarvis-elasticsearch-2}{NCPGKc2NSAu6z_gdUumtpg}{eKQyv9PKSvuqrL9mWMkkEg}{<elastic-ip>}{<elastic-ip>:9300}{dimr}{box_type=hot} into {jarvis-elasticsearch-12}{9NB4EF8MRo-_0Yrm5pbTaA}{e9puJGkgTqGg_yP1U8uiWQ}{<elastic-ip>}{<elastic-ip>:9300}{dimr}{box_type=hot} (no activity after [30m])]; nested: ElasticsearchTimeoutException
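The allocation explain API can also show why a shard is still unassigned (illustrative call; without a request body it explains the first unassigned shard it finds):
GET http(s)://<ES_URL>/_cluster/allocation/explain?pretty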
5/ Some functionality may not work properly, for example some kibana dashboards.
6/ In the jarvis-es-utils.log, the following kron timeout error repeats over and over.
2022-03-14 08:10:43 INFO [main] UtilityController:165 - Kron service has not started yet. Will try again in 5 sec.
2022-03-14 08:11:48 ERROR [main] UtilityController:99 - Exception while connecting to kron
java.net.SocketTimeoutException: Read timed out
Environment:
DX Operational Intelligence 2x
DX Application Performance Management 2x
DX AXA 2x
Cause:
Performance issues on the underlying physical platform. A few nodes may be experiencing intermittent connectivity issues, and a few nodes may be under disk pressure.
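To check whether any worker node reports disk pressure or connectivity problems, the node conditions can be inspected (illustrative commands; <node-name> is a placeholder):
kubectl get nodes
kubectl describe node <node-name> | grep -iE 'DiskPressure|MemoryPressure|NetworkUnavailable|Ready'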
Resolution:
1/ Try to restart only the 2 specific jarvis services below:
kubectl scale --replicas=0 deployment jarvis-kron -n<namespace>
kubectl scale --replicas=0 deployment jarvis-esutils -n<namespace>
Please wait until both pods are terminated, then start them again. The jarvis-kron pod has to be started first.
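To confirm that both pods are really gone before scaling back up, you can repeat the check below until it returns nothing:
kubectl get pods -n<namespace> | grep -E 'jarvis-kron|jarvis-esutils'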
kubectl scale --replicas=1 deployment jarvis-kron -n<namespace>
kubectl scale --replicas=1 deployment jarvis-esutils -n<namespace>
Wait for 10 minutes and check the jarvis health again. If you still see the yellow state and the timeout errors in the logs, proceed to the next step.
2/ Check that each of your elastic pods is running on a separate physical node. Running 2 elastic pods on the same physical node should be avoided due to the performance repercussions it can cause.
kubectl get pods -n<namespace> -o wide | grep jarvis-elastic
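To quickly spot a node hosting more than one elastic pod, the pods can be counted per node (a small sketch; NODE is the 7th column of the default wide output, so adjust the field number if your kubectl version prints different columns):
kubectl get pods -n<namespace> -o wide | grep jarvis-elastic | awk '{print $7}' | sort | uniq -c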
3/ Test the IOPS throughput of the NFS storage on your elastic nodes. Compare the results with the recommended values to ensure that you are not running far below the requirements.
AIOps - How to verify throughput for NFS (IOPs and Speed).
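If you only need a quick indication before running the full procedure from the article above, a generic fio random-write test against the NFS mount can be used (illustrative only; the test directory, file size and job parameters are assumptions, and the referenced article remains the authoritative procedure). Create the test directory first and remove the generated test files afterwards.
fio --name=nfs-iops-test --directory=<nfs-dxi>/fio-test --ioengine=libaio --rw=randwrite --bs=4k --size=1G --numjobs=4 --iodepth=32 --runtime=60 --time_based --group_reporting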
IMPORTANT NOTE: If the throughput is below the requirements, you may face many similar instability issues with your elasticsearch cluster. You may need to check with your infrastructure IT team on how to align your underlying physical platform with this throughput requirement, to ensure the long-term stability and capacity of your elasticsearch cluster. The workaround below may only resolve this issue temporarily. Please do not neglect the importance of the IOPS throughput for NFS storage.
WORKAROUND:
1/ Scale down the 2 specific jarvis services below and wait until both pods are terminated:
kubectl scale --replicas=0 deployment jarvis-kron -n<namespace>
kubectl scale --replicas=0 deployment jarvis-esutils -n<namespace>
2/ Log into your elastic master pod and run the delete-by-query curl command shown below from the command line.
kubectl get pods -n<namespace> | grep jarvis-elastic
kubectl exec -ti -n<namespace> jarvis-elasticsearch-<elastic_pod_id> -- sh
curl --location --request POST 'localhost:9200/jarvis_kron/_delete_by_query?conflicts=proceed&pretty' \
--header 'Content-Type: application/json' \
--data-raw '
{
  "query": {
    "match_all": {}
  }
}'
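Optionally, you can confirm that the jarvis_kron index is empty once the delete-by-query completes:
curl -s 'localhost:9200/jarvis_kron/_count?pretty'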
3/ Scale up the 2 specific jarvis services below and wait until both pods are started. The jarvis-kron pod has to be started first (see the sequenced example after the commands below).
kubectl scale --replicas=1 deployment jarvis-kron -n<namespace>
kubectl scale --replicas=1 deployment jarvis-esutils -n<namespace>
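If you want to enforce the start order explicitly, the same two commands can be sequenced with a rollout check in between (a sketch; the 5 minute timeout is an arbitrary choice):
kubectl scale --replicas=1 deployment jarvis-kron -n<namespace>
kubectl rollout status deployment/jarvis-kron -n<namespace> --timeout=5m
kubectl scale --replicas=1 deployment jarvis-esutils -n<namespace>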
Wait for 10 minutes and check the jarvis health again. The status should now become green.
If the problem persists, open a case with Broadcom Support for assistance and attach all the jarvis logs from <nfs-dxi>/jarvis/*
DX AIOps - Jarvis (kafka, zookeeper, elasticSearch) Troubleshooting
EXPLANATION: All the jobs performed by esutils (the rollover functionality, apache, and so on) are stored in the jarvis_kron elasticsearch index. Any health check calls made against jarvis are also recorded in this index. When there is a lot of latency in the system, caused for example by performance issues on the underlying platform, many health checks fail, and a huge number of failed health check records accumulate in the index. Every restart of the jarvis-kron pod triggers further health checks that can fail as well. The delete-by-query above removes all of these previously stored jobs from the jarvis_kron elasticsearch index.
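If you want to see what is accumulating before deleting it, a few documents from the jarvis_kron index can be sampled from inside the elasticsearch pod (optional check, using the same shell as in the workaround above):
curl -s 'localhost:9200/jarvis_kron/_search?size=5&pretty'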