Troubleshooting Elasticsearch related issues within vRealize Automation 7.x

Article ID: 325892

Products

VMware Aria Suite

Issue/Introduction

VMware implements internal health checks against the Elasticsearch service to maintain vRealize Automation 7.x application reliability, because the embedded VMware Identity Manager instance relies heavily on Elasticsearch during normal operation.

This article contains common troubleshooting steps for restoring the health of an embedded Elasticsearch cluster, single-node or multi-node, within the vRealize Automation 7.x appliance(s).

Symptoms:
  • vRealize Automation 7.3 through 7.6 reports a number of unassigned shards when the following health check command is run manually:
curl http://localhost:9200/_cluster/health?pretty=true
Note:  Anything other than a Green / OK status can cause unpredictable application behavior.
  • The vRealize Automation 7.6 Virtual Appliance Management Interface Summary health page reports a failed Elasticsearch health check.
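
As a minimal sketch, the health check in the symptoms above can be reduced to the two values that matter for this article: the cluster status and the unassigned shard count. The function name es_health_summary is hypothetical and not part of the appliance. It assumes only the standard tr and awk tools and reads the health JSON on standard input.

```shell
# Hypothetical helper (not shipped with the appliance): reads the JSON returned
# by /_cluster/health on stdin and prints "<status> <unassigned_shards>".
es_health_summary() {
  # Put each key/value pair on its own line by turning commas and braces
  # into newlines, then pick out the two fields of interest.
  tr ',{}' '\n\n\n' | awk -F: '
    /"status"/            { gsub(/[" \t\r]/, "", $2); status = $2 }
    /"unassigned_shards"/ { gsub(/[" \t\r]/, "", $2); unassigned = $2 }
    END { print status, unassigned }
  '
}
```

One possible use is `curl -s 'http://localhost:9200/_cluster/health' | es_health_summary`. The leading quote in the second pattern keeps it from matching the delayed_unassigned_shards field.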


Environment

VMware vRealize Automation 7.x

Cause

Datacenter network and storage outages can leave shards in the UNASSIGNED state when Elasticsearch reassigns shards during cluster recovery. Over time, these shards can accumulate in the cluster.

Resolution

Restoring Green Status to Elasticsearch health checks in a vRealize Automation 7.x Single-Node or Multi-Node Cluster

  1. SSH into the master vRealize Automation appliance.
  2. Determine current health status:
curl http://localhost:9200/_cluster/health?pretty=true
Example:
{
  "cluster_name" : "horizon",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 10,
  "active_shards" : 10,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 10,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}
Note:  In the above command output, the Elasticsearch cluster status is one of red, yellow, or green.
A status other than green indicates UNASSIGNED shards within the cluster: red when at least one primary shard is unassigned, yellow when only replica shards are unassigned.
Note:  Elasticsearch logs are located at /opt/vmware/elasticsearch/logs/horizon.log
  3. Determine the node name(s) registered within the cluster:
curl -s -XGET http://localhost:9200/_cat/nodes
Example (the final column, Red Skull II, is the node name used in the reroute commands below): cava-n-84-170.eng.vmware.com 127.0.0.1 6   d * Red Skull II
  4. If the command output from Step #2 reports more than zero UNASSIGNED shards, list all shards and filter the output down to the UNASSIGNED ones:
curl -XGET 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED
Example:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   980  100   980    0     0  54444      0 --:--:-- --:--:-- --:--:-- 54444
searchentities 2 r UNASSIGNED CLUSTER_RECOVERED
searchentities 0 r UNASSIGNED CLUSTER_RECOVERED
searchentities 3 r UNASSIGNED CLUSTER_RECOVERED
searchentities 1 r UNASSIGNED CLUSTER_RECOVERED
searchentities 4 r UNASSIGNED CLUSTER_RECOVERED
v3_2019-07-17  4 r UNASSIGNED INDEX_CREATED
v3_2019-07-17  0 r UNASSIGNED INDEX_CREATED
v3_2019-07-17  3 r UNASSIGNED INDEX_CREATED
v3_2019-07-17  1 r UNASSIGNED INDEX_CREATED
v3_2019-07-17  2 r UNASSIGNED INDEX_CREATED
  5. Determine whether the UNASSIGNED shards can be assigned to a node by using the Cluster Reroute API with the allocate command:
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{"commands":[{"allocate":{"index":"searchentities","shard":0,"node":"Red Skull II","allow_primary":"true"}}]}'

Note:  The following response may occur if a valid copy of this shard already exists on the master:

shard cannot be allocated on same node [qAoqsUEITxuNbLXA6NASiA] it already exists on
  6. If the shards are orphaned and cannot be rerouted, attempt to cancel the replica shard:
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{"commands":[{"cancel":{"index":"searchentities","shard":0,"node":"Red Skull II","allow_primary":"true"}}]}'
  7. Re-run Step #4 to determine whether any UNASSIGNED shards failed to cancel and still persist.
  8. Continue to Step #9 only if UNASSIGNED shards remain after the previous steps.
  9. If the shards persist, delete them:
Note:  The below command issues a DELETE against the index name of every UNASSIGNED shard, so it removes each affected index in full, including any assigned shards of that index.  Attempt to reallocate or cancel the shards first.
 
curl -s -XGET 'http://localhost:9200/_cat/shards' | grep UNASSIGNED | awk '{print $1}' | sort -u | xargs -I{} curl -XDELETE "http://localhost:9200/{}"
  10. Verify that all UNASSIGNED shards have been deleted by re-running Step #4.
Example:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  11. Validate that the status returns "green":
curl http://localhost:9200/_cluster/health?pretty=true
Note:  The output should show a value of 0 for unassigned_shards.
Example:
{
  "cluster_name" : "horizon",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}
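
Because the delete command above passes the first column of the _cat/shards output (the index name) to DELETE, it can help to preview which indices it would act on before running it. The following is a minimal sketch of such a preview, not part of the documented procedure. The function name unassigned_indices is hypothetical. It assumes the state is in the fourth column, as in both the default _cat/shards output and the column list used in Step #4.

```shell
# Hypothetical dry-run helper: reads _cat/shards output on stdin and prints
# each index that has at least one UNASSIGNED shard, once, followed by the
# number of UNASSIGNED shards it has. Nothing is deleted.
unassigned_indices() {
  awk '$4 == "UNASSIGNED" { count[$1]++ }
       END { for (idx in count) print idx, count[idx] }' | sort
}
```

One possible use is `curl -s 'http://localhost:9200/_cat/shards' | unassigned_indices`. Each index name printed is one that the delete command would remove.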


Additional Information

Elasticsearch, a search and analytics engine used for auditing, reports, and directory sync logs, is embedded within the VMware vRealize Automation / Identity Manager virtual appliance. Use the curl tool to verify the health of Elasticsearch. If curl is not installed on a Windows machine, run the query from a Linux or Mac machine instead: curl http://<localhost>:9200/_cluster/health?pretty

Impact/Risks:
The shard is the unit at which Elasticsearch distributes data around the cluster. The speed at which Elasticsearch can move shards around when rebalancing data, e.g. following a failure, will depend on the size and number of shards as well as network and disk performance.

Removing CLUSTER_RECOVERED and other stale UNASSIGNED shards has little to no impact on a running cluster.  If shards remain UNASSIGNED for an extended period of time, unexpected application behavior may occur, including a failure of the Elasticsearch health status check.
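
To see how many UNASSIGNED shards fall under CLUSTER_RECOVERED versus other reasons, the shard listing can be tallied by reason. This is a sketch under stated assumptions: the function name unassigned_by_reason is hypothetical, and the input is assumed to come from the _cat/shards query with the column list index,shard,prirep,state,unassigned.reason, so that the reason is the fifth column.

```shell
# Hypothetical helper: reads the output of
#   _cat/shards?h=index,shard,prirep,state,unassigned.reason
# on stdin and prints "<count> <reason>" for every unassigned reason found.
unassigned_by_reason() {
  awk '$4 == "UNASSIGNED" { n[$5]++ }
       END { for (reason in n) print n[reason], reason }' | sort -k2
}
```

One possible use is `curl -s 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | unassigned_by_reason`.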