Troubleshooting Elasticsearch/Opensearch related issues within vRealize Automation 7.x

Article ID: 325892

Products

VMware Aria Suite

Issue/Introduction

VMware implements internal health checks against the Elasticsearch/Opensearch service to maintain vRealize Automation 7.x application reliability, because the embedded VMware Identity Manager instances rely heavily on Elasticsearch/Opensearch during normal operation.

This article contains common troubleshooting steps to restore the health of an embedded single-node or multi-node Elasticsearch/Opensearch cluster within the vRealize Automation 7.x appliance(s).

Symptoms:

  • vRealize Automation 7.3 through 7.6 reports a number of unassigned shards when the following health check command is run manually:
curl http://localhost:9200/_cluster/health?pretty=true
Note:  Any status other than Green / OK can cause unpredictable application behavior.
  • The vRealize Automation 7.6 Virtual Appliance Management Interface Summary health page fails the Elasticsearch/Opensearch health check.
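
The manual health check described in the symptoms can also be scripted. The following is a minimal sketch, assuming the health JSON is returned in the pretty-printed layout produced by pretty=true. The function names health_field and cluster_is_green are hypothetical helpers introduced for illustration; they are not shipped with the appliance.

```shell
# health_field NAME: read _cluster/health JSON on stdin and print the value
# of the named field (quotes stripped). Hypothetical helper for illustration.
health_field() {
  sed -n "s/.*\"$1\"[[:space:]]*:[[:space:]]*\"\{0,1\}\([^\",} ]*\)\"\{0,1\}.*/\1/p" | head -n 1
}

# cluster_is_green: read the health JSON on stdin, print a one-line summary,
# and succeed only when the status is green and no shards are unassigned.
cluster_is_green() {
  health_json="$(cat)"
  status="$(printf '%s\n' "$health_json" | health_field status)"
  unassigned="$(printf '%s\n' "$health_json" | health_field unassigned_shards)"
  echo "status=${status:-unknown} unassigned_shards=${unassigned:-unknown}"
  [ "$status" = "green" ] && [ "$unassigned" = "0" ]
}
```

On the appliance, this could be used as: curl -s 'http://localhost:9200/_cluster/health?pretty=true' | cluster_is_green || echo "Elasticsearch/Opensearch needs attention"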



Environment

VMware Identity Manager 3.3.x

VMware vRealize Automation 7.x

 

Cause

Datacenter network and storage outages can, over time, leave UNASSIGNED shards persisting in a cluster when Elasticsearch/Opensearch runs its shard assignment tasks during cluster recovery.

Resolution

Restoring Green Status to Elasticsearch/Opensearch health checks in a vRealize Automation 7.x Single-Node or Multi-Node Cluster

  1. SSH into the master vRealize Automation appliance.
  2. Determine current health status:
curl http://localhost:9200/_cluster/health?pretty=true
Example
{
  "cluster_name" : "horizon",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 10,
  "active_shards" : 10,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 10,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}
 
Note:  In the above command output, the Elasticsearch/Opensearch cluster status can be Red, Yellow, or Green.
The status is not Green while UNASSIGNED shards exist within the cluster: unassigned replica shards normally report Yellow, and unassigned primary shards report Red.
Note:  Elasticsearch/Opensearch logs are located at /opt/vmware/elasticsearch/logs/horizon.log  
 
  3. Determine the node name(s) registered within the cluster (the node name is the last column of the output):
curl -s -XGET http://localhost:9200/_cat/nodes
Example: cava-n-84-170.eng.vmware.com 127.0.0.1 6   d * Red Skull II
 
  4. If the command output from Step #2 reports one or more UNASSIGNED shards, list all shards and filter the output for the UNASSIGNED ones:
curl -XGET 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED
Example:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   980  100   980    0     0  54444      0 --:--:-- --:--:-- --:--:-- 54444
searchentities 2 r UNASSIGNED CLUSTER_RECOVERED
searchentities 0 r UNASSIGNED CLUSTER_RECOVERED
searchentities 3 r UNASSIGNED CLUSTER_RECOVERED
searchentities 1 r UNASSIGNED CLUSTER_RECOVERED
searchentities 4 r UNASSIGNED CLUSTER_RECOVERED
v3_2019-07-17  4 r UNASSIGNED INDEX_CREATED
v3_2019-07-17  0 r UNASSIGNED INDEX_CREATED
v3_2019-07-17  3 r UNASSIGNED INDEX_CREATED
v3_2019-07-17  1 r UNASSIGNED INDEX_CREATED
v3_2019-07-17  2 r UNASSIGNED INDEX_CREATED
 
  5. Determine whether the UNASSIGNED shards can be assigned to another replica member by using the Cluster Reroute function with the allocate command:
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{"commands":[{"allocate":{"index":"searchentities","shard":0,"node":"Red Skull II","allow_primary":"true"}}]}'

Note:  The following response may occur if a valid copy of this shard already exists on the master:

shard cannot be allocated on same node [qAoqsUEITxuNbLXA6NASiA] it already exists on
  6. If shards are orphaned and cannot be rerouted, attempt to cancel the replica shard:
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{"commands":[{"cancel":{"index":"searchentities","shard":0,"node":"Red Skull II","allow_primary":"true"}}]}'
  7. Re-run Step #4 to determine whether the UNASSIGNED shards failed to cancel and still persist.
  8. Continue to Step #9 only if UNASSIGNED shards remain after the previous steps.
  9. If the shards persist, delete them:
Note:  The command below DELETES every index that contains an UNASSIGNED shard, which removes those shards from the Elasticsearch/Opensearch cluster.  Attempt to reallocate or cancel the shards first.
 
curl -XGET http://localhost:9200/_cat/shards | grep UNASSIGNED | awk '{print $1}' | sort -u | xargs -I{} curl -XDELETE "http://localhost:9200/{}"
  10. Verify that all UNASSIGNED shards have been deleted by re-running Step #4.
Example:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  11. Validate that the status returns "green":
curl http://localhost:9200/_cluster/health?pretty=true
Note:  The output should show a value of 0 for unassigned_shards.
Example:
{
  "cluster_name" : "horizon",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}
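
Before deciding whether to reroute, cancel, or delete, it can help to see how the UNASSIGNED shards are grouped. The following is a minimal sketch, assuming the five-column output of the _cat/shards command with h=index,shard,prirep,state,unassigned.reason used in the procedure above. The function name summarize_unassigned is a hypothetical helper introduced for illustration.

```shell
# summarize_unassigned: read the five-column _cat/shards output on stdin
# (index, shard, prirep, state, unassigned.reason) and print
# "<index> <reason> <count>" for every group of UNASSIGNED shards.
summarize_unassigned() {
  awk '$4 == "UNASSIGNED" { count[$1 " " $5]++ }
       END { for (key in count) print key, count[key] }' | sort
}
```

On the appliance, this could be used as: curl -s 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | summarize_unassigned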



Additional Information

Elasticsearch/Opensearch, a search and analytics engine used for auditing, reports, and directory sync logs, is embedded within the VMware vRealize Automation / Identity Manager virtual appliance. Use the curl tool to verify the health of Elasticsearch/Opensearch. If curl is not installed on a Windows machine, run the query from a Linux or Mac machine instead: curl http://<localhost>:9200/_cluster/health?pretty
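
When verifying health with curl, the cluster may take some time to settle, so a single query is not always conclusive. The following is a minimal sketch of a polling helper, assuming port 9200 on the appliance is reachable from the machine running it. ES_HOST, fetch_health, and wait_for_green are hypothetical names introduced for illustration, and the attempt count and delay are arbitrary defaults.

```shell
# ES_HOST is an assumed variable: the appliance to query (defaults to localhost).
ES_HOST="${ES_HOST:-localhost}"

# fetch_health: print the cluster health JSON. Kept separate so it can be
# replaced with a stub when no cluster is available.
fetch_health() {
  curl -s --max-time 10 "http://${ES_HOST}:9200/_cluster/health?pretty=true"
}

# wait_for_green [ATTEMPTS] [DELAY_SECONDS]: poll until the status is green.
# Succeeds on green; fails once all attempts are used up.
wait_for_green() {
  attempts="${1:-10}"
  delay="${2:-5}"
  attempt=1
  while [ "$attempt" -le "$attempts" ]; do
    status="$(fetch_health | sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1)"
    echo "attempt ${attempt}/${attempts}: status=${status:-unreachable}"
    if [ "$status" = "green" ]; then
      return 0
    fi
    if [ "$attempt" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}
```

For example, wait_for_green 12 10 would query the health endpoint up to 12 times, 10 seconds apart, and return a non-zero exit code if the cluster never reports green.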

Impact/Risks:
The shard is the unit at which Elasticsearch/Opensearch distributes data around the cluster. The speed at which Elasticsearch/Opensearch can move shards around when rebalancing data, e.g. following a failure, will depend on the size and number of shards as well as network and disk performance.

Removing CLUSTER_RECOVERED and other stale UNASSIGNED shards has little to no impact on a running cluster.  If shards remain UNASSIGNED for an extended period of time, unexpected application behavior may occur, including a failure of the Elasticsearch/Opensearch health status check.
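
Because removal is destructive, it can be useful to preview what would be removed for a given unassigned reason before running any DELETE. The following is a minimal sketch, assuming the five-column _cat/shards output (index, shard, prirep, state, unassigned.reason). The function names indices_with_reason and preview_delete are hypothetical helpers introduced for illustration; the preview only prints commands and does not contact the cluster.

```shell
# indices_with_reason REASON: read the five-column _cat/shards output on stdin
# and print, once each, every index that has an UNASSIGNED shard with REASON.
indices_with_reason() {
  awk -v reason="$1" '$4 == "UNASSIGNED" && $5 == reason { print $1 }' | sort -u
}

# preview_delete: read index names on stdin and print the DELETE command that
# would remove each index. Nothing is executed.
preview_delete() {
  while IFS= read -r index; do
    if [ -n "$index" ]; then
      printf 'curl -XDELETE "http://localhost:9200/%s"\n' "$index"
    fi
  done
}
```

On the appliance, this could be used as: curl -s 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | indices_with_reason CLUSTER_RECOVERED | preview_delete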