Troubleshooting Elasticsearch/Opensearch related issues within vRealize Automation 7.x

Products

VMware Aria Suite

Issue/Introduction

VMware implements internal health checks against the Elasticsearch/Opensearch service to maintain vRealize Automation 7.x application reliability, because the embedded VMware Identity Manager instances rely heavily on Elasticsearch/Opensearch in their normal operations.

This article contains common troubleshooting steps to restore the health of an embedded Elasticsearch/Opensearch cluster, single-node or multi-node, within the vRealize Automation 7.x appliance(s).

Symptoms:

vRealize Automation 7.3 through 7.6 reports a number of unassigned shards when the following health check command is run manually:

curl http://localhost:9200/_cluster/health?pretty=true

Note: Anything other than a Green / OK status can cause unpredictable application behavior.

The vRealize Automation 7.6 Virtual Appliance Management Interface Summary health page fails the Elasticsearch/Opensearch health check.

Environment

VMware Identity Manager 3.3.x

VMware vRealize Automation 7.x

Cause

Over time, datacenter network and storage outages can leave UNASSIGNED shards persisting in a cluster after Elasticsearch/Opensearch runs its shard assignment tasks during cluster recovery.

Resolution

Restoring Green Status to Elasticsearch/Opensearch health checks in a vRealize Automation 7.x Single node or Multi-Node Cluster

SSH into the master vRealize Automation appliance.
Determine current health status:

curl http://localhost:9200/_cluster/health?pretty=true
Example:
{
"cluster_name" : "horizon",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 10,
"active_shards" : 10,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 10,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0
}

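Note: In the above command output, the Elasticsearch/Opensearch cluster status can be Red, Yellow, or Green.
The health status is not Green while there are UNASSIGNED shards within the cluster. In Elasticsearch, Yellow indicates that only replica shards are unassigned, and Red indicates that at least one primary shard is unassigned.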
Note: Elasticsearch/Opensearch logs are located at /opt/vmware/elasticsearch/logs/horizon.log
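
For scripting, the status and the unassigned shard count can be pulled out of the health response. The following is a minimal sketch, not part of the original procedure: the function names health_field and cluster_needs_attention are hypothetical, and the parsing uses awk on the pretty-printed JSON because it is assumed that a JSON tool such as jq may not be present on the appliance.

```shell
# Hypothetical helper: read the pretty-printed /_cluster/health JSON on stdin
# and print the value of one top-level field, with quotes and commas stripped.
# Assumes one field per line, as in the example output above.
health_field() {
    awk -F'"' -v f="$1" '$2 == f { v = $3 $4; gsub(/[ :,}]/, "", v); print v }'
}

# Hypothetical helper: succeed (exit 0) when the health JSON on stdin shows a
# status other than green, or a non-zero unassigned_shards count.
cluster_needs_attention() {
    json=$(cat)
    status=$(printf '%s\n' "$json" | health_field status)
    unassigned=$(printf '%s\n' "$json" | health_field unassigned_shards)
    [ "$status" != "green" ] || [ "${unassigned:-0}" -gt 0 ]
}

# Possible usage on the appliance (not executed here):
#   curl -s 'http://localhost:9200/_cluster/health?pretty=true' | health_field status
```

Matching on the exact field name keeps delayed_unassigned_shards from being confused with unassigned_shards.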

Determine node name(s) registered within the cluster:

curl -s -XGET http://localhost:9200/_cat/nodes
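Example:
cava-n-84-170.eng.vmware.com 127.0.0.1 6 d * Red Skull II

Note: In this example, the node name is the final field, Red Skull II. It is the value used for "node" in the reroute commands in the later steps.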

If the health check output from Step 2 reports more than zero unassigned shards, list all shards with further details and filter the output to only those that are UNASSIGNED:

curl -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason | grep UNASSIGNED

Example:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   980  100   980    0     0  54444      0 --:--:-- --:--:-- --:--:-- 54444
searchentities 2 r UNASSIGNED CLUSTER_RECOVERED
searchentities 0 r UNASSIGNED CLUSTER_RECOVERED
searchentities 3 r UNASSIGNED CLUSTER_RECOVERED
searchentities 1 r UNASSIGNED CLUSTER_RECOVERED
searchentities 4 r UNASSIGNED CLUSTER_RECOVERED
v3_2019-07-17 4 r UNASSIGNED INDEX_CREATED
v3_2019-07-17 0 r UNASSIGNED INDEX_CREATED
v3_2019-07-17 3 r UNASSIGNED INDEX_CREATED
v3_2019-07-17 1 r UNASSIGNED INDEX_CREATED
v3_2019-07-17 2 r UNASSIGNED INDEX_CREATED
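
When many shards are listed, a per-index summary can be easier to read than the raw rows. The sketch below is illustrative only: summarize_unassigned is a hypothetical name, and it assumes the five-column layout requested with h=index,shard,prirep,state,unassigned.reason. In the prirep column, p marks a primary shard and r marks a replica.

```shell
# Hypothetical helper: read the five-column shard listing on stdin and print
# one line per index / prirep / reason combination, followed by a count.
# Rows that are not UNASSIGNED (including curl progress lines) are ignored.
summarize_unassigned() {
    awk '$4 == "UNASSIGNED" { count[$1 " " $3 " " $5]++ }
         END { for (k in count) print k, count[k] }' | sort
}

# Possible usage on the appliance (not executed here):
#   curl -s 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | summarize_unassigned
```

For the example output above, this would report five replica shards for each of the two indices.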

Determine if the UNASSIGNED shards can be assigned to another replica member with the Cluster Reroute function and allocate command:

Note: See the Elasticsearch Cluster Reroute API technical documentation for details on this function.

curl -XPOST 'localhost:9200/_cluster/reroute' -d '{"commands":[{"allocate":{"index":"searchentities","shard":0,"node":"Red Skull II","allow_primary":"true"}}]}'

Note: The following response may occur if a valid copy of this shard already exists on the master:

shard cannot be allocated on same node [qAoqsUEITxuNbLXA6NASiA] it already exists on

If shards are orphaned and cannot be rerouted, attempt to cancel the replica shard:

curl -XPOST 'localhost:9200/_cluster/reroute' -d '{"commands":[{"cancel":{"index":"searchentities","shard":0,"node":"Red Skull II","allow_primary":"true"}}]}'
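
Because the allocate and cancel requests differ only in the command name, index, shard number, and node, the request body can be generated rather than typed by hand. This is a hedged sketch under stated assumptions: reroute_payload and print_reroute_commands are hypothetical names, the body mirrors the two commands shown above, and the second function only prints the curl commands for review instead of running them.

```shell
# Hypothetical helper: build the JSON body for a Cluster Reroute request.
# $1 = command (allocate or cancel), $2 = index, $3 = shard number, $4 = node name
reroute_payload() {
    printf '{"commands":[{"%s":{"index":"%s","shard":%s,"node":"%s","allow_primary":"true"}}]}\n' \
        "$1" "$2" "$3" "$4"
}

# Hypothetical helper: read a shard listing on stdin (index, shard, prirep,
# state, ...) and PRINT one reroute command per UNASSIGNED shard.
# Nothing is sent to the cluster; the output is meant to be reviewed first.
# $1 = command (allocate or cancel), $2 = node name
print_reroute_commands() {
    rr_cmd=$1
    rr_node=$2
    awk '$4 == "UNASSIGNED" { print $1, $2 }' | while read -r rr_index rr_shard; do
        printf "curl -XPOST 'localhost:9200/_cluster/reroute' -d '%s'\n" \
            "$(reroute_payload "$rr_cmd" "$rr_index" "$rr_shard" "$rr_node")"
    done
}

# Possible usage (not executed here):
#   reroute_payload allocate searchentities 0 "Red Skull II"
```

Printing the commands first keeps the decision to allocate or cancel each shard with the operator.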

Determine whether the UNASSIGNED shards failed to cancel and still persist by re-running Step #4 (the UNASSIGNED shard listing command).
Continue to Step #9 (deleting the shards) only if UNASSIGNED shards remain after completing the previous steps.
If the shards persist, delete them:

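Note: The command below will DELETE all UNASSIGNED shards from the Elasticsearch/Opensearch cluster. Because the DELETE request is sent to the index name, every index that has at least one UNASSIGNED shard is removed as a whole, not only its unassigned shards. It is recommended to attempt to reallocate or cancel the shards first.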

curl -XGET http://localhost:9200/_cat/shards | grep UNASSIGNED | awk '{print $1}' | sort -u | xargs -I {} curl -XDELETE "http://localhost:9200/{}"
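
Before running the delete, it may be worth previewing which index names the pipeline would act on, since a DELETE request against an index name removes that index. The sketch below is an assumption-labeled illustration: indices_with_unassigned is a hypothetical name, and it relies on the default _cat/shards column order (index, shard, prirep, state) seen in the listing step.

```shell
# Hypothetical helper: read "_cat/shards" output on stdin and print, once each,
# the name of every index that has at least one UNASSIGNED shard.
# This only lists names; it does not delete anything.
indices_with_unassigned() {
    awk '$4 == "UNASSIGNED" { print $1 }' | sort -u
}

# Possible usage on the appliance (not executed here):
#   curl -s http://localhost:9200/_cat/shards | indices_with_unassigned
```

If the list contains an index that should be kept, return to the reallocate or cancel steps for that index instead of deleting it.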

Verify that all UNASSIGNED shards have been deleted by re-running Step #4 (the UNASSIGNED shard listing command). The command should return no shard rows.

Example:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

Validate status returns "green":

curl http://localhost:9200/_cluster/health?pretty=true

Note: The output should show a value of 0 for unassigned_shards.
Example:
{
"cluster_name" : "horizon",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0
}
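
The cluster may take a short time to settle after shards are rerouted or removed, so the final check can be repeated until the status turns green. The following is a minimal sketch rather than a documented procedure: wait_for_green and fetch_health are hypothetical names, and the attempt count and delay in the usage comment are arbitrary example values.

```shell
# Hypothetical wrapper around the health check command used in this article.
fetch_health() {
    curl -s 'http://localhost:9200/_cluster/health?pretty=true'
}

# Hypothetical helper: run a health-fetching command repeatedly until its
# output contains "status" : "green", or give up after a number of attempts.
# $1 = command that prints the health JSON, $2 = max attempts, $3 = delay in seconds
wait_for_green() {
    wg_fetch=$1
    wg_attempts=$2
    wg_delay=$3
    wg_i=1
    while [ "$wg_i" -le "$wg_attempts" ]; do
        if $wg_fetch | grep -Eq '"status"[[:space:]]*:[[:space:]]*"green"'; then
            echo "green after $wg_i attempt(s)"
            return 0
        fi
        wg_i=$((wg_i + 1))
        sleep "$wg_delay"
    done
    echo "not green after $wg_attempts attempt(s)" >&2
    return 1
}

# Possible usage on the appliance (not executed here):
#   wait_for_green fetch_health 10 5
```

Passing the fetch command as an argument allows the loop to be exercised with canned output before it is pointed at a live appliance.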

Additional Information

Elasticsearch/Opensearch, a search and analytics engine used for auditing, reports, and directory sync logs, is embedded within the VMware vRealize Automation / Identity Manager virtual appliance. To verify the health of Elasticsearch/Opensearch, use the curl tool. If curl is not installed on a Windows machine, run the query from a Linux or Mac machine instead: curl http://<localhost>:9200/_cluster/health?pretty

Impact/Risks:
The shard is the unit at which Elasticsearch/Opensearch distributes data around the cluster. The speed at which Elasticsearch/Opensearch can move shards around when rebalancing data, e.g. following a failure, will depend on the size and number of shards as well as network and disk performance.

Removing CLUSTER_RECOVERED and other stale UNASSIGNED shards has limited to no impact on a running cluster. If shards remain UNASSIGNED for an extended period of time, unexpected application behavior may occur, including a failure of the Elasticsearch/Opensearch health status check.