Troubleshooting OpenSearch Cluster Health in VMware Identity Manager
Article ID: 380013

Updated On:

Products

VMware Aria Suite

Issue/Introduction

OpenSearch health issues commonly arise when creating a cluster in VMware Identity Manager, often leading to errors displayed on the System Diagnostic/Resiliency page. This guide outlines steps to troubleshoot and resolve common issues, particularly concerning disk usage and RabbitMQ queues.

Environment

VMware Identity Manager 3.3.7

Cause

Common causes for cluster health issues include:

Insufficient disk space on the /db filesystem, which can prevent OpenSearch from writing new data.
High volume of pending messages in the RabbitMQ analytics queue, indicating potential processing bottlenecks.
Inconsistent cluster state across nodes, often leading to yellow or red cluster status.

Resolution

Check Disk Space:

Use the command:

df -h

Ensure the /db filesystem is monitored and does not exceed 85% usage. If nearing capacity, consider adjusting data retention policies or adding disk space.
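
The check above can be scripted. The helper below is a sketch, not a vIDM tool: it parses `df -P` output for a mount point and warns when usage crosses a threshold.

```shell
# Hypothetical helper, not part of vIDM: report whether a filesystem has
# crossed a usage threshold. Usage: check_usage <mountpoint> <threshold-%>
check_usage() {
  # Column 5 of `df -P` is the capacity, e.g. "90%"; strip the percent sign.
  pct=$(df -P "$1" | awk 'NR==2 { sub("%", "", $5); print $5 }')
  if [ "$pct" -gt "$2" ]; then
    echo "WARNING: $1 at ${pct}% (threshold $2%)"
    return 1
  fi
  echo "OK: $1 at ${pct}% (threshold $2%)"
}

# On a vIDM node you would run: check_usage /db 85
```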

Monitor RabbitMQ Queues:

Check the analytics queue size with:

rabbitmqctl list_queues | grep analytics

If the pending message count exceeds 100, investigate and address the underlying processing issue.
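
The threshold check can be automated. This wrapper is an assumption, not a shipped vIDM tool: it parses the `<queue-name> <message-count>` lines that `rabbitmqctl list_queues` prints and flags a queue whose backlog exceeds a limit.

```shell
# Hypothetical wrapper: flag a queue whose backlog exceeds a limit.
check_queue() {  # usage: rabbitmqctl list_queues | check_queue <name> <max>
  awk -v q="$1" -v max="$2" '$1 ~ q {
    if ($2 > max) { printf "WARNING: %s has %d pending messages\n", $1, $2; bad = 1 }
    else          { printf "OK: %s has %d pending messages\n", $1, $2 }
  } END { exit bad }'
}

# Offline example using captured output instead of a live broker:
printf 'analytics\t42\nsystem\t3\n' | check_queue analytics 100
```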

Verify OpenSearch Cluster Health:

Check cluster health status:

curl 'http://localhost:9200/_cluster/health?pretty=true'

Example output:

vidm-hostname:~# curl 'http://localhost:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "<cluster-name>",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 5,
  "active_shards" : 10,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "number_of_pending_tasks" : 0
}

Status indicators:

Green: everything is good; there are enough nodes in the cluster to keep at least two full copies of the data spread across the cluster.
Yellow: functioning, but there are not enough nodes in the cluster to ensure HA (e.g., a single-node cluster is always yellow because it can never hold two copies of the data).
Red: broken; the cluster cannot query existing data or store new data, typically because too few nodes are available or a node is out of disk space.
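
For scripting, the status field can be extracted and mapped to the meanings above. The helper below is an assumption, not shipped with vIDM; it reads `_cluster/health` JSON on stdin.

```shell
# Hypothetical helper: extract "status" from _cluster/health output and
# print the corresponding meaning.
explain_status() {
  status=$(sed -n 's/.*"status" *: *"\([a-z]*\)".*/\1/p')
  case "$status" in
    green)  echo "green: full redundancy, at least two copies of each shard" ;;
    yellow) echo "yellow: serving queries, but not enough nodes for HA" ;;
    red)    echo "red: cluster cannot serve queries or store new data" ;;
    *)      echo "unrecognized status: $status" ;;
  esac
}

# On a vIDM node: curl -s 'http://localhost:9200/_cluster/health' | explain_status
# Offline example:
echo '{"cluster_name":"<cluster-name>","status":"yellow"}' | explain_status
```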

Check Cluster State Across Nodes:

Verify that all nodes agree on the same master node:

curl http://localhost:9200/_cluster/state/master_node,nodes?pretty

vidm-hostname:~# curl 'http://localhost:9200/_cluster/state/master_node,nodes?pretty'
{
  "cluster_name" : "<cluster-name>",
  "master_node" : "<master-node-id>",
  "nodes" : {
    "<node-id-1>" : {
      "name" : "One Above All",
      "transport_address" : "xx.xx.xx.xx:9300",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      }
    },
    "<node-id-2>" : {
      "name" : "test name",
      "transport_address" : "xx.xx.xx.xx:9300",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      }
    },
    "<node-id-3>" : {
      "name" : "Hero for Hire",
      "transport_address" : "xx.xx.xx.xx:9300",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      }
    }
  }
}

Compare the output across all three nodes. A common reason for a yellow or red status on the OpenSearch cluster is that the nodes do not share a common view of the cluster.

You might see that one node does not list the same master node as the others, or does not list the other nodes at all. That node is most likely the cause of the yellow state.

Most of the time, restarting the OpenSearch service is enough to bring the node back into the cluster. Before restarting, repeat the disk-space checks above and free up or add disk space if required.

On the node at fault, run the following commands, first checking the status of the service:

service opensearch status
service opensearch stop
service opensearch start

Give OpenSearch time to start, then verify that all nodes report the same master node and that the master node lists all of the other nodes:

curl http://localhost:9200/_cluster/state/master_node,nodes?pretty

Next, check the cluster health again:

curl 'http://localhost:9200/_cluster/health?pretty=true'
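
Rather than re-running the health check by hand, a polling loop can wait for the cluster to recover. The sketch below makes assumptions (URL, 60-second default timeout, 5-second interval); the grep pattern expects compact JSON, i.e. no ?pretty on the URL.

```shell
# Hypothetical polling loop: wait until _cluster/health reports green,
# or give up after a timeout.
wait_for_green() {  # usage: wait_for_green [timeout-seconds] [interval-seconds]
  deadline=$(( $(date +%s) + ${1:-60} ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    if curl -s 'http://localhost:9200/_cluster/health' | grep -q '"status":"green"'; then
      echo "cluster is green"
      return 0
    fi
    sleep "${2:-5}"
  done
  echo "timed out waiting for green status" >&2
  return 1
}
```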

Monitor Logs:

Review OpenSearch logs located at /opt/vmware/opensearch/logs for additional insights on issues after migration from Elasticsearch.
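
A quick way to surface problems is to filter the logs for warnings and errors. The wrapper below is an assumption, not a vIDM utility; it scans every `.log` file in a directory.

```shell
# Hypothetical convenience wrapper: show the most recent warnings and
# errors from a log directory (vIDM 3.3.7 keeps OpenSearch logs under
# /opt/vmware/opensearch/logs).
recent_errors() {  # usage: recent_errors <log-directory>
  grep -hE 'WARN|ERROR' "$1"/*.log 2>/dev/null | tail -n 50
}

# On a vIDM node: recent_errors /opt/vmware/opensearch/logs
```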

(Note: for versions of vIDM earlier than 3.3.7, replace opensearch with elasticsearch wherever mentioned. These older versions are now EOL.) 

Additional Information

If persistent red status occurs, determine the master node and restart its service. Monitor which node assumes the master role after the restart.
For data retention and growth management, refer to the OpenSearch documentation regarding policies.
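
To determine the current master node before restarting it, the `_cat/master` endpoint is convenient; the extraction helper below is an assumption, not part of vIDM.

```shell
# _cat/master prints "<node-id> <host> <ip> <node-name>" for the elected
# master; this helper keeps only the node name (the last column).
master_name() { awk '{ print $NF }'; }

# On a vIDM node: curl -s 'http://localhost:9200/_cat/master' | master_name
# Offline example with captured output:
echo 'Ab12Cd34 vidm-1 10.0.0.1 vidm-node-1' | master_name
```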