VMware Identity Manager (vIDM / WSA) opensearch service will not start.

Article ID: 315176

Updated On:

Products

VMware Aria Suite

Issue/Introduction

  • The opensearch service will not start.
    • Running remediate or other vIDM requests through LCM may fail on the health check.
    • In an SSH session on the vIDM nodes, opensearch is seen to be stopped while horizon-workspace is running:

/etc/init.d/opensearch status

Not running

/etc/init.d/horizon-workspace status

RUNNING as PID=_____

Environment

  • VMware Identity Manager 3.3.7

Cause

  • This may be caused by a stale Liquibase lock.

Resolution

  • First confirm that opensearch is Not Running but horizon-workspace is Running:
    • /etc/init.d/opensearch status
    • /etc/init.d/horizon-workspace status
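The two status checks above can be wrapped in a small offline sketch. The `precondition_ok` function here is hypothetical (not part of vIDM): it takes the two status strings as arguments so the logic can be exercised without a vIDM node, and the commented command shows how it would be fed on a real appliance.

```shell
# Hypothetical helper: returns 0 only when opensearch reports "Not running"
# while horizon-workspace reports "RUNNING", i.e. the symptom in this article.
precondition_ok() {  # $1 = opensearch status output, $2 = horizon-workspace status output
  case "$1" in *"Not running"*) ;; *) return 1 ;; esac
  case "$2" in *RUNNING*) return 0 ;; *) return 1 ;; esac
}

# On a vIDM node you would run:
#   precondition_ok "$(/etc/init.d/opensearch status)" "$(/etc/init.d/horizon-workspace status)" \
#       && echo "symptom matches this article"
precondition_ok "Not running" "RUNNING as PID=1234" && echo "symptom matches this article"
```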

 

  • Try to simply restart opensearch:
    • /etc/init.d/opensearch restart

If the restart hangs for several minutes at "Waiting for IDM", it can be interrupted with Ctrl+C.
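Instead of watching the restart and pressing Ctrl+C, the wait can be bounded with the coreutils `timeout` command (assuming it is available on the appliance; exit status 124 means the time limit was hit). The `bounded_restart` wrapper below is a sketch, not a vIDM tool:

```shell
# Hypothetical wrapper: run a command under a time limit and report the result.
bounded_restart() {  # $1 = time limit in seconds, rest = command to run
  limit="$1"; shift
  timeout "$limit" "$@" && rc=0 || rc=$?
  if [ "$rc" -eq 0 ]; then
    echo "restarted cleanly"
  elif [ "$rc" -eq 124 ]; then
    echo "hung for ${limit}s and was killed - proceed with the lock cleanup"
  else
    echo "failed (rc=$rc)"
  fi
}

# On a node: bounded_restart 180 /etc/init.d/opensearch restart
bounded_restart 5 true   # demo with a command that finishes immediately
```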

 

  • A common cause of this issue is Liquibase being unable to acquire its lock, for example after an unclean restart of opensearch.

Step 2 only needs to be executed once for the cluster; the remaining steps are run on all nodes.

    1. Make sure Opensearch service is stopped on all nodes:
      /etc/init.d/opensearch stop

    2. Release locks (once for the cluster is enough - run on psql primary node)
      /usr/sbin/hznAdminTool liquibaseOperations -forceReleaseLocks

    3. Restart the main vIDM service - first on primary, wait a minute or two, then the other two nodes:
      service horizon-workspace restart

    4. Start opensearch on all nodes:
      /etc/init.d/opensearch start
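Steps 1-4 above can be sketched as one script. The node names, the `PRIMARY` variable, and the use of ssh as root are assumptions to adjust for your deployment; with `DRY_RUN=1` (the default here) the script only prints the commands it would issue.

```shell
NODES="${NODES:-node1 node2 node3}"   # hypothetical hostnames, primary first
PRIMARY="${PRIMARY:-node1}"           # the psql primary node
DRY_RUN="${DRY_RUN:-1}"               # set to 0 to actually execute via ssh

run_on() {  # run_on <node> <command>
  if [ "$DRY_RUN" = 1 ]; then
    echo "[$1] $2"
  else
    ssh "root@$1" "$2"
  fi
}

remediate() {
  # Step 1: stop opensearch on all nodes
  for n in $NODES; do run_on "$n" "/etc/init.d/opensearch stop"; done
  # Step 2: release Liquibase locks once, on the psql primary
  run_on "$PRIMARY" "/usr/sbin/hznAdminTool liquibaseOperations -forceReleaseLocks"
  # Step 3: restart horizon-workspace on the primary first, then the rest
  run_on "$PRIMARY" "service horizon-workspace restart"
  [ "$DRY_RUN" = 1 ] || sleep 120   # give the primary a minute or two
  for n in $NODES; do
    [ "$n" = "$PRIMARY" ] || run_on "$n" "service horizon-workspace restart"
  done
  # Step 4: start opensearch on all nodes
  for n in $NODES; do run_on "$n" "/etc/init.d/opensearch start"; done
}

remediate
```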

 

Workaround: if forceReleaseLocks fails

  • If the hznAdminTool command above hangs and does not complete, there may be another lock which must be manually removed:
    1. First confirm cluster health as per KB 367175. If hznAdminTool returns the error "The connection attempt failed", this can indicate that the delegateIP needs to be assigned to the psql primary node on eth0:0.
    2. Make sure Opensearch service is stopped on all nodes:
      /etc/init.d/opensearch stop

    3. Log in to the DB on psql primary node with this command:
      sudo -u postgres psql -h localhost -U horizon saas

    4. Check for a lock here:
      select * from saas.DatabaseChangeLogLock;

    5. If a lock is found above (LOCKED = t, along with a lock-granted date and an IP address), remove it like so:
      update saas.DATABASECHANGELOGLOCK SET LOCKED=false, LOCKGRANTED=null, LOCKEDBY=null where ID=1;

    6. Log out of the database with \q, then perform steps 2-4 above: release the Liquibase locks, restart horizon-workspace, and start opensearch.
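The SQL from steps 4-5 can be collected in one place and fed to psql non-interactively rather than typed at the prompt. This is a sketch; the `AND locked = true` guard is an addition here that makes re-running the statement harmless.

```shell
# Hypothetical helper: print the inspect-and-clear SQL for the Liquibase lock.
clear_lock_sql() {
  cat <<'SQL'
SELECT * FROM saas.databasechangeloglock;
UPDATE saas.databasechangeloglock
   SET locked = false, lockgranted = NULL, lockedby = NULL
 WHERE id = 1 AND locked = true;
SQL
}

# On the psql primary node you would run:
#   clear_lock_sql | sudo -u postgres psql -h localhost -U horizon saas
clear_lock_sql
```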

 

(Note: for versions of vIDM earlier than 3.3.7, replace opensearch with elasticsearch wherever mentioned. These older versions are now EOL.)

Additional Information

  • Impact/Risks:

    Brief service restart. If vIDM is actively serving user logins, users may experience a momentary disconnect.

  • Health Status:
    curl http://localhost:9200/_cluster/health?pretty=true 

    Green:
    everything is good; there are enough nodes in the cluster to keep at least two full copies of the data spread across the cluster.

    Yellow:
    functioning, but there are not enough nodes in the cluster to ensure HA (e.g., a single-node cluster is always yellow because it can never hold two copies of the data). This is expected for a single-node deployment and is not a problem as long as functionality is unaffected.

    Red: broken; existing data cannot be queried and new data cannot be stored, typically because there are not enough nodes in the cluster to function or the cluster is out of disk space.
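The status field from the health check above can be extracted and mapped to these meanings with a small offline sketch. The `health_summary` function is hypothetical, and the sample JSON document below is illustrative, not real cluster output.

```shell
# Hypothetical helper: read _cluster/health JSON on stdin, report the status.
health_summary() {
  status=$(sed -n 's/.*"status" *: *"\([a-z]*\)".*/\1/p' | head -n 1)
  case "$status" in
    green)  echo "green: healthy, data fully replicated" ;;
    yellow) echo "yellow: functioning, replicas unassigned (expected on a single node)" ;;
    red)    echo "red: broken, cannot query or store data" ;;
    *)      echo "unrecognized status: $status" ;;
  esac
}

# On a node: curl -s 'http://localhost:9200/_cluster/health?pretty=true' | health_summary
echo '{ "cluster_name": "horizon", "status": "yellow", "number_of_nodes": 1 }' | health_summary
```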