Aria Operations cluster status stuck on Waiting for Analytics, analytics-.log has the error "Attempt to start not runnable node"

Article ID: 372809

Updated On:

Products

VMware Aria Suite

Issue/Introduction

The Aria Operations cluster fails to come online because the nodes are waiting on analytics.
The analytics logs show messages like these:

2024-07-22T06:44:09,502+0000 INFO [Analytics Main Thread] com.vmware.statsplatform.persistence.sqldb.SQLHAInfoCache.setMasterSQLDBSliceInfo - setMasterSQLDBSliceInfo: Master info has been modified in HAInfoCache: (Gemfire Name: vRealize Ops Persistence-9916fbf1-5664-4415-8525-a43a0d8848bb, HostName: xx.xx.xx.xx, DBConnStr: jdbc:postgresql://xx.xx.xx.xx:5433/vcopsdb?ssl=true&sslfactory=org.postgresql.ssl.LibPQvROpsFactory&sslmode=verify-ca)
2024-07-22T06:44:09,502+0000 INFO [Analytics Main Thread] com.vmware.statsplatform.persistence.sqldb.SQLDBHAManagerImpl.broadcastDBSliceInfo - Master SQLDBSLiceInfo is changed to INITIALIZED in HAInfoCache. GemfireName:vRealize Ops Persistence-9916fbf1-5664-4415-8525-a43a0d8848bb, hostname:xx.xx.xx.xx, connection string:jdbc:postgresql://xx.xx.xx.xx:5433/vcopsdb?ssl=true&sslfactory=org.postgresql.ssl.LibPQvROpsFactory&sslmode=verify-ca, status:1
2024-07-22T06:44:09,510+0000 ERROR [Analytics Main Thread] com.integrien.analytics.AnalyticsMain.run - AnalyticsMain.run failed with error: RuntimeException: Attempt to start not runnable node
java.lang.RuntimeException: Attempt to start not runnable node
    at com.vmware.vcops.platform.gemfire.StartupBarrier.waitUntilTriggered(StartupBarrier.java:101) ~[alive_platform.jar:?]
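
To confirm that a node is hitting this condition, the analytics log can be scanned for the error text quoted above. The following is a minimal Python sketch, not part of the KB tooling: the function name find_startup_errors is hypothetical, and it only assumes that the error is logged at ERROR level with the exact message shown in the excerpt.

```python
from typing import Iterable, List

# Error text taken from the log excerpt in this article.
ERROR_TEXT = "Attempt to start not runnable node"


def find_startup_errors(lines: Iterable[str]) -> List[str]:
    """Return ERROR-level log lines that contain the startup error text.

    Hypothetical helper for illustration; it does not ship with Aria Operations.
    """
    hits = []
    for line in lines:
        # Match only the ERROR entry, not the stack-trace line that repeats the message.
        if " ERROR " in line and ERROR_TEXT in line:
            hits.append(line.rstrip("\n"))
    return hits
```

The function accepts any iterable of lines, so an open file handle for the analytics log can be passed to it directly. An empty result means the error text was not found in that file.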

Environment

Aria Operations 8.x

Cause

The analytics processes cannot come online because the "CACHED_ROLES" document has diverged between the nodes.

Resolution

Take an offline snapshot of all the Aria Operations nodes before making any changes. Follow this KB only to create the snapshot (do not attempt to bring the cluster online).
Copy the file attached to this KB, restoreCachedRoles.py, to the /tmp/ directory on one of the nodes in the cluster using an SCP client such as WinSCP.
Open an SSH session to the same node.
Change to the /tmp/ directory:
# cd /tmp/
Make restoreCachedRoles.py executable:
# chmod 777 restoreCachedRoles.py
Run restoreCachedRoles.py on that node only:
# python restoreCachedRoles.py
Once the script completes successfully, bring the cluster online by following step 11 on this KB.
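
The manual run of the script can also be wrapped in a small pre-flight check. The sketch below is illustrative only: build_command and run_restore are hypothetical names, and treating a zero exit code as success is an assumption, since this article does not document how restoreCachedRoles.py reports failure. It does not take the snapshot or bring the cluster online; those steps remain manual.

```python
import subprocess
import sys
from pathlib import Path
from typing import List

# Name of the attachment referenced in this article.
SCRIPT_NAME = "restoreCachedRoles.py"


def build_command(work_dir: str = "/tmp", interpreter: str = "python") -> List[str]:
    """Build the same command the manual steps run: python restoreCachedRoles.py."""
    return [interpreter, str(Path(work_dir) / SCRIPT_NAME)]


def run_restore(work_dir: str = "/tmp", interpreter: str = "python") -> int:
    """Run the script from work_dir and return its exit code.

    Assumption: exit code 0 means success. Review the script output as well.
    """
    script = Path(work_dir) / SCRIPT_NAME
    if not script.is_file():
        raise FileNotFoundError(f"{script} not found; copy the attachment there first")
    # cwd mirrors the manual "cd /tmp/" step.
    result = subprocess.run(build_command(work_dir, interpreter), cwd=work_dir)
    return result.returncode


if __name__ == "__main__" and len(sys.argv) > 1 and sys.argv[1] == "--run":
    sys.exit(run_restore())
```

As in the manual procedure, this should be run on one node only, and only after the offline snapshots exist.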

Attachments

restoreCachedRoles.py
