Aria Operations cluster status stuck on Waiting for Analytics, analytics-.log has the error "Attempt to start not runnable node"

Article ID: 372809

Products

VMware Aria Suite

Issue/Introduction

  • The Aria Operations cluster fails to come online, and the nodes remain in the Waiting for Analytics state.
  • The analytics logs contain the following messages:

2024-07-22T06:44:09,502+0000 INFO  [Analytics Main Thread]  com.vmware.statsplatform.persistence.sqldb.SQLHAInfoCache.setMasterSQLDBSliceInfo - setMasterSQLDBSliceInfo: Master info has been modified in HAInfoCache: (Gemfire Name: vRealize Ops Persistence-9916fbf1-5664-4415-8525-a43a0d8848bb, HostName: xx.xx.xx.xx, DBConnStr: jdbc:postgresql://xx.xx.xx.xx:5433/vcopsdb?ssl=true&sslfactory=org.postgresql.ssl.LibPQvROpsFactory&sslmode=verify-ca)
2024-07-22T06:44:09,502+0000 INFO  [Analytics Main Thread]  com.vmware.statsplatform.persistence.sqldb.SQLDBHAManagerImpl.broadcastDBSliceInfo - Master SQLDBSLiceInfo is changed  to INITIALIZED in HAInfoCache. GemfireName:vRealize Ops Persistence-9916fbf1-5664-4415-8525-a43a0d8848bb, hostname:xx.xx.xx.xx, connection string:jdbc:postgresql://xx.xx.xx.xx:5433/vcopsdb?ssl=true&sslfactory=org.postgresql.ssl.LibPQvROpsFactory&sslmode=verify-ca, status:1
2024-07-22T06:44:09,510+0000 ERROR [Analytics Main Thread]  com.integrien.analytics.AnalyticsMain.run - AnalyticsMain.run failed with error: RuntimeException: Attempt to start not runnable node java.lang.RuntimeException: Attempt to start not runnable node
        at com.vmware.vcops.platform.gemfire.StartupBarrier.waitUntilTriggered(StartupBarrier.java:101) ~[alive_platform.jar:?]
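
To confirm that a node is hitting this specific error, the analytics log can be scanned for the message shown above. The following is a minimal sketch, not part of the KB attachment: the function name find_not_runnable_errors is hypothetical, the timestamp pattern is inferred from the excerpt above, and the log file path must be supplied by the reader.

```python
import re
import sys
from typing import Iterable, List

# Error text taken from the log excerpt in this article.
ERROR_MARKER = "Attempt to start not runnable node"

# Timestamp and level prefix, inferred from the excerpt
# (for example "2024-07-22T06:44:09,510+0000 ERROR ..."). This is an assumption.
LINE_PREFIX = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3}[+-]\d{4})\s+(?P<level>[A-Z]+)\s+"
)


def find_not_runnable_errors(lines: Iterable[str]) -> List[str]:
    """Return the timestamps of ERROR lines that contain the marker text."""
    hits = []
    for line in lines:
        match = LINE_PREFIX.match(line)
        if match and match.group("level") == "ERROR" and ERROR_MARKER in line:
            hits.append(match.group("ts"))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_errors.py <path to the analytics log>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for timestamp in find_not_runnable_errors(handle):
            print(timestamp)
```

If the script prints one or more timestamps, the node logged the error at those times. An empty result means the marker was not found in that file.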

Environment

Aria Operations 8.x

Cause

The analytics processes cannot come online because the "CACHED_ROLES" document has diverged between the nodes.

Resolution

  • Take an offline snapshot of every Aria Operations node before making any changes. Follow this KB only to create the snapshot (do not attempt to bring the cluster online).
  • Copy the restoreCachedRoles.py file attached to this KB to /tmp/ on one of the nodes in the cluster, for example with WinSCP.
  • Open an SSH session to the same node.
  • Change to the /tmp/ directory:
    # cd /tmp/
  • Make restoreCachedRoles.py executable:
    # chmod 777 restoreCachedRoles.py
  • Run restoreCachedRoles.py on that node. It only needs to run on one node in the cluster.
    # python restoreCachedRoles.py
  • Once the script completes successfully, bring the cluster online by following step 11 of this KB.
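
Before running the attached script, it can help to confirm that the copy and permission steps took effect. The following is a minimal sketch under stated assumptions: check_script is a hypothetical helper that is not part of the KB attachment, and it only inspects the file. It does not run restoreCachedRoles.py or change anything on the node.

```python
import os
import stat
from typing import List


def check_script(path: str) -> List[str]:
    """Return a list of problems found with the script at `path` (empty if none)."""
    problems = []
    if not os.path.isfile(path):
        problems.append("file not found: " + path)
        return problems
    mode = os.stat(path).st_mode
    # The chmod step in the procedure sets the execute bits; verify the owner bit.
    if not mode & stat.S_IXUSR:
        problems.append("file is not executable by its owner: " + path)
    if os.path.getsize(path) == 0:
        problems.append("file is empty (the copy may have failed): " + path)
    return problems


if __name__ == "__main__":
    # Path used in the procedure above.
    for problem in check_script("/tmp/restoreCachedRoles.py"):
        print(problem)
```

No output means the file is present, non-empty, and executable, so the run step can proceed.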

Attachments

restoreCachedRoles.py