Aria Operations cluster status stuck on Waiting for Analytics, analytics-.log has the error "Attempt to start not runnable node"

Article ID: 372809

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

  • The Aria Operations cluster fails to come online because the nodes are waiting on analytics.
  • A Primary/Replica failover occurred in the past.
  • An upgrade hangs.
  • The analytics logs contain messages similar to the following:

INFO  [Analytics Main Thread]  com.vmware.statsplatform.persistence.sqldb.SQLHAInfoCache.setMasterSQLDBSliceInfo - setMasterSQLDBSliceInfo: Master info has been modified in HAInfoCache: (Gemfire Name: vRealize Ops Persistence-########-####-####-####-############, HostName: ###.###.###.###, DBConnStr: jdbc:postgresql://###.###.###.###:5433/vcopsdb?ssl=true&sslfactory=org.postgresql.ssl.LibPQvROpsFactory&sslmode=verify-ca)
INFO  [Analytics Main Thread]  com.vmware.statsplatform.persistence.sqldb.SQLDBHAManagerImpl.broadcastDBSliceInfo - Master SQLDBSLiceInfo is changed  to INITIALIZED in HAInfoCache. GemfireName:vRealize Ops Persistence-########-####-####-####-############, hostname:###.###.###.###, connection string:jdbc:postgresql://###.###.###.###:5433/vcopsdb?ssl=true&sslfactory=org.postgresql.ssl.LibPQvROpsFactory&sslmode=verify-ca, status:1
ERROR [Analytics Main Thread]  com.integrien.analytics.AnalyticsMain.run - AnalyticsMain.run failed with error: RuntimeException: Attempt to start not runnable node java.lang.RuntimeException: Attempt to start not runnable node
        at com.vmware.vcops.platform.gemfire.StartupBarrier.waitUntilTriggered(StartupBarrier.java:101) ~[alive_platform.jar:?]
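
To confirm that a node's analytics log shows this symptom, the error text can be searched for directly. The following is a minimal sketch, not a supported tool: the function name `find_startup_errors` is hypothetical, and the log file path is deliberately left to the caller because this article does not give a full path.

```python
import sys
from typing import Iterable, List

# Error text quoted in this article's log excerpt.
ERROR_SIGNATURE = "Attempt to start not runnable node"


def find_startup_errors(lines: Iterable[str]) -> List[str]:
    """Return the ERROR-level lines that contain the error signature.

    Hypothetical helper for illustration; stack-trace lines (which do not
    carry the ERROR level marker) are not returned.
    """
    return [
        line.rstrip("\n")
        for line in lines
        if "ERROR" in line and ERROR_SIGNATURE in line
    ]


if __name__ == "__main__" and len(sys.argv) > 1:
    # The log path is passed as the first argument; no default path is assumed.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for hit in find_startup_errors(handle):
            print(hit)
```

If the script prints one or more lines, the log contains the error described in this article.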

 

Environment

Aria Operations 8.x

Cause

The analytics processes cannot come online because the "CACHED_ROLES" document has diverged between the nodes.

Resolution

  1. Take an offline snapshot of all the Aria Operations nodes. Do not attempt to bring the cluster online.
  2. Copy the file attached to this KB, restoreCachedRoles.py, to /tmp/ on one of the nodes in the cluster using WinSCP or a similar tool.
  3. Open an SSH session to the same node.
  4. Go to the /tmp/ directory:
    cd /tmp/
  5. Make restoreCachedRoles.py executable:
    chmod 777 restoreCachedRoles.py
  6. Run restoreCachedRoles.py on that node:
    python restoreCachedRoles.py
  7. Once the script completes successfully, bring the cluster online.
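
Steps 4 through 6 can also be wrapped in a small helper that refuses to continue if the attachment is missing or a command fails. This is a sketch under stated assumptions: the function `run_steps` and its `dry_run` flag are hypothetical, and treating a non-zero exit code as failure is an assumption, since this article does not document how restoreCachedRoles.py reports errors.

```python
import subprocess
from pathlib import Path
from typing import List

SCRIPT_NAME = "restoreCachedRoles.py"  # attachment named in this article


def run_steps(script_dir: str = "/tmp/", dry_run: bool = True) -> List[List[str]]:
    """Return the commands from steps 5 and 6, running them unless dry_run is set.

    Hypothetical helper for illustration only.
    """
    script = Path(script_dir) / SCRIPT_NAME
    if not script.is_file():
        # Step 2 has not been completed on this node.
        raise FileNotFoundError(f"{script} not found; copy the attachment first")

    commands = [
        ["chmod", "777", SCRIPT_NAME],  # step 5
        ["python", SCRIPT_NAME],        # step 6
    ]
    if not dry_run:
        for command in commands:
            # cwd mirrors step 4; check=True raises CalledProcessError
            # on a non-zero exit code (assumed to indicate failure).
            subprocess.run(command, cwd=script_dir, check=True)
    return commands


if __name__ == "__main__":
    # Default is a dry run: only print what would be executed.
    try:
        for planned in run_steps():
            print(" ".join(planned))
    except FileNotFoundError as exc:
        print(exc)
```

Calling `run_steps(dry_run=False)` executes the commands; bring the cluster online only after it returns without raising.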

Attachments

restoreCachedRoles.py