Aria Operations cluster remains in the "Expanding" state due to a node version mismatch

Article ID: 436361

Updated On:

Products

VCF Operations

Issue/Introduction

  • In VMware Aria Operations 8.x, the cluster remains in a perpetual "Expanding" state after adding a new node.


  • Existing nodes (Primary/Data) display a status of "Waiting for analytics".


  • Inspecting the Primary node manually shows the new data node locked in an "ADDING" state. To verify this, run the following command on the Primary node:
    cat /storage/db/casa/webapp/hsqldb/casa.db.script | grep -i online_state

    Example:
    INSERT INTO CASA_DOCS VALUES('clusterMembership','{"onlineState":"ONLINE","cluster_name":"cluster","is_ha_enabled":false,"ha_transition_state":null,"ca_state":"DISABLED","initialization_state":"NONE","remove_node_state":"NONE","document_version":22,"document_time":#######,"online_state":"ADDING"
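The full INSERT line is long, so it can help to print only the state field. The following is a minimal sketch, not an official tool: the function name `casa_online_states` is hypothetical, and the pattern assumes the state values are upper-case words with no whitespace around the colon, as in the example above. It relies on `grep -o`, which GNU grep supports.

```shell
#!/bin/sh
# Hypothetical helper: print each distinct online_state value found in a
# CASA database script. The default path is the one used in this article;
# pass a different path to inspect a copy of the file.
casa_online_states() {
    db_script="${1:-/storage/db/casa/webapp/hsqldb/casa.db.script}"
    # -o prints only the matching key/value pair, not the whole INSERT line.
    # sort -u collapses repeated values into one line each.
    grep -o '"online_state":"[A-Z_]*"' "$db_script" | sort -u
}
```

Running `casa_online_states` on the Primary node with no argument reads the default path. A line reading `"online_state":"ADDING"` in the output corresponds to the stuck state described above.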

Cause

Aria Operations requires all nodes in a cluster to run exactly the same version. If a new node is added with a different version (for example, adding an 8.18.6 node to an 8.18.2 cluster), internal communication and database synchronization break. Because the versions do not match, the analytics service fails to start, leaving the new node stuck in the "ADDING" state and the existing cluster nodes stuck on "Waiting for analytics".
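Because the requirement is an exact match, a plain string comparison is enough to catch this condition before a node is added. The sketch below assumes you have already collected each node's version string by whatever means you normally use. The function name `versions_match` is hypothetical and is not part of the product.

```shell
#!/bin/sh
# Hypothetical helper: compare a reference version against one or more
# node versions.
# Usage: versions_match REFERENCE VERSION...
# Returns 0 when every version equals the reference, 1 otherwise.
versions_match() {
    reference="$1"
    shift
    status=0
    for version in "$@"; do
        # Exact string equality: 8.18.2 and 8.18.6 are treated as different.
        if [ "$version" != "$reference" ]; then
            echo "MISMATCH: expected $reference, found $version"
            status=1
        fi
    done
    return "$status"
}
```

For the example in this article, `versions_match 8.18.2 8.18.2 8.18.6` prints `MISMATCH: expected 8.18.2, found 8.18.6` and returns a non-zero status.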

Resolution

To restore cluster health and complete the migration, follow these steps:

Decommission the Mismatched Node:

  • Log in to the Aria Operations Admin UI (https://<FQDN>/admin).
  • Refer to the Aria Operations Shutdown/Startup Guide for the proper sequence for taking nodes offline.
  • Take the cluster offline and remove the newly added, mismatched node from the environment.

Stabilize the Cluster:

  • SSH into all remaining cluster nodes as the root user.
  • Clear hung tasks by restarting the CASA service on each node:
    service vmware-casa restart

  • Bring the cluster online.
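If the cluster has many nodes, the restart can be scripted from a workstation instead of opening an SSH session to each node by hand. This is a rough sketch under assumptions: the function name `restart_casa_on_nodes` and the `DRY_RUN` variable are hypothetical, the host names are placeholders, and root SSH access to every node is assumed. Only the `service vmware-casa restart` command comes from this article.

```shell
#!/bin/sh
# Hypothetical helper: restart the CASA service on each remaining node,
# one node at a time. Set DRY_RUN=1 to print the commands without running them.
# Usage: restart_casa_on_nodes NODE...
restart_casa_on_nodes() {
    for node in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ssh root@$node service vmware-casa restart"
        else
            # Report a failure on one node but continue with the rest.
            ssh "root@$node" 'service vmware-casa restart' \
                || echo "restart failed on $node" >&2
        fi
    done
}
```

A dry run such as `DRY_RUN=1 restart_casa_on_nodes node1.example.com node2.example.com` lists the commands first, which makes it easy to confirm that the mismatched node is not in the list.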

Perform Corrective Expansion:

  • Deploy new Data Nodes using the same version as the existing cluster.
  • Add the nodes and verify that they reach "Online" status before proceeding with any planned software upgrades.