Aria Operations cluster is stuck in the "Bringing Online" state for an unexpectedly long duration

Article ID: 433864

Products

VCF Operations

Issue/Introduction

During the final stages of a VMware Aria Operations upgrade, or after performing a rollback to a pre-upgrade snapshot, the cluster may become stuck in the "Bringing Online" state.

Symptoms:

  • The Admin UI shows the cluster state as "Bringing Online" with nodes "Waiting for Analytics."
  • The upgrade does not progress to completion for several hours.
  • Diagnostic log bundle collection is extremely slow, potentially taking up to 3 hours.

Cause

The cluster initialization failures and timeouts are caused by severe underlying storage performance degradation.

Aria Operations Analytics and the underlying database services are highly sensitive to storage I/O. Sustained write latency spikes (e.g., reaching 500 ms) directly cause the service initialization timeouts seen during both the upgrade process and subsequent snapshot restorations.

Resolution

To resolve this issue, the underlying storage performance must be stabilized before the Aria Operations upgrade can proceed.

  1. Monitor the storage write latency using vCenter performance charts or the esxtop utility on the ESXi hosts (specifically the GAVG/wr and DAVG/wr metrics).

  2. Remediate the root cause of the storage latency (e.g., address vSAN congestion or disk group issues).

  3. Validate that the storage latency has stabilized at nominal levels (e.g., around 4 ms) using vCenter performance charts or vSAN monitoring.

  4. If the cluster does not transition to "Online" automatically once latency is reduced, log in to each node via SSH and restart the services:

     service vmware-vcops stop && service vmware-vcops start

  5. Ensure any snapshots utilized during the upgrade or rollback process adhere strictly to the supported procedure (Revert to Snapshots after failed upgrade).

  6. Retry the VMware Aria Operations cluster upgrade only after the infrastructure performance issue is fully addressed.

If the upgrade completes after step 4, steps 5 and 6 can be skipped.
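
The validation in step 3 can be approximated with a small shell check before restarting services in step 4. This is a minimal sketch, not part of the product: `latency_stable` is a hypothetical helper, and it assumes the write-latency samples (in milliseconds, one per line) have already been extracted from an esxtop capture or a vCenter chart export. The 10 ms default limit is an illustrative assumption, not a documented threshold.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Aria Operations).
# Reads write-latency samples in milliseconds from stdin, one per line.
# Succeeds only if at least one sample was read and none exceeds the limit.
# The 10 ms default is an assumed value chosen for illustration.
latency_stable() {
    limit_ms="${1:-10}"
    awk -v limit="$limit_ms" '
        $1 + 0 > limit { bad++ }                 # count samples above the limit
        END { exit (NR == 0 || bad > 0) ? 1 : 0 } # no data also counts as failure
    '
}

# Example usage (samples here are made up):
#   printf '4.1\n3.8\n4.4\n' | latency_stable 10 && echo "latency nominal"
```

Treating an empty input as a failure is a deliberate choice: a missing capture should not be mistaken for healthy storage. If the check passes and the cluster is still not "Online", the restart command from step 4 can then be run on each node.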

Additional Information

 

Create a Snapshot as Part of an Update