During the final stages of a VMware Aria Operations upgrade, or after a rollback to a pre-upgrade snapshot, the cluster may become stuck in the "Bringing Online" state.
The cluster initialization failures and timeouts are caused by severe underlying storage performance degradation.
The Aria Operations Analytics service and the underlying database services are highly sensitive to storage I/O. Sustained write latency spikes (e.g., reaching 500 ms) directly cause the service initialization timeouts seen during both the upgrade and any subsequent snapshot restoration.
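To illustrate the difference between a brief latency blip and a sustained spike, here is a minimal Python sketch. It assumes write latency samples (in milliseconds) have already been collected at a fixed interval. The function names and the `min_consecutive` value are hypothetical. The 500 ms threshold is taken only from the example figure above and is not an official limit.

```python
from typing import Sequence


def longest_run_at_or_above(samples_ms: Sequence[float], threshold_ms: float) -> int:
    """Return the longest run of consecutive samples at or above threshold_ms."""
    longest = 0
    current = 0
    for value in samples_ms:
        if value >= threshold_ms:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def is_sustained_spike(
    samples_ms: Sequence[float],
    threshold_ms: float = 500.0,  # example value from the article, not a vendor limit
    min_consecutive: int = 3,     # hypothetical: how many samples in a row count as "sustained"
) -> bool:
    """True when latency stays at or above the threshold for min_consecutive samples."""
    return longest_run_at_or_above(samples_ms, threshold_ms) >= min_consecutive
```

A single high sample between normal readings is not flagged. Several high samples in a row are, which is the pattern most likely to outlast a service start-up timeout.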
To resolve this issue, the underlying storage performance must be stabilized before the Aria Operations upgrade can proceed.
1. Monitor the storage write latency using vCenter performance charts or the esxtop utility on the ESXi hosts (specifically the GAVG/wr and DAVG/wr metrics).
2. Remediate the root cause of the storage latency (e.g., address vSAN congestion or disk group issues).
3. Validate that the storage latency has returned to a stable, nominal level (e.g., ~4 ms) using vCenter performance charts or vSAN monitoring.
4. If the cluster does not transition to "Online" automatically once latency is reduced, log in to each node via SSH and restart the services: service vmware-vcops stop && service vmware-vcops start
5. Ensure that any snapshots used during the upgrade or rollback follow the supported procedure (Revert to Snapshots after failed upgrade).
6. Retry the VMware Aria Operations cluster upgrade only after the infrastructure performance issue is fully resolved.
If the upgrade task completes at step 4, steps 5 and 6 can be skipped.
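The monitoring and validation steps above can be sketched in Python for cases where esxtop output has been captured to a CSV file in batch mode. This is an illustrative sketch, not a supported tool. It assumes the write latency columns can be identified by the substring "MilliSec/Write" in their headers, so check the headers in your own capture and adjust `marker` if they differ. The function names, the 10 ms ceiling, and the 5-sample window are hypothetical choices, loosely based on the ~4 ms nominal example given above.

```python
import csv
import io
from typing import Dict, List, Sequence


def write_latency_columns(csv_text: str, marker: str = "MilliSec/Write") -> Dict[str, List[float]]:
    """Collect numeric values from every column whose header contains marker.

    Assumption: the capture has one header row followed by one row per sample.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    wanted = [i for i, name in enumerate(header) if marker in name]
    columns: Dict[str, List[float]] = {header[i]: [] for i in wanted}
    for row in reader:
        for i in wanted:
            # Skip short rows and empty cells rather than failing the whole parse.
            if i < len(row) and row[i] != "":
                columns[header[i]].append(float(row[i]))
    return columns


def is_stable(
    samples_ms: Sequence[float],
    ceiling_ms: float = 10.0,  # hypothetical ceiling, not a vendor threshold
    window: int = 5,           # hypothetical number of most recent samples to check
) -> bool:
    """True when the last `window` samples exist and are all at or below ceiling_ms."""
    recent = list(samples_ms)[-window:]
    return len(recent) == window and max(recent) <= ceiling_ms
```

Usage might look like reading the captured file with `open(path, newline="").read()`, passing the text to `write_latency_columns`, and moving on to the service restart or upgrade retry only when `is_stable` returns True for every returned column.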