Aria Operations cluster data collection stops functioning intermittently in a Continuous Availability environment.

Article ID: 406434

Updated On:

Products

VMware Aria Operations (formerly vRealize Operations) 8.x

Issue/Introduction

Data collection on the Aria Operations cluster nodes is not operational.

The primary replica node is stuck in the "Waiting for analytics" state.

Environment

VMware Aria Operations 8.x

Cause

  • The vpostgres-repl service is down on the primary replica node.

  • Command output from the primary replica node (vrops-status):

    Slice Online-true
    admin Role Enabled-true
    vRealize Operations vPostgres Replication Database (vpostgres-repl) is not running.
    Setting timezone to:  Etc/UTC
    Successfully set up environmental variables
    vRealize Operations Gemfire Locator (gemfire) is running (4152157).
    data Role Enabled-true
    vRealize Operations vPostgres Database (vpostgres) is running (4154670).
    Setting timezone to:  Etc/UTC
    Successfully set up environmental variables
    vRealize Operations Analytics (analytics) is running (4155482).
    Setting timezone to:  Etc/UTC
    Successfully set up environmental variables
    vRealize Operations Collector (collector) is running (4156324).
    Setting timezone to:  Etc/UTC
    Successfully set up environmental variables
    vRealize Operations API (api) is running (4157253).
    ui Role Enabled-true
    remote collector Role Enabled-false
  • According to VMware Aria Operations Continuous Availability sizing guidelines, network latency between fault domains should ideally be less than 10 ms, with occasional peaks allowed up to 20 ms during 20-second intervals.

  • Latency above these limits significantly delays replication between fault domains, which causes the PostgreSQL replication service (vpostgres-repl) on the replica node to fail. Data replication then halts, and the primary replica node remains stuck waiting for analytics data.
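
The latency limits quoted above can be checked against measured round-trip times between fault domains. The following is a minimal Python sketch, not a VMware tool. The function name check_latency, the LatencyReport type, and the sample format are hypothetical. It also assumes one possible reading of the guideline: latency of 10 ms or more is tolerated only while it stays at or below 20 ms and only for runs of up to 20 seconds. Confirm the exact interpretation against the official sizing guidelines.

```python
from dataclasses import dataclass

STEADY_LIMIT_MS = 10.0  # guideline: latency should ideally stay below this
PEAK_LIMIT_MS = 20.0    # guideline: occasional peaks allowed up to this
PEAK_WINDOW_S = 20.0    # ASSUMPTION: elevated runs may last at most this long


@dataclass
class LatencyReport:
    over_peak: int             # number of samples above PEAK_LIMIT_MS
    longest_elevated_s: float  # longest continuous run at or above STEADY_LIMIT_MS
    within_guideline: bool     # True if no limit (as assumed above) was exceeded


def check_latency(samples):
    """Evaluate (timestamp_seconds, rtt_ms) samples sorted by timestamp."""
    over_peak = 0
    longest = 0.0
    run_start = None  # timestamp where the current elevated run began
    for ts, rtt in samples:
        if rtt > PEAK_LIMIT_MS:
            over_peak += 1
        if rtt >= STEADY_LIMIT_MS:
            if run_start is None:
                run_start = ts
            longest = max(longest, ts - run_start)
        else:
            run_start = None
    ok = over_peak == 0 and longest <= PEAK_WINDOW_S
    return LatencyReport(over_peak, longest, ok)


if __name__ == "__main__":
    # Hypothetical measurements: (seconds since start, round-trip time in ms)
    measured = [(0, 4.8), (10, 12.5), (20, 14.1), (30, 13.7), (40, 15.2)]
    print(check_latency(measured))
```

How the samples are collected (for example, periodic pings between nodes in different fault domains) is left open here. The sketch only evaluates values that have already been gathered.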

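The status output shown under Cause can also be scanned programmatically to list services that are not running. The following is a minimal Python sketch that assumes the output has already been captured as text. The function names parse_service_states and stopped_services are hypothetical, and the pattern relies only on the "(name) is running" and "(name) is not running" wording visible in the output above.

```python
import re

# Matches lines such as:
#   vRealize Operations Collector (collector) is running (4156324).
#   vRealize Operations vPostgres Replication Database (vpostgres-repl) is not running.
SERVICE_LINE = re.compile(r"\((?P<name>[\w-]+)\) is (?P<state>not running|running)")


def parse_service_states(status_text):
    """Return {service_name: True if running, False if not running}."""
    states = {}
    for line in status_text.splitlines():
        match = SERVICE_LINE.search(line)
        if match:
            states[match.group("name")] = match.group("state") == "running"
    return states


def stopped_services(status_text):
    """Return the names of services reported as not running."""
    return [name for name, up in parse_service_states(status_text).items() if not up]


if __name__ == "__main__":
    # Excerpt of the status output from the primary replica node
    sample = """\
vRealize Operations vPostgres Replication Database (vpostgres-repl) is not running.
vRealize Operations Gemfire Locator (gemfire) is running (4152157).
vRealize Operations vPostgres Database (vpostgres) is running (4154670).
"""
    print(stopped_services(sample))  # prints ['vpostgres-repl']
```
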
Resolution