Primary replica node status changes to "Waiting for analytics" while the cluster state becomes "Degraded"

Article ID: 402973

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

After enabling Continuous Availability on your Aria Operations cluster, you observe the following:

  • The cluster status page in the admin UI shows the primary replica node as "Waiting for analytics"
  • The cluster status shows as "Degraded"
  • The /storage/vcops/log/adapters/VCOpsAdapter/fping-stats-latency.csv file on the primary and primary replica nodes indicates consistently high latency between the nodes, and between each node and the witness:

    DATE,IP,MAX,MIN,AVG
    2025-06-19T22:26:07.508357,witness,200.0,6.62,120.97709190672138
    2025-06-23T23:19:59.198435,witness,200.0,5.97,97.60765673981201
    2025-06-19T22:31:07.410779,witness,196.0,6.49,74.33318364611257
    2025-06-19T22:36:07.486360,witness,191.0,6.49,71.5389252336449
    2025-06-19T22:26:07.508332,primary node,160.0,2.0,62.41635333333343
    2025-06-23T23:19:59.198395,primary node,137.0,2.01,57.10421999999998
    2025-06-26T09:38:01.913899,witness,182.0,6.34,49.115297261189056
    2025-06-26T09:43:01.725168,witness,195.0,6.42,47.152745490982
    2025-06-19T22:31:07.410750,primary node,133.0,1.99,37.66202134756507

    The example above shows stats from a primary replica node. The average ping latency is between 37 and 62 ms to the primary node and between 47 and 121 ms to the witness node, with maximum values (spikes) of 133 to 200 ms.

  • In a support bundle, the /storage/vcops/log/casa/casa.log shows errors similar to:

    2025-06-30T15:17:53,164+0000  WARN [pool-4-thread-1] [Az00000M] casa.suiteapi.SuiteApiInternalService:442 - Exception calling suite API GET collectorgroups/archaenabled/secrets; Request Id null: org.springframework.web.client.HttpServerErrorException$ServiceUnavailable: 503 Service Unavailable: "<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"><EOL><html><head><EOL><title>503 Service Unavailable</title><EOL></head><body><EOL><h1>Service Unavailable</h1><EOL><p>The server is temporarily unable to service your<EOL>request due to maintenance downtime or capacity<EOL>problems. Please try again later.</p><EOL><p>Additionally, a 503 Service Unavailable<EOL>error was encountered while trying to use an ErrorDocument to handle the request.</p><EOL></body></html><EOL>"
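
To review the fping-stats-latency.csv data shown earlier without reading it row by row, the values can be grouped per target. The following is a minimal sketch, not a supported tool. It assumes the file starts with the DATE,IP,MAX,MIN,AVG header row shown in the excerpt and that the values are in milliseconds. The function name summarize_latency and the SAMPLE variable are hypothetical names used only for illustration.

```python
import csv
from collections import defaultdict
from statistics import mean

def summarize_latency(lines):
    """Group fping stats rows by target (IP column) and summarize AVG and MAX."""
    samples = defaultdict(list)
    for row in csv.DictReader(lines):
        samples[row["IP"]].append((float(row["AVG"]), float(row["MAX"])))

    summary = {}
    for target, values in samples.items():
        avgs = [avg for avg, _ in values]
        summary[target] = {
            "samples": len(values),
            "lowest_avg": min(avgs),
            "highest_avg": max(avgs),
            "mean_avg": mean(avgs),
            "worst_max": max(peak for _, peak in values),
        }
    return summary

# Rows copied from the excerpt in this article (assumed to be milliseconds).
SAMPLE = """\
DATE,IP,MAX,MIN,AVG
2025-06-19T22:26:07.508357,witness,200.0,6.62,120.97709190672138
2025-06-23T23:19:59.198435,witness,200.0,5.97,97.60765673981201
2025-06-19T22:31:07.410779,witness,196.0,6.49,74.33318364611257
2025-06-19T22:36:07.486360,witness,191.0,6.49,71.5389252336449
2025-06-19T22:26:07.508332,primary node,160.0,2.0,62.41635333333343
2025-06-23T23:19:59.198395,primary node,137.0,2.01,57.10421999999998
2025-06-26T09:38:01.913899,witness,182.0,6.34,49.115297261189056
2025-06-26T09:43:01.725168,witness,195.0,6.42,47.152745490982
2025-06-19T22:31:07.410750,primary node,133.0,1.99,37.66202134756507
""".splitlines()

if __name__ == "__main__":
    for target, stats in summarize_latency(SAMPLE).items():
        print(target, stats)
```

To run the sketch against a real file, pass an open file object instead of SAMPLE, for example summarize_latency(open(path, newline="")), where path points to a copy of fping-stats-latency.csv taken from the node or from a support bundle.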

Environment

Aria Operations 8.18.x

Cause

Continuous Availability has very strict latency and packet loss threshold requirements. If these thresholds are breached consistently, Continuous Availability will not function properly. 

For the latest requirements, see VMware Aria Operations 8.18 Sizing Guidelines
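
To judge whether latency exceeds a limit consistently rather than occasionally, it can help to count how many samples per target are above that limit. The sketch below is illustrative only. The function name breach_ratio is hypothetical, and limit_ms is a placeholder that must be replaced with the value from the Sizing Guidelines; the 50.0 used here is not an official threshold. It again assumes the DATE,IP,MAX,MIN,AVG header and millisecond values.

```python
import csv

def breach_ratio(lines, limit_ms):
    """Return, per target, the fraction of samples whose AVG exceeds limit_ms."""
    totals, breaches = {}, {}
    for row in csv.DictReader(lines):
        target = row["IP"]
        totals[target] = totals.get(target, 0) + 1
        if float(row["AVG"]) > limit_ms:
            breaches[target] = breaches.get(target, 0) + 1
    return {target: breaches.get(target, 0) / totals[target] for target in totals}

# A few rows from the excerpt in this article.
ROWS = """\
DATE,IP,MAX,MIN,AVG
2025-06-19T22:26:07.508357,witness,200.0,6.62,120.97709190672138
2025-06-26T09:38:01.913899,witness,182.0,6.34,49.115297261189056
2025-06-19T22:26:07.508332,primary node,160.0,2.0,62.41635333333343
2025-06-19T22:31:07.410750,primary node,133.0,1.99,37.66202134756507
""".splitlines()

if __name__ == "__main__":
    # 50.0 is a placeholder limit, NOT the documented requirement.
    print(breach_ratio(ROWS, limit_ms=50.0))
```

A ratio close to 1.0 for a target means that nearly every sample was above the chosen limit, which points to a sustained network problem and not an isolated spike.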

Resolution

Improve network performance between the primary node, the primary replica node, the other node pairs, and the witness. If the network requirements cannot be met, reconfigure the cluster so that it does not use Continuous Availability.

See Activate or Deactivate Continuous Availability in VMware Aria Operations Cluster and Node Maintenance