Primary replica node status changes to "Waiting for analytics" while the cluster state becomes "Degraded"

Article ID: 402973

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

After enabling Continuous Availability on your Aria Operations cluster, you observe the following:

The cluster status page in the admin UI shows the primary replica node as "Waiting for analytics"
The cluster status shows as "Degraded"
The /storage/vcops/log/adapters/VCOpsAdapter/fping-stats-latency.csv file on the primary and primary replica nodes shows consistently high latency between the nodes, and between the nodes and the witness:

DATE,IP,MAX,MIN,AVG
2025-06-19T22:26:07.508357,witness,200.0,6.62,120.97709190672138
2025-06-23T23:19:59.198435,witness,200.0,5.97,97.60765673981201
2025-06-19T22:31:07.410779,witness,196.0,6.49,74.33318364611257
2025-06-19T22:36:07.486360,witness,191.0,6.49,71.5389252336449
2025-06-19T22:26:07.508332,primary node,160.0,2.0,62.41635333333343
2025-06-23T23:19:59.198395,primary node,137.0,2.01,57.10421999999998
2025-06-26T09:38:01.913899,witness,182.0,6.34,49.115297261189056
2025-06-26T09:43:01.725168,witness,195.0,6.42,47.152745490982
2025-06-19T22:31:07.410750,primary node,133.0,1.99,37.66202134756507

The example above shows statistics from a primary replica node. The average ping time to the primary node is between roughly 37 and 62 ms, and to the witness node between roughly 47 and 121 ms, with latency spikes over 100 ms.
In a support bundle, the /storage/vcops/log/casa/casa.log shows errors similar to:

2025-06-30T15:17:53,164+0000 WARN [pool-4-thread-1] [Az00000M] casa.suiteapi.SuiteApiInternalService:442 - Exception calling suite API GET collectorgroups/archaenabled/secrets; Request Id null: org.springframework.web.client.HttpServerErrorException$ServiceUnavailable: 503 Service Unavailable: "<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"><EOL><html><head><EOL><title>503 Service Unavailable</title><EOL></head><body><EOL><h1>Service Unavailable</h1><EOL><p>The server is temporarily unable to service your<EOL>request due to maintenance downtime or capacity<EOL>problems. Please try again later.</p><EOL><p>Additionally, a 503 Service Unavailable<EOL>error was encountered while trying to use an ErrorDocument to handle the request.</p><EOL></body></html><EOL>"
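
To review a longer fping-stats-latency.csv file than the excerpt above, the rows can be grouped by target and summarized. The following is a minimal Python sketch, not a VMware-provided tool. It assumes the file has the DATE,IP,MAX,MIN,AVG header shown in the excerpt. The function name summarize_latency and the threshold_ms value are hypothetical; take the real latency limits from the Sizing Guidelines for your version.

```python
import csv
import io
from collections import defaultdict


def summarize_latency(csv_text, threshold_ms):
    """Group rows by target (the IP column) and summarize their latency.

    threshold_ms is a caller-supplied limit (hypothetical here); rows whose
    AVG exceeds it are counted in "over_threshold".
    """
    per_target = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        per_target[row["IP"]].append((float(row["MAX"]), float(row["AVG"])))

    summary = {}
    for target, samples in per_target.items():
        avgs = [avg for _, avg in samples]
        summary[target] = {
            "samples": len(samples),
            "mean_avg_ms": sum(avgs) / len(avgs),
            "peak_max_ms": max(mx for mx, _ in samples),
            "over_threshold": sum(1 for a in avgs if a > threshold_ms),
        }
    return summary


# Three rows copied from the excerpt in this article.
SAMPLE = """DATE,IP,MAX,MIN,AVG
2025-06-19T22:26:07.508357,witness,200.0,6.62,120.97709190672138
2025-06-19T22:26:07.508332,primary node,160.0,2.0,62.41635333333343
2025-06-19T22:31:07.410750,primary node,133.0,1.99,37.66202134756507
"""

if __name__ == "__main__":
    # 10.0 ms is a placeholder threshold, not an official requirement.
    for target, stats in summarize_latency(SAMPLE, threshold_ms=10.0).items():
        print(target, stats)
```

To run it against a real file, read the file contents into a string and pass that in place of SAMPLE. A target where most samples exceed the threshold points to sustained latency rather than an isolated spike.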

Environment

Aria Operations 8.18.x

Cause

Continuous Availability has very strict latency and packet loss threshold requirements. If these thresholds are breached consistently, Continuous Availability will not function properly.

For the latest requirements, see VMware Aria Operations 8.18 Sizing Guidelines

Resolution

Improve network performance between the primary node, the primary replica node, the other node pairs, and the witness node so that the latency and packet loss requirements are met. If the network requirements cannot be met, reconfigure the cluster so that it does not use Continuous Availability.

See Activate or Deactivate Continuous Availability in VMware Aria Operations Cluster and Node Maintenance
