Email alerts fail to trigger on Aria Operations if cluster state is degraded


Article ID: 423584


Updated On:

Products

VMware Aria Operations (formerly vRealize Operations) 8.x

Issue/Introduction

Aria Operations may fail to trigger an email alert after a threshold is exceeded on an object.
The Admin UI reports "Degraded", or nodes are stuck in "Waiting for Analytics".
Running service vmware-casa status on the appliance shell returns active (running) on all nodes.
On the primary node, the /storage/log/vcops/analytics.log file may contain entries such as the following:

WARN [Threshold checker worker thread 3] com.vmware.vcops.platform.common.fsdb.FsdbClientBase.executeOnResourceMembersWithTimeout - Failed to execute function 'FsdbInterface.saveObservations' on server vRealize Ops Fsdb-##### for resourceId=####: FunctionException: The requested server(s) are not running

[AlarmQuery--thread-4] com.vmware.vcops.platform.common.sharding.ShardingGemfireFunctionExecutor.executeForPlatformResultOnNamedServer - Failed to execute function AlarmShardServerInterface.cancelAlarms : FunctionException: The requested server(s) are not running
org.apache.geode.cache.execute.FunctionException: The requested server(s) are not running
Caused by: org.apache.geode.cache.execute.FunctionInvocationTargetException: The requested server(s) are not running

com.vmware.vcops.platform.common.fsdb.FsdbClientBase.executeOnResourceMembersWithTimeout - Failed to execute function 'FsdbInterface.saveObservations' on server vRealize Ops Fsdb-###### for resourceId=#####: FunctionException: The requested server(s) are not running
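To check quickly whether a node's log shows this signature, the entries above can be counted per failed function. The following is a minimal sketch, not a supported tool: the log path and message text come from this article, while the function name `count_failed_functions` and the regular expression are illustrative assumptions.

```python
import re
from collections import Counter
from pathlib import Path

# Path taken from this article; adjust if the log was copied elsewhere.
ANALYTICS_LOG = Path("/storage/log/vcops/analytics.log")

# Message text quoted in the log entries above.
SIGNATURE = "The requested server(s) are not running"

# Captures the name after "Failed to execute function",
# with or without surrounding single quotes.
FUNCTION_RE = re.compile(r"Failed to execute function '?([\w.]+)'?")


def count_failed_functions(lines):
    """Count lines carrying the signature, grouped by failed function name."""
    counts = Counter()
    for line in lines:
        if SIGNATURE not in line:
            continue
        match = FUNCTION_RE.search(line)
        counts[match.group(1) if match else "(unknown function)"] += 1
    return counts


if __name__ == "__main__":
    if ANALYTICS_LOG.exists():
        with ANALYTICS_LOG.open(errors="replace") as handle:
            for name, hits in count_failed_functions(handle).most_common():
                print(f"{hits:6d}  {name}")
    else:
        print(f"{ANALYTICS_LOG} not found; run this on the primary node.")
```

Stack-trace lines that repeat the message without naming a function are grouped under "(unknown function)". A non-zero count only suggests the condition described here; the Admin UI cluster status remains the authoritative check.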

Cause

In VMware Aria Operations, sharding distributes data and operations across multiple nodes (shards) for performance, scale, and redundancy, including in High Availability (HA) configurations.

The issue is caused by a GemFire cluster partition, or "split brain" scenario, in which the Analytics service on one or more data nodes loses connectivity with the primary node.

Impact on Alerting: Aria Operations uses a sharding mechanism to distribute data. If the node hosting the specific "Alarm Shard" for an object (e.g., a Virtual Machine) is unreachable, the system cannot process metrics against alert definitions.
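The effect of an unreachable shard owner can be illustrated with a toy model. This is a sketch for intuition only: the hash-based assignment, the node names, and the functions `owning_node` and `evaluate_alert` are hypothetical and do not represent the product's actual sharding logic.

```python
import zlib


def owning_node(object_id, nodes):
    """Pick a shard owner with a stable hash (illustrative assumption only)."""
    return nodes[zlib.crc32(object_id.encode("utf-8")) % len(nodes)]


def evaluate_alert(object_id, value, threshold, nodes, reachable):
    """Return 'ALERT' or 'OK', or 'SKIPPED' if the owning node is unreachable."""
    owner = owning_node(object_id, nodes)
    if owner not in reachable:
        # No node can evaluate the metric, so no alert (and no email) is produced.
        return "SKIPPED"
    return "ALERT" if value > threshold else "OK"


if __name__ == "__main__":
    nodes = ["primary", "data-1", "data-2"]  # hypothetical node names
    owner = owning_node("vm-042", nodes)
    print(evaluate_alert("vm-042", 95, 90, nodes, set(nodes)))
    print(evaluate_alert("vm-042", 95, 90, nodes, set(nodes) - {owner}))
```

In this model the same metric value breaches the threshold in both calls, but the second call produces no alert because the node that owns the object's shard is missing from the reachable set.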

Resolution

After validating the cluster status, perform a cluster-wide power cycle to reset GemFire's internal node coordination.

Validate Cluster Status
- Log in to the Admin UI (https://<primary-node-ip>/admin).
- Confirm the status of the nodes (e.g., Offline, Degraded).
Take Cluster Offline
- In the Admin UI, select the cluster and click Take Offline.
- Wait for the status to show "Offline" for all nodes.
Reboot Nodes
- Power cycle all nodes in the Aria Operations cluster. Refer to "Shutdown and Startup sequence for Aria Operations cluster" for the correct order.
Bring Cluster Online
- Once all nodes are powered on and reachable, log in to the Admin UI.
- Click Bring Online.
Verify Resolution
- Ensure the Cluster Status is Running in the Admin UI.
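Before clicking Bring Online, it can help to confirm that every node answers on the HTTPS port used by the Admin UI. The following is a minimal sketch under stated assumptions: the host names are placeholders, port 443 is inferred from the https Admin UI URL above, and `is_reachable` and `wait_for_nodes` are illustrative helper names.

```python
import socket
import time


def is_reachable(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_nodes(hosts, port=443, attempts=30, delay=10.0):
    """Poll until every host accepts a connection; return the hosts still down."""
    pending = list(hosts)
    for attempt in range(attempts):
        pending = [host for host in pending if not is_reachable(host, port)]
        if not pending:
            break
        if attempt < attempts - 1:
            time.sleep(delay)
    return pending


if __name__ == "__main__":
    # Placeholder host names; replace with the cluster's node addresses.
    cluster_nodes = ["primary.example.local", "data-1.example.local"]
    still_down = wait_for_nodes(cluster_nodes, attempts=3, delay=5.0)
    print("All nodes reachable" if not still_down else f"Unreachable: {still_down}")
```

A successful TCP connection only shows that the port answers, not that the node's services are healthy, so the Admin UI status is still the check that matters.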
