Email alerts fail to trigger on Aria Operations if cluster state is degraded

Article ID: 423584

Products

VMware Aria Operations (formerly vRealize Operations) 8.x

Issue/Introduction

  • Aria Operations may fail to trigger an email alert after a threshold is exceeded on an object.
  • The Admin UI reports the cluster as "Degraded," or nodes are stuck in "Waiting for Analytics."
  • Running service vmware-casa status on the appliance shell returns active (running) on all nodes.
  • On the primary node, the /storage/log/vcops/analytics.log file may contain the following entries:

    WARN [Threshold checker worker thread 3] com.vmware.vcops.platform.common.fsdb.FsdbClientBase.executeOnResourceMembersWithTimeout - Failed to execute function 'FsdbInterface.saveObservations' on server vRealize Ops Fsdb-##### for resourceId=####: FunctionException: The requested server(s) are not running 

    [AlarmQuery--thread-4] com.vmware.vcops.platform.common.sharding.ShardingGemfireFunctionExecutor.executeForPlatformResultOnNamedServer - Failed to execute function AlarmShardServerInterface.cancelAlarms : FunctionException: The requested server(s) are not running org.apache.geode.cache.execute.FunctionException: The requested server(s) are not running 
    Caused by: org.apache.geode.cache.execute.FunctionInvocationTargetException: The requested server(s) are not running 

    com.vmware.vcops.platform.common.fsdb.FsdbClientBase.executeOnResourceMembersWithTimeout - Failed to execute function 'FsdbInterface.saveObservations' on server vRealize Ops Fsdb-###### for resourceId=#####: FunctionException: The requested server(s) are not running
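
To check quickly whether a node's log carries this signature, the file can be searched programmatically. The following is a minimal sketch, not a VMware-provided tool: the function name find_shard_errors is hypothetical, and the only values taken from this article are the log path and the error text.

```python
from pathlib import Path

# Error text taken from the log entries shown in this article.
SIGNATURE = "The requested server(s) are not running"
# Log location on the primary node, as given in this article.
DEFAULT_LOG = "/storage/log/vcops/analytics.log"


def find_shard_errors(log_path=DEFAULT_LOG, signature=SIGNATURE):
    """Return (line_number, text) for every log line containing the signature."""
    hits = []
    # errors="replace" keeps the scan going if the log holds undecodable bytes.
    with Path(log_path).open("r", encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if signature in line:
                hits.append((number, line.rstrip()))
    return hits
```

Calling find_shard_errors() on the primary node returns an empty list when the signature is absent. A non-empty result is a hint to compare against the other symptoms listed above, not a diagnosis on its own.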

Cause

Sharding in Aria Operations refers to the scalability and High Availability (HA) design of the platform, in which data and operations are distributed across multiple nodes (shards) for performance, scale, and redundancy.

The issue is caused by a GemFire cluster partition, or "split brain" scenario, in which the Analytics service on one or more data nodes loses connectivity with the primary node.

Impact on Alerting: Aria Operations uses a sharding mechanism to distribute data. If the node hosting the specific "Alarm Shard" for an object (e.g., a Virtual Machine) is unreachable, the system cannot process metrics against alert definitions.
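
The effect on alerting can be pictured with a toy model. This sketch is purely illustrative and is not how Aria Operations assigns shards internally: the node names, the hash-based placement, and the function names are all assumptions made for the example. It shows only the general idea that an object whose shard owner is down cannot be evaluated, while objects on healthy nodes still can.

```python
import hashlib

# Hypothetical node names; a real cluster uses its own node identifiers.
NODES = ["primary", "data-1", "data-2"]


def shard_owner(resource_id, nodes=NODES):
    """Map a resource to one node using a stable hash (illustrative placement only)."""
    digest = hashlib.sha256(resource_id.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def can_evaluate_alerts(resource_id, running_nodes, nodes=NODES):
    """An object's alerts can be evaluated only if its shard owner is running."""
    return shard_owner(resource_id, nodes) in running_nodes
```

In this model, removing a single node from running_nodes silences alerts only for the objects that hash to it, which matches the symptom of alerts failing for some objects while the cluster is degraded.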

Resolution

After validating the cluster status, perform a cluster-wide power cycle to reset GemFire's internal node coordination.

  1. Validate Cluster Status

    • Log in to the Admin UI (https://<primary-node-ip>/admin).

    • Confirm the status of the nodes (e.g., Offline, Degraded).

  2. Take Cluster Offline

    • In the Admin UI, select the cluster and click Take Offline.

    • Wait for the status to show "Offline" for all nodes.

  3. Reboot Nodes

  4. Bring Cluster Online

    • Once all nodes are powered on and reachable, log in to the Admin UI.

    • Click Bring Online.

  5. Verify Resolution

    • Ensure the Cluster Status is Running in the Admin UI.
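
Step 4 depends on every node being reachable before the cluster is brought online. One way to confirm this from a workstation is a simple TCP check against each node. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the host names passed in are whatever your nodes are called, and port 443 is assumed because the Admin UI is served over HTTPS. A successful connection shows only that the port accepts connections, not that the cluster services are healthy.

```python
import socket


def is_reachable(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def unreachable_nodes(hosts, port=443, timeout=5.0):
    """Return the subset of hosts that did not accept a TCP connection."""
    return [host for host in hosts if not is_reachable(host, port, timeout)]
```

Passing the list of node addresses to unreachable_nodes returns an empty list when every node accepts connections. Any host it returns should be checked before clicking Bring Online.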