Upgrade of Aria Operations hangs due to an issue with Cached Roles

Products

VCF Operations/Automation (formerly VMware Aria Suite) VCF Operations

Issue/Introduction

In HA/CA clusters, the "CACHED ROLES" document in the '/storage/db/casa/webapp/hsqldb/casa.db.script' file can lose the primary node's ADMIN role.
Manually editing this file is risky and error-prone, and a mistake can cause additional cluster issues. The scripts attached to this article detect and correct the cached role values automatically.

This issue has been observed during cluster maintenance operations (bringing the cluster online, taking it offline, etc.) and during the upgrade process.

This issue can cause several different problems in an Aria Operations cluster. During an upgrade, invalid roles in the cached_roles document prevent analytics from running successfully, so the upgrade either fails or hangs.

An error similar to the following appears in /storage/log/vcops/log/casa/casa.log:

2025-05-08T05:46:41,798+0000  INFO [ajp-nio-xxx.x.0x.x-8011-exec-2] [xxxxxxxx] support.subprocess.GeneralCommand:255 - Command '/usr/bin/sudo -n /usr/lib/vmware-python-3/bin/python /usr/lib/vmware-vcopssuite/utilities/pakManager/bin/vcopsPakManager.py --action new_validate --pak vRealizeOperationsManagerEnterprise-818324521385 --json --force_content_update false --roles ADMIN,DATA,UI' threw exception: CommandLineExitException: key=general.failure; args=1,; cause=
2025-05-08T05:46:41,798+0000  WARN [ajp-nio-xxx.x.0x.x-8011-exec-2] [xxxxxxxx] casa.exception.CasaControllerExceptionHandler:212 - cause for exception = CommandLineExitException: key=general.failure
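
As a quick way to confirm whether the log contains this failure, the following is a minimal sketch that scans casa.log for the exception text shown in the sample lines above. The log path and the marker string come from this article; the function name `find_failures` is hypothetical and is not part of any shipped tooling.

```python
from pathlib import Path

# Log path and marker string are taken from the sample log lines in this article.
CASA_LOG = Path("/storage/log/vcops/log/casa/casa.log")
FAILURE_MARKER = "CommandLineExitException: key=general.failure"


def find_failures(lines):
    """Return every line (without its trailing newline) that contains the failure marker."""
    return [line.rstrip("\n") for line in lines if FAILURE_MARKER in line]


if __name__ == "__main__":
    # Only read the log if it exists, so the sketch is harmless on other machines.
    if CASA_LOG.exists():
        with CASA_LOG.open(errors="replace") as handle:
            for hit in find_failures(handle):
                print(hit)
```

A match only shows that the pak validation command failed. It does not prove that cached roles are the cause, so the role check described in the Resolution section is still needed.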

Another example is an "Inventory sync" failure in VMware Aria Suite Lifecycle with error code LCMVROPCONFIG20066. Aria Suite Lifecycle uses the casa API to get node roles, so an Aria Operations node with invalid roles in the cached_roles document does not appear under the Aria Operations environment in VMware Aria Suite Lifecycle.

In general, if the /casa/cluster/status API returns an invalid role for one of the nodes (most likely the primary node), you have encountered this issue.

Environment

VMware Aria Operations 8.x

VMware Cloud Foundation Operations 9.x

Cause

The root cause of this failure is currently not known.

Resolution

To confirm this problem, check the CACHED_ROLES document on each cluster member.

Take the cluster offline and take snapshots of all nodes in the cluster as per the KB article How to take a Snapshot of VMware Aria Operations.
Reboot the cluster nodes as per the KB article Shutdown and Startup sequence for Aria Operations cluster.
Download the getCachedRoles.py and restoreCachedRoles.py scripts attached to this article.
Copy the script files to the /tmp directory using an SCP utility such as WinSCP.
Log in to the primary node through an SSH session as root.
- NOTE: The "vmware-casa" service must be running for the scripts to function properly. Run the command service vmware-casa status to validate.
Run the command $VMWARE_PYTHON_3_BIN getCachedRoles.py to dump the "cached roles" document from all nodes. The result is written to the same directory in a file named "cachedRoles.json".
- Example of a valid cluster:
  
  > Primary: ADMIN, DATA, UI
  > Primary Replica : ADMIN, DATA, UI, REPLICA
  > Data : DATA, UI
  > Remote Collector : REMOTE_COLLECTOR
  > Witness : WITNESS
Review the file to determine whether the ADMIN role has been removed from the Primary or Primary Replica node.
- NOTE: The first entry in the file is typically the Primary node and lists the cachedRoles information it holds for itself and for the remaining nodes in the cluster. The next section covers the next node in the cluster, such as the Replica, and lists the cachedRoles information it holds for the Primary node, for itself, and then for all other nodes.
If the ADMIN role is missing on the Primary or Primary Replica node, run the command $VMWARE_PYTHON_3_BIN restoreCachedRoles.py --restore to restore the CACHED_ROLES document, which should fix the roles.
Then start the analytics service if it was not running, or resume any cluster online, upgrade, or sync operation, to determine whether the issue is resolved.
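
The manual review of role sets in the steps above can be sketched in Python. This is a minimal illustration only. The valid role sets come from the "valid cluster" example in this article, but the input shape (a mapping of node name to a list of role strings) and the names `classify_roles` and `check_cluster` are assumptions, because the layout of cachedRoles.json is not described here. Note that a Primary node that has lost ADMIN is left with DATA and UI and so looks like an ordinary Data node, which is why the sketch also reports when no node holds the Primary role set.

```python
# Valid role sets per node type, taken from the "valid cluster" example in this article.
VALID_ROLE_SETS = {
    "Primary": {"ADMIN", "DATA", "UI"},
    "Primary Replica": {"ADMIN", "DATA", "UI", "REPLICA"},
    "Data": {"DATA", "UI"},
    "Remote Collector": {"REMOTE_COLLECTOR"},
    "Witness": {"WITNESS"},
}


def classify_roles(roles):
    """Return the node type whose role set matches exactly, or None if nothing matches."""
    role_set = {role.strip().upper() for role in roles}
    for node_type, expected in VALID_ROLE_SETS.items():
        if role_set == expected:
            return node_type
    return None


def check_cluster(cached_roles):
    """Return a list of problems found in one node's view of the cluster.

    cached_roles is a HYPOTHETICAL mapping of node name -> list of role strings;
    the real cachedRoles.json has to be translated into this shape first.
    """
    problems = []
    node_types = {name: classify_roles(roles) for name, roles in cached_roles.items()}
    for name in sorted(node_types):
        if node_types[name] is None:
            problems.append(f"{name}: unrecognised role set {sorted(cached_roles[name])}")
    # A Primary without ADMIN is indistinguishable from a Data node,
    # so the absence of any Primary is reported separately.
    if "Primary" not in node_types.values():
        problems.append("no node holds the Primary role set (ADMIN, DATA, UI)")
    return problems
```

To use it, load cachedRoles.json with the standard json module and build one such mapping for each node's section of the file. An empty result from check_cluster for every section suggests the roles are consistent with the valid example; any reported problem points to the restoreCachedRoles.py step.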

Attachments

restoreCachedRoles.py

getCachedRoles.py