The appliance disks fill up frequently due to vmo_tokenreplay growing rapidly when using VMware Aria Automation or Aria Orchestrator

Article ID: 326109

Updated On:

Products

VMware Aria Suite

Issue/Introduction

Symptoms:
  • The VMware Aria Automation or Aria Orchestrator cluster is not healthy and some pods fail.
  • You see a status similar to:
    NAME          READY   STATUS                      RESTARTS    AGE
    Pod-name      0/1     Init:ErrImageNeverPull      12          14h
  • The VMware Aria Automation or Orchestrator appliance data partition (/data) disk usage is above 80% and fills up frequently, even if more disk space is added.
    For example:
    root@vra-appliance [ / ]# df -h /data
    Filesystem                    Size Used Avail Use% Mounted on
    /dev/mapper/data_vg-data      196G 157G  30G  85%  /data
  • The PostgreSQL database "vco-db" is more than a few gigabytes in size and is growing quickly.
    For example:
    template1=# SELECT pg_database.datname as "database_name", pg_database_size(pg_database.datname)/1024/1024 AS size_in_mb FROM pg_database ORDER by size_in_mb DESC;
    
        database_name    | size_in_mb 
     --------------------+------------
      vco-db             |      77000
      provisioning-db    |         66
      catalog-db         |         12
  • The vmo_tokenreplay table in the vco-db PostgreSQL database is more than a few gigabytes in size and is growing quickly.
    For example:
    template1=# \c vco-db
    You are now connected to database "vco-db" as user "postgres".
    vco-db=# SELECT
       relname as "Table",
       pg_size_pretty(pg_total_relation_size(relid)) As "Size" 
       FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC LIMIT 5;
              Table          |  Size       
    -------------------------+-------------
     vmo_tokenreplay         | 53536123 kB
     vmo_vroconfiguration    | 544 kB      
     vmo_scriptmodule        | 520 kB      
     vmo_scriptmodulecontent | 464 kB      
     vmo_contentsignature    | 456 kB    


Environment

VMware Aria Orchestrator 8.x
VMware Aria Automation 8.x

Cause

This issue occurs when Kubernetes experiences disk pressure, which is standard behavior once usage of the data disk goes above 80%. Kubernetes attempts to free space by deleting the locally available Docker images, which causes some of the services to fail.
Workflow tokens may bloat in size when scriptable tasks or workflows run too many intensive operations. The platform allows an Orchestrator user to write code that writes too much information to the workflow execution, and subsequently to the database.
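
As a quick diagnostic (a minimal sketch; the node name is a placeholder), you can check whether the kubelet has reported disk pressure on an affected node:

    root@vra-appliance [ ~ ]# kubectl describe node vra-appliance.domain.com | grep DiskPressure

A DiskPressure condition of True indicates the kubelet is actively evicting images to free space.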

Resolution

     To work around this issue, follow the steps below:

     Note: Before proceeding with the steps below, VMware recommends backing up the Aria Automation/Aria Orchestrator system using snapshots taken without stopping the VMs.

  1. SSH login to one of the Aria Automation/Aria Orchestrator nodes.

    Note: In case of cluster deployments, complete the steps below:

    a. Identify the primary Postgres pod using the "vracli status" command:

    For example:

    root@vra-appliance [ ~ ]# vracli status | grep primary -B 2
            "Total data size": "263 MB",
            "Conninfo": "host=postgres-1.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10",
            "Role": "primary",


    b. Identify the Aria Automation node running the primary DB pod using the "kubectl -n prelude get pods -o wide" command:

    For example:

    root@vra-appliance [ ~ ]# kubectl -n prelude get pods -o wide| grep postgres-1
    postgres-1    1/1     Running   0    15h   x.x.x.x   vra-appliance.domain.com   <none>           <none>

     
  2. SSH login to the Aria Automation/Aria Orchestrator primary DB node.
  3. Connect to the PostgreSQL database and delete the content of the Aria Orchestrator token replay table:

    vracli dev psql
    template1=# \c vco-db
    You are now connected to database "vco-db" as user "postgres".
    vco-db=# TRUNCATE table vmo_tokenreplay;
    TRUNCATE TABLE
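
    Note: Unlike DELETE, TRUNCATE returns the table's disk space to the operating system immediately, so no VACUUM is required. As an optional check (a sketch assuming the same partition layout shown in the symptoms above), confirm that data partition usage has dropped:

    root@vra-appliance [ ~ ]# df -h /data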

     
  4. On each Aria Automation/Aria Orchestrator appliance node, execute the command below to restore the deleted Docker images:

    /opt/scripts/restore_docker_images.sh
     
  5. Wait until the Aria Automation/Aria Orchestrator cluster is healthy and all pods are in running state.
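
    For example, you can re-check the pod status with the command used earlier; all pods should report STATUS Running with no ErrImageNeverPull errors:

    root@vra-appliance [ ~ ]# kubectl -n prelude get pods -o wide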

 

       Note: Token replay is essentially a debugging feature: it lets you follow the inputs/outputs of each "token" (item) in a workflow run. Whether or not it is used, it is additional information that does not affect the execution of workflows in any way.

       For workflows that store many large objects as part of their execution runs, this translates into a large volume of vmo_tokenreplay entries.

       If you do not want to use this feature, it is recommended to disable it to avoid excessive storage consumption; otherwise, monitor the size of the vmo_tokenreplay table, as shown in the sketch below.
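
       To monitor the table's growth (a sketch reusing the psql session shown in the symptoms above):

       vracli dev psql
       template1=# \c vco-db
       vco-db=# SELECT pg_size_pretty(pg_total_relation_size('vmo_tokenreplay'));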

       The feature can be disabled by running the following command:

  1. On each Aria Automation/Aria Orchestrator appliance node, disable the Aria Orchestrator token replay feature with this command:
    mv /data/vco/usr/lib/vco/app-server/extensions/tokenreplay-8.x.0.jar /data/vco/usr/lib/vco/app-server/extensions/tokenreplay-8.x.0.jar.disable

    Note: The filename differs based on the Aria Automation/Aria Orchestrator product version. For example:

    8.x: tokenreplay-8.x.0.jar

    The file is created again on each execution of /opt/scripts/deploy.sh. If you need to run the deploy.sh script, delete or rename the token replay JAR file again afterwards.
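
    To re-enable the feature later, reverse the rename (a sketch assuming the same 8.x filename as above):

    mv /data/vco/usr/lib/vco/app-server/extensions/tokenreplay-8.x.0.jar.disable /data/vco/usr/lib/vco/app-server/extensions/tokenreplay-8.x.0.jar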