The appliance disks fill up frequently due to vmo_workflowtokencontent growing rapidly when using VMware Aria Automation or Automation Orchestrator

Article ID: 318817

Updated On:

Products

VMware Aria Suite

Issue/Introduction

Symptoms:
  • The VMware Aria Automation or Automation Orchestrator cluster is not healthy and some pods fail to initialize with status:
    Init:ErrImageNeverPull
    
    NAME READY STATUS RESTARTS AGE
    Pod-name 0/1 Init:ErrImageNeverPull 12 14h
  • The data partition (/data) disk usage is above 80% and fills up frequently, even if more disk space is added:
    root@vra-appliance [ / ]# df -h /data
    
    Filesystem Size Used Avail Use% Mounted on
    /dev/mapper/data_vg-data 196G 157G 30G 85% /data
  • The PostgreSQL database vco-db is more than a few gigabytes in size and is growing quickly:
template1=# SELECT pg_database.datname as "database_name", pg_database_size(pg_database.datname)/1024/1024 AS size_in_mb FROM pg_database ORDER by size_in_mb DESC;
   database_name | size_in_mb
--------------------+------------
 vco-db | 77000
 provisioning-db | 66
 catalog-db | 12
 ...
  • The vmo_workflowtokencontent table in the "vco-db" PostgreSQL database is more than a few gigabytes in size and is growing quickly:
template1=# \c vco-db
You are now connected to database "vco-db" as user "postgres".
vco-db=# SELECT
   relname as "Table",
   pg_size_pretty(pg_total_relation_size(relid)) As "Size"
   FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC LIMIT 5;
          Table | Size
-------------------------+-------------
 vmo_workflowtokencontent| 53536123 kB
 vmo_vroconfiguration | 544 kB
 vmo_scriptmodule | 520 kB
 vmo_scriptmodulecontent | 464 kB
 vmo_contentsignature | 456 kB


Environment

VMware vRealize Orchestrator 8.x
VMware vRealize Automation 8.x

Cause

  • This issue is caused by out-of-the-box Kubernetes behavior: when data disk usage goes above 80%, Kubernetes attempts to free space by deleting the available Docker images, which causes some of the services to fail. The root cause of the large disk usage is a combination of abnormally large data (workflow inputs and outputs) used in the Orchestrator workflows and failures in the automatic database auto-vacuuming.
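To check whether auto-vacuuming has been keeping up on the affected table, the statistics in pg_stat_user_tables can be inspected from a psql session on the appliance (for example, one opened with vracli dev psql as in the workaround below). This is a diagnostic sketch only; the values returned will differ per environment.

    -- Diagnostic sketch: dead-tuple count and last (auto)vacuum times for the table.
    -- Run while connected to the vco-db database.
    SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
    FROM pg_stat_user_tables
    WHERE relname = 'vmo_workflowtokencontent';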

Resolution

  • This is a known issue affecting VMware Aria Automation and VMware Aria Automation Orchestrator 8.x.
  • Currently, there is no resolution. A workaround is described below.


Workaround:

Procedure

  1. Increase the data disk size (Disk2) on each appliance so there is sufficient space to perform a manual VACUUM FULL on the vmo_workflowtokencontent table.
Note: The VACUUM FULL command writes a full copy of the vacuumed table, so the disk must be increased by at least the current size of the vmo_workflowtokencontent table (see the size-check sketch below).
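The current total size of the table, including its indexes and TOAST data, can be checked with a query like the one below. This is a sketch; the reported size will differ per environment.

    -- Sketch: total on-disk size of the table, used to size the temporary disk increase.
    SELECT pg_size_pretty(pg_total_relation_size('vmo_workflowtokencontent')) AS total_size;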
  2. Back up all appliance nodes without stopping them.
  3. On each appliance node, run the following command to restore the deleted Docker images:
/opt/scripts/restore_docker_images.sh
  4. Wait until the vRA/vRO cluster is healthy and all pods are in the Running state.
  5. Connect to the PostgreSQL database "vco-db" and run VACUUM FULL on the "vmo_workflowtokencontent" table:
VACUUM (FULL, VERBOSE, ANALYZE) vmo_workflowtokencontent;
Example:
root@vra-appliance [ ~ ]# vracli dev psql
....
template1=# \c vco-db

You are now connected to database "vco-db" as user "postgres".

vco-db=# VACUUM (FULL, VERBOSE, ANALYZE) vmo_workflowtokencontent;
INFO: vacuuming "public.vmo_workflowtokencontent"
INFO: "vmo_workflowtokencontent": found 0 removable, 17 nonremovable row versions in 2 pages
DETAIL: 0 dead row versions cannot be removed yet.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: analyzing "public.vmo_workflowtokencontent"
INFO: "vmo_workflowtokencontent": scanned 2 of 2 pages, containing 17 live rows and 0 dead rows; 17 rows in sample, 17 estimated total rows
VACUUM
  6. Update the auto-vacuum configuration on the table to allow more frequent vacuum attempts (a sketch of the resulting trigger point follows these statements):
    alter table vmo_workflowtokencontent set (autovacuum_vacuum_scale_factor = 0.05);
    alter table vmo_workflowtokencontent set (autovacuum_vacuum_threshold = 25);
    alter table vmo_workflowtokencontent set (autovacuum_vacuum_cost_delay = 10);
    alter table vmo_workflowtokencontent set (autovacuum_analyze_threshold = 25);
    alter table vmo_workflowtokencontent set (toast.autovacuum_vacuum_scale_factor = 0.05);
    alter table vmo_workflowtokencontent set (toast.autovacuum_vacuum_threshold = 25);
    alter table vmo_workflowtokencontent set (toast.autovacuum_vacuum_cost_delay = 10);
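For reference, PostgreSQL starts an autovacuum on a table once its dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples. The query below is a sketch that estimates the resulting trigger point from the planner's row estimate; the actual figure depends on your data.

    -- Sketch: estimated number of dead rows that will trigger autovacuum
    -- with the settings above (threshold 25, scale factor 0.05).
    SELECT reltuples::bigint AS estimated_rows,
           (25 + 0.05 * reltuples)::bigint AS dead_rows_to_trigger_autovacuum
    FROM pg_class
    WHERE relname = 'vmo_workflowtokencontent';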
  7. Validate the new auto-vacuum settings:
vco-db=# SELECT relname, pg_options_to_table(reloptions) AS reloption
    FROM pg_class
    WHERE reloptions IS NOT NULL
        AND relnamespace = 'public'::regnamespace
    ORDER BY relname, reloption;
  8. Verify that the manual VACUUM FULL command reduced the vmo_workflowtokencontent table size and that the /data disk now has enough free space (a verification sketch follows).
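The table size and remaining dead rows can be re-checked with a query like the sketch below and compared with the values recorded before the VACUUM FULL; the /data usage can be re-checked with the same df -h /data command shown in the Symptoms section.

    -- Verification sketch: table size and dead-row count after VACUUM FULL.
    SELECT pg_size_pretty(pg_total_relation_size('vmo_workflowtokencontent')) AS total_size,
           n_dead_tup
    FROM pg_stat_user_tables
    WHERE relname = 'vmo_workflowtokencontent';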