OpenSearch Service Fails to Start During VMware Identity Manager Patching (Step 12)

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

When applying a patch to VMware Identity Manager (vIDM) following guides such as KB426230, the process may stall at Step 12.

The following symptoms are observed:

The OpenSearch service fails to start or initialize.
Direct access to the vIDM node's port 8443 GUI is unavailable.
System diagnostics show OpenSearch status as Unknown.

The update logs indicate a failure during the internal PostgreSQL database upgrade (e.g., from version 9.6 to 14.x).

Error while copying relation "saas.CacheEntry_expiry": could not write file "db/data/base/#####/#####": No space left on device
Failure, exiting 
Postgres upgrade from 9.6 to 14.x failed
Warning: %post(horizon-database-rpm-3.3.7.0-25163938.noarch) scriptlet failed, exit status 1

error while copying relation "saas.SuiteTokenCache_principal_clientId": could not write file "/db/data/base/#####/#####": No space left on device
Failure, exiting
Postgres upgrade from 9.6 to 14.x failed
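
To confirm that a stalled patch matches this article, the update log can be searched for the two messages shown above. The following is a minimal sketch: find_space_errors is a hypothetical helper name, and the log path is passed as an argument because this article does not state where the update log is stored.

```shell
#!/bin/sh
# Hypothetical helper: print (with line numbers) the log lines that point to
# the out-of-space failure during the PostgreSQL upgrade.
# $1 = path to an update log (location not given in this article)
find_space_errors() {
    grep -n -i -E 'No space left on device|Postgres upgrade from .* failed' "$1"
}

# Example call, using a placeholder path:
# find_space_errors /path/to/update.log
```

If the function prints nothing, the log contains neither message and the failure likely has a different cause.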

Environment

vIDM 3.3.7

Cause

The internal database upgrade fails because the /db partition has insufficient free space. This is commonly caused by the SuiteTokenCache and/or CacheEntry tables growing excessively large. During the upgrade process, the system attempts to duplicate the database, and if the available space is less than the size of the database, the operation fails with a "No space left on device" error.
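
To see which tables are actually consuming the space, the largest tables in the saas schema can be listed before any cleanup. The following is an illustrative sketch, not a vendor-supplied script: the psql path and connection options are the ones used in this article, while the query and the LARGEST_TABLES_SQL variable name are assumptions added for illustration.

```shell
#!/bin/sh
# psql path as used elsewhere in this article.
PSQL=/opt/vmware/vpostgres/current/bin/psql

# Illustrative query: ten largest ordinary tables in the saas schema,
# with sizes that include indexes and TOAST data.
LARGEST_TABLES_SQL="
SELECT c.relname AS table_name,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'saas' AND c.relkind = 'r'
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 10;"

# Run the query only where the appliance's psql binary is present.
if [ -x "$PSQL" ]; then
    "$PSQL" -U postgres -d saas -c "$LARGEST_TABLES_SQL"
else
    echo "psql not found at $PSQL; run this on the vIDM appliance" >&2
fi
```

If SuiteTokenCache or CacheEntry appears at the top of the list, the cleanup steps in the Resolution section apply.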

Resolution

Please ensure snapshots of the vIDM appliances have been taken before applying these steps.

For SuiteTokenCache table:
 
1. Connect to Postgres:

/opt/vmware/vpostgres/current/bin/psql -U postgres -d saas

2. Check table size and disk space:

SELECT pg_size_pretty(pg_total_relation_size('"SuiteTokenCache"')) AS total_size_with_indexes,
       pg_size_pretty(pg_relation_size('"SuiteTokenCache"')) AS table_data_size,
       (SELECT count(*) FROM "SuiteTokenCache") AS total_rows,
       (SELECT count(*) FROM "SuiteTokenCache" WHERE expires < extract(epoch from now())) AS expired_rows,
       (SELECT count(*) FROM "SuiteTokenCache" WHERE expires >= extract(epoch from now()) AND revoked = false) AS active_rows;

Example output:

 total_size_with_indexes | table_data_size | total_rows | expired_rows | active_rows
-------------------------+-----------------+------------+--------------+-------------
 128 kB                  | 8192 bytes      |          3 |            0 |           3
(1 row)

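Run the following from the appliance shell, or prefix each command with \! to run it from the psql prompt:

df -h /db/data
du -sh /db/data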

3. Phase 1 — Delete expired rows only (no user impact):

DELETE FROM "SuiteTokenCache" WHERE expires < extract(epoch from now()); 
VACUUM FULL "SuiteTokenCache";

Re-check table size and disk space using the queries from Step 2.

4. Phase 2: Truncate the entire table (if Phase 1 is not enough)

Note: This forces ALL logged-in users to log in again. No user accounts or configurations are affected.

TRUNCATE TABLE "SuiteTokenCache" CASCADE;

For CacheEntry table:

1. Check table size and disk space:

SELECT pg_size_pretty(pg_total_relation_size('saas."CacheEntry"')) AS total_size_with_indexes,
       pg_size_pretty(pg_relation_size('saas."CacheEntry"')) AS table_data_size,
       (SELECT count(*) FROM saas."CacheEntry") AS total_rows,
       (SELECT count(*) FROM saas."CacheEntry" WHERE expiry < (extract(epoch FROM now()) * 1000)) AS expired_rows,
       (SELECT count(*) FROM saas."CacheEntry" WHERE expiry >= (extract(epoch FROM now()) * 1000)) AS active_rows;

df -h /db/data 
du -sh /db/data

2. Phase 1: Delete expired rows only (no user impact):

DELETE FROM saas."CacheEntry"
WHERE expiry < (extract(epoch FROM now()) * 1000); 
VACUUM FULL saas."CacheEntry";

Re-check table size and disk space using the CacheEntry query and the df/du commands above.

After confirming that the free space on /db exceeds the size of the database, which the upgrade duplicates, take fresh snapshots and proceed with the upgrade.
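
The final space check can be scripted as a simple comparison of used and available space. The following is a rough sketch under stated assumptions: has_room_for_copy is a hypothetical helper, the size of /db/data is used as a stand-in for the database size, and no safety margin is applied because this article does not specify one.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the available space is larger than the
# space in use, i.e. when a full second copy of the data would fit.
# $1 = used KiB, $2 = available KiB
has_room_for_copy() {
    [ "$2" -gt "$1" ]
}

DATA_DIR=/db/data   # data directory named in this article

# Only measure when the directory exists (that is, on the vIDM appliance).
if [ -d "$DATA_DIR" ]; then
    used_kb=$(du -sk "$DATA_DIR" | awk '{print $1}')
    avail_kb=$(df -Pk "$DATA_DIR" | awk 'NR==2 {print $4}')
    if has_room_for_copy "$used_kb" "$avail_kb"; then
        echo "OK: ${avail_kb} KiB free, ${used_kb} KiB used"
    else
        echo "NOT ENOUGH: ${avail_kb} KiB free, ${used_kb} KiB used"
    fi
fi
```

A result of NOT ENOUGH suggests repeating the cleanup steps, or moving on to Phase 2, before retrying the patch.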