Aria Operations 8.x upgrade hangs/fails due to incorrect .pgpass permissions.

Products

VMware vRealize Operations 8.x
VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

Symptoms:

After attempting to upgrade to Aria Operations 8.x, the following error appears in the admin UI:
Failed: PAK action "run master postgres db upgrade" failed.
The log file /var/log/vmware/vcops/dbupgrade_*.log includes entries like:

2021-05-21T04:45:32 DEBUG - startCentralPostgres:198 - Start the central postgres service
2021-05-21T04:45:32 INFO - runScript:152 - Running command: /sbin/service vpostgres-repl start
2021-05-21T04:45:33 INFO - runScript:159 - stdout:
2021-05-21T04:45:33 INFO - runScript:160 - stderr: Job for vpostgres-repl.service failed because the control process exited with error code.
See "systemctl status vpostgres-repl.service" and "journalctl -xe" for details.

2021-05-21T04:45:33 INFO - runScript:161 - exit code: 1
2021-05-21T04:45:33 ERROR - runScript:165 - Script command: "/sbin/service vpostgres-repl start" failed with exit code: 1
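
The failure signature in the excerpt above can be searched for across the dbupgrade logs. The following is a minimal sketch, assuming the ERROR line always has the shape shown in the excerpt; the function name find_repl_start_failures and the regular expression are illustrative and are not part of the product.

```python
import glob
import re

# Assumed shape of the ERROR line, taken from the log excerpt in this article.
FAILURE_RE = re.compile(
    r'^(?P<ts>\S+)\s+ERROR\s+-\s+runScript:\d+\s+-\s+'
    r'Script command: "(?P<cmd>[^"]*vpostgres-repl start)" '
    r'failed with exit code: (?P<code>\d+)'
)


def find_repl_start_failures(lines):
    """Return (timestamp, command, exit_code) for each matching log line."""
    hits = []
    for line in lines:
        match = FAILURE_RE.match(line.strip())
        if match:
            hits.append(
                (match.group("ts"), match.group("cmd"), int(match.group("code")))
            )
    return hits


if __name__ == "__main__":
    # Read-only scan of the log location named in this article.
    for path in glob.glob("/var/log/vmware/vcops/dbupgrade_*.log"):
        with open(path, errors="replace") as handle:
            for hit in find_repl_start_failures(handle):
                print(path, *hit)
```

The scan only reads the logs. A match indicates that the upgrade step failed while starting vpostgres-repl, which is one of the symptoms listed here and not a confirmation of the cause.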

The original cluster was deployed as version 6.x.
The cluster was functioning correctly before the upgrade and can be restored to its stable pre-upgrade state.
On the primary and/or replica nodes, the file /var/vmware/vpostgres/current/.pgpass has an invalid owner, for example 1002:users.
The file /storage/db/vcops/recovery.conf.bootstrap exists on the current primary node.
In the file /usr/lib/vmware-vcops/user/conf/persistence/persistence.properties on both the primary and replica nodes:
repl.db.role is set to MASTER on the primary node and REPLICA on the replica node.
repl.jdbc.url points to the address of the current primary node.
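
The file-level symptoms above can be collected on a node without changing anything. The following is a rough sketch that reports the owner of .pgpass, whether recovery.conf.bootstrap exists, and the two replication properties. The helper names describe_owner and read_properties are hypothetical, and the parser assumes simple key=value lines; escaped characters in values are left as they appear in the file.

```python
import grp
import os
import pwd

PGPASS = "/var/vmware/vpostgres/current/.pgpass"
BOOTSTRAP = "/storage/db/vcops/recovery.conf.bootstrap"
PROPERTIES = "/usr/lib/vmware-vcops/user/conf/persistence/persistence.properties"


def describe_owner(path):
    """Return (uid, user, gid, group); a name is None if the id is unknown."""
    st = os.stat(path)
    try:
        user = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        user = None  # numeric owner with no matching account
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        group = None
    return st.st_uid, user, st.st_gid, group


def read_properties(path, keys):
    """Parse simple key=value lines and return only the requested keys."""
    found = {}
    with open(path, errors="replace") as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith(("#", "!")) or "=" not in line:
                continue
            key, _, value = line.partition("=")
            if key.strip() in keys:
                found[key.strip()] = value.strip()
    return found


if __name__ == "__main__":
    try:
        print(".pgpass owner:", describe_owner(PGPASS))
    except OSError as err:
        print(".pgpass could not be inspected:", err)
    print("bootstrap file present:", os.path.exists(BOOTSTRAP))
    try:
        print(read_properties(PROPERTIES, {"repl.db.role", "repl.jdbc.url"}))
    except OSError as err:
        print("persistence.properties could not be read:", err)
```

Run on each node, the output can be compared against the symptoms in this list. The script deliberately makes no changes to ownership or files.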

Environment

VMware Aria Operations 8.x

Aria Operations 8.18.x

Cause

On replica nodes, the versions of the service script /etc/init.d/vpostgres-repl in vROps 8.3 rely on the postgres user having read access to the .pgpass file to perform the psql ("Test connection to $MASTER_IP") and pg_basebackup ("Base backup from $MASTER_IP") steps in the run_as_replica() function. Starting with vROps 8.4, the postgres replication service connects using certificates instead of relying on .pgpass for user/password connection settings.
During normal vROps usage, the invalid file ownership causes the "Test connection" and "Base backup" steps to fail, but the script logic ignores these errors and proceeds with commands that start the postgres replication database instance.
Scripts that execute vpostgres-repl during upgrades trap the same errors and cause the upgrade to fail.
Due to the issue with invalid .pgpass ownership, HA failovers may complete partially, leaving both the primary and replica nodes with the file /storage/db/vcops/recovery.conf.bootstrap. The existence of this file on the replica node triggers the vpostgres-repl service to reset the current node as the receiving side of replication. It should never exist on the primary node.
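
The rule about recovery.conf.bootstrap described above can be expressed as a small check. This is an illustrative sketch only: the function name check_bootstrap_state and its return labels are assumptions made for this example, and the role values are the repl.db.role values MASTER and REPLICA from persistence.properties.

```python
def check_bootstrap_state(role, bootstrap_present):
    """Classify a node from its repl.db.role and the bootstrap file's presence.

    Labels (hypothetical, for this sketch):
      "unexpected"            - file exists on the primary, where it should never be
      "replica-reset-pending" - file exists on a replica, so vpostgres-repl will
                                reset the node as the receiving side of replication
      "ok"                    - file is absent
    """
    normalized = role.strip().upper()
    if normalized not in ("MASTER", "REPLICA"):
        raise ValueError(f"unrecognized repl.db.role: {role!r}")
    if not bootstrap_present:
        return "ok"
    if normalized == "MASTER":
        return "unexpected"
    return "replica-reset-pending"


if __name__ == "__main__":
    for role in ("MASTER", "REPLICA"):
        for present in (True, False):
            print(role, present, check_bootstrap_state(role, present))
```

A result of "unexpected" on the primary node matches the partially completed HA failover described in this section.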

Note: Starting with Aria Operations 8.12.x, the .pgpass and bootstrap files are no longer used, because the postgres primary and replica authenticate to each other with certificates.

Resolution

This is a known issue in Aria Operations, and there is currently no resolution available. If you believe that you have encountered this issue, please raise a case with Broadcom support.

Article ID: 315909
