VMware Aria Operations cluster status is stuck at going online due to issues with vpostgres-repl service

Article ID: 337138

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

This article covers the steps to bring the VMware Aria Operations (formerly known as vRealize Operations) cluster online when there are incorrect ownership/permissions on the postgres files.

You might experience the following symptoms:

The cluster is stuck at "Going Online" for a long time
Nodes have a "Waiting on Analytics" status
On the Replica node, the analytics service keeps restarting because the vpostgres-repl database initialization does not complete

In the /storage/log/vcops/log/analytics-wrapper.log, you may find entries similar to:

2022/11/03 02:33:08 | INFO   | jvm 1    | 2022-11-03T02:33:08,708+0000 [23175] - root - ERROR: Script command: "['/sbin/service', 'vpostgres-repl', 'start']" failed with exit code: "1"
2022/11/03 02:33:08 | INFO   | jvm 1    | Failed to start vpostgres-repl result: SubprocessResponse(success=False, rc=1, stderr='Job for vpostgres-repl.service failed because the control process exited with error code.\nSee "systemctl status vpostgres-repl.service" and "journalctl -xe" for details.\n', stdout='')
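
To check quickly whether a node's log contains this signature, a scan along the following lines may help. This is a minimal sketch, not a Broadcom-provided tool: the function name find_repl_start_failures and the matched substrings are assumptions derived from the sample entries above, and the wording in your environment may differ.

```python
from pathlib import Path

# Substrings copied from the sample entries above (assumption: your log uses
# the same wording; adjust the tuple if it does not).
SIGNATURES = (
    "'vpostgres-repl', 'start']\" failed with exit code",
    "Failed to start vpostgres-repl",
)


def find_repl_start_failures(lines):
    """Return the lines that contain a vpostgres-repl start-failure signature."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(signature in line for signature in SIGNATURES)
    ]


if __name__ == "__main__":
    log_path = Path("/storage/log/vcops/log/analytics-wrapper.log")
    if log_path.exists():
        # errors="replace" keeps the scan going if the log has odd bytes.
        with log_path.open(errors="replace") as handle:
            for hit in find_repl_start_failures(handle):
                print(hit)
    else:
        print(f"{log_path} not found on this node")
```

The script only reads the log; it does not start or stop any service.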

On the Replica node, the vpostgres-repl service fails to start

In the /var/log/vmware/vcops/vcops-services-startup.log, you may find entries similar to:

Running /etc/init.d/vmware-vcops start vpostgres-repl  at: Mon Dec  6 18:05:25 UTC 2021, pid: 3402
Slice Online-true
admin Role Enabled-true
Reset vRealize Operations vPostgres Replication Database (vpostgres-repl)...
Test connection to ###.###.###.###...
Failed testing connection to ###.###.###.###
cp: cannot stat '/storage/db/vcops/vpostgres/repl/postmaster.pid': No such file or directory
chmod: cannot access '/usr/lib/vmware-vcops/user/conf/persistence/vpostgres-repl.pid': No such file or directory
data Role Enabled-true
ui Role Enabled-true
remote collector Role Enabled-false
Completed /etc/init.d/vmware-vcops start vpostgres-repl  at: Mon Dec  6 18:05:26 UTC 2021, pid: 3402
Job for vpostgres-repl.service failed because the control process exited with error code.
See "systemctl status vpostgres-repl.service" and "journalctl -xe" for details.

In the /storage/db/vcops/vpostgres/repl/pg_log/postgresql-xx.log, you may find entries similar to:

2022-11-03 02:31:39.466 UTC    20828 1 6363280b.515c LOG:  database system was shut down at 2022-11-03 02:19:30 UTC
2022-11-03 02:31:39.499 UTC    20826 6 6363280b.515a LOG:  database system is ready to accept connections
2022-11-03 02:33:07.579 UTC    20826 7 6363280b.515a LOG:  received fast shutdown request
2022-11-03 02:33:07.583 UTC    20826 8 6363280b.515a LOG:  aborting any active transactions
2022-11-03 02:33:07.584 UTC    20826 9 6363280b.515a LOG:  background worker "logical replication launcher" (PID 20834) exited with exit code 1
2022-11-03 02:33:07.585 UTC    20829 1 6363280b.515d LOG:  shutting down
2022-11-03 02:33:07.621 UTC    20826 10 6363280b.515a LOG:  database system is shut down
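
The excerpt above shows the database becoming ready and then receiving a fast shutdown request a short time later. As a rough way to measure that interval across a whole log, the sketch below pairs each "ready" entry with the next shutdown request. The function name uptime_before_shutdown is hypothetical, and the timestamp layout is assumed to match the sample lines.

```python
from datetime import datetime

READY = "database system is ready to accept connections"
SHUTDOWN = "received fast shutdown request"


def _timestamp(line):
    # Assumption: each entry starts with a 23-character timestamp such as
    # "2022-11-03 02:31:39.499", as in the sample excerpt.
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")


def uptime_before_shutdown(lines):
    """Return the seconds between each 'ready' entry and the next fast shutdown."""
    durations = []
    ready_at = None
    for line in lines:
        if READY in line:
            ready_at = _timestamp(line)
        elif SHUTDOWN in line and ready_at is not None:
            durations.append((_timestamp(line) - ready_at).total_seconds())
            ready_at = None
    return durations
```

Repeated short intervals are consistent with the restart loop described in the symptoms, although they do not identify the cause on their own.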

Starting the vpostgres-repl service manually with "/etc/init.d/vpostgres-repl start" also fails with the same error.

Note: The preceding log excerpts are only examples. Dates, times, and environment-specific values will vary.

Environment

VMware Aria Operations 8.x

Cause

This issue occurs when there are incorrect ownership/permissions on the postgres files.

Resolution

Contact Broadcom Support for review and assistance, and reference this KB article.
