VMware Aria Operations cluster status is stuck at going online due to issues with vpostgres-repl service

Article ID: 337138

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

This article describes how to bring a VMware Aria Operations (formerly known as vRealize Operations) cluster online when incorrect ownership or permissions on the postgres files prevent the vpostgres-repl service from starting.

You might experience the following symptoms:

  • The cluster is stuck at "Going Online" for a long time
  • Nodes have "Waiting on Analytics" status
  • On the Replica node, the analytics service keeps restarting because the vpostgres-repl database initialization does not complete
  • In /storage/log/vcops/log/analytics-wrapper.log, you may find entries similar to:
    2022/11/03 02:33:08 | INFO   | jvm 1    | 2022-11-03T02:33:08,708+0000 [23175] - root - ERROR: Script command: "['/sbin/service', 'vpostgres-repl', 'start']" failed with exit code: "1"
    2022/11/03 02:33:08 | INFO   | jvm 1    | Failed to start vpostgres-repl result: SubprocessResponse(success=False, rc=1, stderr='Job for vpostgres-repl.service failed because the control process exited with error code.\nSee "systemctl status vpostgres-repl.service" and "journalctl -xe" for details.\n', stdout='')
  • On the Replica node, the vpostgres-repl service fails to start
  • In /var/log/vmware/vcops/vcops-services-startup.log, you may find entries similar to:
    Running /etc/init.d/vmware-vcops start vpostgres-repl  at: Mon Dec  6 18:05:25 UTC 2021, pid: 3402
    Slice Online-true
    admin Role Enabled-true
    Reset vRealize Operations vPostgres Replication Database (vpostgres-repl)...
    Test connection to ###.###.###.###...
    Failed testing connection to ###.###.###.###
    cp: cannot stat '/storage/db/vcops/vpostgres/repl/postmaster.pid': No such file or directory
    chmod: cannot access '/usr/lib/vmware-vcops/user/conf/persistence/vpostgres-repl.pid': No such file or directory
    data Role Enabled-true
    ui Role Enabled-true
    remote collector Role Enabled-false
    Completed /etc/init.d/vmware-vcops start vpostgres-repl  at: Mon Dec  6 18:05:26 UTC 2021, pid: 3402
    Job for vpostgres-repl.service failed because the control process exited with error code.
    See "systemctl status vpostgres-repl.service" and "journalctl -xe" for details.
  • In /storage/db/vcops/vpostgres/repl/pg_log/postgresql-xx.log, you may find entries similar to:
    2022-11-03 02:31:39.466 UTC    20828 1 6363280b.515c LOG:  database system was shut down at 2022-11-03 02:19:30 UTC
    2022-11-03 02:31:39.499 UTC    20826 6 6363280b.515a LOG:  database system is ready to accept connections
    2022-11-03 02:33:07.579 UTC    20826 7 6363280b.515a LOG:  received fast shutdown request
    2022-11-03 02:33:07.583 UTC    20826 8 6363280b.515a LOG:  aborting any active transactions
    2022-11-03 02:33:07.584 UTC    20826 9 6363280b.515a LOG:  background worker "logical replication launcher" (PID 20834) exited with exit code 1
    2022-11-03 02:33:07.585 UTC    20829 1 6363280b.515d LOG:  shutting down
    2022-11-03 02:33:07.621 UTC    20826 10 6363280b.515a LOG:  database system is shut down
  • Starting the vpostgres-repl service manually with "/etc/init.d/vpostgres-repl start" fails with the same error.

Note: The preceding log excerpts are only examples. Dates, times, and environment-specific values will differ in your environment.
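
To check quickly whether a node shows the log symptoms above, the two most distinctive messages can be counted with grep. The following is a minimal sketch, not an official tool: the function names count_signature and scan_vpostgres_repl_symptoms are hypothetical, and the search strings are taken from the example excerpts above, so they may need adjusting if the wording in your logs differs.

```shell
#!/bin/sh
# Hypothetical helper: print how many lines of a log file contain a fixed string.
# Prints 0 if the file is missing or unreadable.
count_signature() {
    log_file=$1
    signature=$2
    if [ ! -r "$log_file" ]; then
        echo 0
        return 0
    fi
    # -F: treat the signature as a literal string; -c: print the count of matching lines.
    # grep exits non-zero when the count is 0, so mask that with "|| true".
    grep -F -c -- "$signature" "$log_file" || true
}

# Hypothetical helper: report the counts for the two logs named in this article.
# Both paths can be overridden; the defaults are the paths listed in the symptoms.
scan_vpostgres_repl_symptoms() {
    wrapper_log=${1:-/storage/log/vcops/log/analytics-wrapper.log}
    startup_log=${2:-/var/log/vmware/vcops/vcops-services-startup.log}
    echo "analytics-wrapper: $(count_signature "$wrapper_log" 'Failed to start vpostgres-repl')"
    echo "services-startup: $(count_signature "$startup_log" 'Job for vpostgres-repl.service failed')"
}
```

After sourcing the functions on the Replica node, run scan_vpostgres_repl_symptoms with no arguments. Non-zero counts suggest the node matches the symptoms in this article. A count of zero does not rule the issue out, because log files may have rotated.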

Environment

VMware Aria Operations 8.x

Cause

This issue occurs when the postgres files have incorrect ownership or permissions.
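
Before opening a support case, it can help to record which files look wrong. The sketch below is read-only: it lists entries under a directory that are not owned by an expected user, together with their owner, group, and mode. It assumes GNU find and stat. The function name list_unexpected_owners is hypothetical, and this article does not state the correct owner, so the expected-owner argument is a placeholder that must be confirmed with support or against a healthy node.

```shell
#!/bin/sh
# Hypothetical helper: list every entry under data_dir that is NOT owned by
# expected_owner (a user name or numeric UID), one entry per line, as:
#   owner:group octal-mode path
# This only reports; it does not change ownership or permissions.
list_unexpected_owners() {
    data_dir=$1
    expected_owner=$2
    # "! -user" selects entries owned by anyone else.
    # "-exec ... {} +" runs stat only if at least one entry matched.
    find "$data_dir" ! -user "$expected_owner" -exec stat -c '%U:%G %a %n' {} +
}
```

For example, list_unexpected_owners /storage/db/vcops/vpostgres/repl EXPECTED_OWNER, where EXPECTED_OWNER stands for the user confirmed for your environment. Empty output means every entry is owned by that user. Compare the reported modes with the same directory on a healthy node.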

Resolution

Contact Broadcom Support for assistance, and reference this KB article (Article ID: 337138) when opening the case.