vIDM PostgreSQL PCP recovery fails with “Execution of Command Failed at 1st Stage”

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

PostgreSQL PCP recovery fails in a VMware Identity Manager environment during patching or cluster remediation.

Standby nodes appear as DOWN in the pgpool status output.
Running the PCP recovery command fails with an error similar to the following:

/usr/local/bin/pcp_recovery_node -h <delegateIP> -p 9898 -U pgpool -n <node_id>
Execution of command failed at 1st stage - recovery_1st_stage

The server.log file reports missing function errors, such as:

ERROR: function pgpool_recovery(unknown, unknown, unknown, unknown, unknown, integer) does not exist at character 8
HINT: No function matches the given name and argument types.

PostgreSQL, pgService, and Horizon Workspace services remain unstable.
The Load Balancer returns 502 errors, and the OpenSearch service is unavailable.
Only the primary node remains operational, preventing patching or remediation from proceeding.

Environment

VMware Identity Manager 3.3.7

Cause

The PCP recovery process fails due to one or both of the following reasons:

Incorrect file ownership of the recovery stage file (/db/data/recovery_1st_stage) — the file is owned by root instead of postgres.
Missing PostgreSQL extension pgpool_recovery, which prevents pgpool from executing the required recovery function.

Resolution

1. Correct file ownership

Log in to the affected node and change the ownership of the recovery stage file from root to postgres:users:

chown postgres:users /db/data/recovery_1st_stage
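
To confirm the ownership before and after the change, a check along the following lines may help. This is a minimal sketch, not part of the official procedure: the helper names file_owner and fix_owner are hypothetical, and it assumes GNU stat is available on the appliance and that the commands are run as root.

```shell
# Hypothetical helper: print "owner:group" of a file (assumes GNU stat).
file_owner() {
    stat -c '%U:%G' "$1"
}

# Hypothetical helper: change ownership only when it differs from the expected value.
fix_owner() {
    local file="$1" expected="$2"
    if [ "$(file_owner "$file")" != "$expected" ]; then
        chown "$expected" "$file"
    fi
}

# Example (run as root on the affected node):
#   file_owner /db/data/recovery_1st_stage        # shows the current owner and group
#   fix_owner  /db/data/recovery_1st_stage postgres:users
```

Checking first avoids touching a file that already has the expected owner and makes the before and after state visible in the terminal history.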

2. Create the missing pgpool extension

On the primary node, create the pgpool_recovery extension manually if it does not already exist:

/opt/vmware/vpostgres/current/bin/psql -h localhost -U postgres -d template1 \
  -c "CREATE EXTENSION IF NOT EXISTS pgpool_recovery WITH SCHEMA pg_catalog;"
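
Whether the extension is present can be checked by querying the pg_extension catalog in template1. The following is a sketch under stated assumptions: the function name has_pgpool_recovery and the PSQL override variable are hypothetical, and the connection options simply mirror the command above.

```shell
# Path to psql as used in this article; can be overridden through the environment.
PSQL="${PSQL:-/opt/vmware/vpostgres/current/bin/psql}"

# Hypothetical helper: succeeds (exit 0) when pgpool_recovery is installed in template1,
# returns 1 when it is missing, and 2 when the query itself fails.
has_pgpool_recovery() {
    local count
    # -A: unaligned output, -t: rows only, so the result is a bare number.
    count=$("$PSQL" -h localhost -U postgres -d template1 -At \
        -c "SELECT count(*) FROM pg_extension WHERE extname = 'pgpool_recovery';") || return 2
    [ "$count" = "1" ]
}

# Example:
#   has_pgpool_recovery && echo "extension present" || echo "extension missing or query failed"
```

Because extensions are registered per database, this only reports on template1, the database targeted by the CREATE EXTENSION command.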

3. Restart PostgreSQL and pgpool services on all nodes

Restart the PostgreSQL and pgpool services to apply changes:

/etc/init.d/vpostgres restart
/etc/init.d/pgService restart

4. Verify node status

Check the pgpool node status:

/usr/local/bin/pcp_node_info -h <delegateIP> -p 9898 -U pgpool -n all

All nodes should now show as UP.
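
For clusters with several nodes, the status output can be filtered rather than read by eye. The sketch below assumes that pcp_node_info prints one line per node with whitespace-separated fields in the order hostname, port, status, load-balance weight, and that a numeric status of 3 means the node is down, as described in the Pgpool-II documentation. The function name down_nodes is hypothetical, and the exact output layout may differ between Pgpool-II versions.

```shell
# Hypothetical helper: read pcp_node_info output on stdin and print the
# hostname of every node whose numeric status (third field) is 3 (down).
down_nodes() {
    awk '$3 == 3 { print $1 }'
}

# Example:
#   /usr/local/bin/pcp_node_info -h <delegateIP> -p 9898 -U pgpool -n all | down_nodes
# No output means no node was reported as down.
```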

5. Retry PCP recovery on the primary

Re-run the PCP recovery process on the primary node:

/usr/local/bin/pcp_recovery_node -h <delegateIP> -p 9898 -U pgpool -n <node_id>

Note: The command should now complete successfully.

6. Validate vIDM cluster health

Ensure all services are stable and the environment is synchronized:

/etc/init.d/vpostgres status
/etc/init.d/pgService status
/etc/init.d/opensearch status
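
The three status checks can also be wrapped in a small loop. This is an illustrative sketch only: the function name check_services and the INIT_DIR override are hypothetical, and it assumes each init script exits with a non-zero code when its service is not running, which should be confirmed on the appliance before relying on it.

```shell
# Hypothetical helper: run "<init script> status" for each named service and
# report those that do not return success. Returns 1 if any check fails.
check_services() {
    local rc=0 svc
    for svc in "$@"; do
        if ! "${INIT_DIR:-/etc/init.d}/$svc" status >/dev/null 2>&1; then
            echo "NOT OK: $svc"
            rc=1
        fi
    done
    return "$rc"
}

# Example (run on each node):
#   check_services vpostgres pgService opensearch && echo "all services report OK"
```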

7. Sync LCM inventory

Run an inventory sync in LCM to validate connectivity.