pcp_recovery on vIDM nodes fails with error message "execution of command failed at "1st stage""

Article ID: 417854

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

  • The vIDM cluster is reported as critical in the vASL UI under Lifecycle Operations > Environment: the global environment is reported to be critical.
  • One or more of the nodes is reported to be down when viewing the pgpool cluster using: su root -c "echo -e 'password'|/opt/vmware/vpostgres/current/bin/psql -h localhost -p 9999 -U pgpool postgres -c \"show pool_nodes\""
  • The remediate operation fails to perform pcp recovery on the node
  • Attempting to run pcp_recovery fails with the error:
    /usr/local/bin/pcp_recovery_node -h delegateIP -p 9898 -U pgpool -n 0
    Password:
    ERROR: executing recovery, execution of command failed at "1st stage"
    DETAIL: command: "recovery_1st_stage"
  • The /db/data/serverlog file contains the entries below, indicating that the pgpool_recovery and pgpool_regclass extensions have been removed:
    • On the primary server: 
          tail -f /db/data/serverlog

      • ERROR: requested WAL segment xxxxxxxxxxxxxxxxxxxxxxxx has already been removed
        ERROR: requested WAL segment xxxxxxxxxxxxxxxxxxxxxxxx has already been removed
        ERROR: requested WAL segment xxxxxxxxxxxxxxxxxxxxxxxx has already been removed
        ERROR: requested WAL segment xxxxxxxxxxxxxxxxxxxxxxxx has already been removed
        ERROR: requested WAL segment xxxxxxxxxxxxxxxxxxxxxxxx has already been removed
        ERROR: requested WAL segment xxxxxxxxxxxxxxxxxxxxxxxx has already been removed
        ERROR: requested WAL segment xxxxxxxxxxxxxxxxxxxxxxxx has already been removed
        ERROR: function pgpool_recovery(unknown, unknown, unknown, unknown, integer) does not exist at character 8
        HINT: No function matches the given name and argument types. You might need to add explicit type casts.
        STATEMENT: SELECT pgpool_recovery('recovery_1st_stage', '<delegate_ip>', '/db/data', '5432', 0)
        ERROR: function pgpool_recovery(unknown, unknown, unknown, unknown, integer) does not exist at character 8
        HINT: No function matches the given name and argument types. You might need to add explicit type casts.
        STATEMENT: SELECT pgpool_recovery('recovery_1st_stage', '<delegate_ip>', '/db/data', '5432', 0)
        ------

      • The same can be validated with: tail -f /db/data/serverlog

        HINT: No function matches the given name and argument types. You might need to add explicit type casts.
        STATEMENT: SELECT pgpool_recovery ('recovery_1st_stage', '<delegate_ip>', '/db/data', '5432', 0)
        ERROR: relation "pgpool_recovery" does not exist at character 15
        STATEMENT: select * from pgpool_recovery;
        ERROR: function pgpool_recovery(unknown, unknown, unknown, unknown, integer) does not exist at character 8
        HINT: No function matches the given name and argument types. You might need to add explicit type casts.
        HINT: No function matches the given name and argument types. You might need to add explicit type casts.

    • On the standby node:
          tail -f /db/data/serverlog

      • LOG: started streaming WAL from primary at 146/48000000 on timeline 6
        FATAL: could not receive data from WAL stream: ERROR: requested WAL segment xxxxxxxxxxxxxxxxxxxxxxxx has already been removed

        LOG: started streaming WAL from primary at 146/48000000 on timeline 6
        FATAL: could not receive data from WAL stream: ERROR: requested WAL segment xxxxxxxxxxxxxxxxxxxxxxxx has already been removed

        LOG: received fast shutdown request
        LOG: aborting any active transactions
        LOG: shutting down
        LOG: database system is shut down
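
The log signatures above can be checked for mechanically rather than by reading tail output. The following is a minimal sketch and not part of the documented procedure: the function name classify_serverlog and its output labels are hypothetical, and only the log path and the message text are taken from the entries quoted above.

```shell
#!/bin/sh
# Hypothetical helper: read serverlog text on stdin and report which of the
# two signatures quoted in this article are present.
classify_serverlog() {
    awk '
        # WAL segment already recycled on the primary
        /requested WAL segment .* has already been removed/ { wal = 1 }
        # pgpool_recovery function missing (optional space before the paren)
        /function pgpool_recovery *\(.*\) does not exist/    { ext = 1 }
        END {
            if (wal) print "wal_segment_removed"
            if (ext) print "pgpool_recovery_missing"
            if (!wal && !ext) print "no_known_signature"
        }
    '
}

# Example use on a node (path as given in the article):
#   tail -n 500 /db/data/serverlog | classify_serverlog
```

A label of pgpool_recovery_missing on the primary matches the symptom described in this article; wal_segment_removed alone only shows that a standby has fallen behind the retained WAL.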

Environment

  • VMware Identity Manager 3.3.7

Cause

  • The cluster can no longer stabilize and complete a pcp recovery when auto-recovery is run from Aria Suite Lifecycle, because the pgpool_recovery and pgpool_regclass extensions have been removed.
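
One way to confirm this cause is to list the extensions installed in the database and check for the two pgpool entries. The sketch below is illustrative only: the function name missing_pgpool_extensions is hypothetical, and querying template1 is an assumption based on the database named in the workaround commands of this article. pg_extension is the standard PostgreSQL catalog of installed extensions.

```shell
#!/bin/sh
# Hypothetical helper: read installed extension names on stdin (one per line)
# and print whichever of the two pgpool extensions is absent.
missing_pgpool_extensions() {
    installed=$(cat)
    for ext in pgpool_recovery pgpool_regclass; do
        # -x matches the whole line, so similarly named entries do not count
        printf '%s\n' "$installed" | grep -qx "$ext" || printf '%s\n' "$ext"
    done
}

# Example use on the primary node (assumes template1, as in the workaround):
#   /opt/vmware/vpostgres/current/bin/psql -h localhost -U postgres -d template1 \
#       -At -c "SELECT extname FROM pg_extension" | missing_pgpool_extensions
```

Empty output means both extensions are present; any name printed is an extension that would need to be recreated.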

Resolution

Resolution:

  • Ideally, revert the vIDM cluster to a healthy snapshot and initiate a Remediate or a Power On for the global environment from vASL.

Workaround:

  • If no snapshot of the cluster in a prior healthy state exists, manually create the database extensions.
    ------------------------------------------
    Steps to UNDO The prepare-vidm-patch.sh script
    ------------------------------------------
    • Execute the below command on all nodes:
      • /etc/init.d/pgService start
    • Execute the below command only on primary:
      • /opt/vmware/vpostgres/current/bin/psql -h localhost -U postgres -d template1 -c "CREATE EXTENSION IF NOT EXISTS pgpool_recovery WITH SCHEMA pg_catalog;"
      • /opt/vmware/vpostgres/current/bin/psql -h localhost -U postgres -d template1 -c "CREATE EXTENSION IF NOT EXISTS pgpool_regclass WITH SCHEMA pg_catalog;"
      • /etc/init.d/NetworkService start

  • Manually run pcp_recovery for the standby nodes: 
    • Stop the vpostgres service on all the standby nodes:
      • /etc/init.d/vpostgres stop
    • Run the below command on the primary node:
      • /usr/local/bin/pcp_recovery_node -h delegateIP -p 9898 -U pgpool -n node_id

           Command parameter help
            -h : The host on which the command is run. Use delegateIP as is; it is a keyword and does not need to be replaced with an IP address.
            -p : The port on which the PCP process accepts connections, which is 9898.
            -U : The pgpool user, which is pgpool. Use as is.
            -n : The ID of the node to be recovered. Replace node_id with the value from the 'node_id' column of the show pool_nodes output for the node being corrected.
            The above command prompts for a password. Enter "password" if the password in /usr/local/etc/pgpool.pwd fails to connect.

        Expected response
        pcp_recovery_node -- Command Successful

  • Trigger an Inventory Sync from vASL to vIDM and validate that the request completes successfully. (The health status syncs on the next run unless a 'Trigger Cluster Health' request is triggered manually.)
  • Log into the vIDM portal and validate cluster health.
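
When the manual pcp_recovery step is repeated for several standby nodes, it can help to classify the captured command output instead of reading it each time. This is a minimal sketch under stated assumptions: the function name classify_recovery_output and its labels are hypothetical, and the two strings it matches are the expected response and the error text quoted in this article.

```shell
#!/bin/sh
# Hypothetical helper: read captured pcp_recovery_node output on stdin and
# print a short label for the outcome.
classify_recovery_output() {
    out=$(cat)
    case $out in
        # Expected response quoted in the workaround steps
        *'Command Successful'*)     echo "recovered" ;;
        # Failure described in the Issue/Introduction section
        *'failed at "1st stage"'*)  echo "failed_1st_stage" ;;
        *)                          echo "unknown" ;;
    esac
}

# Example use, with the output of a recovery run saved to a file beforehand
# (the file name is an assumption for illustration):
#   classify_recovery_output < /tmp/pcp_recovery_node.out
```

A failed_1st_stage result after the extensions have been recreated suggests rechecking /db/data/serverlog on the primary before retrying.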