Troubleshooting VMware Identity Manager postgres cluster deployed through Aria Suite Lifecycle (vRSLCM)

Article ID: 367175


Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

VMware Identity Manager cluster health status is shown as CRITICAL in Aria Suite Lifecycle Health Notification.

Environment

VMware Identity Manager 3.3.7

Resolution

From VMware Aria Suite Lifecycle (vRSLCM), cluster health can be remediated by clicking the REMEDIATE CLUSTER button on the environment screen.

If the option above does not resolve the issue, proceed with the following:

  1. Ensure vRSLCM's inventory is up-to-date. Sometimes changes or drift occur in the environment outside vRSLCM, and for vRSLCM to know the current state of the system its inventory needs to be updated. To do so, it is advisable to perform an inventory sync of vIDM (global environment) in vRSLCM by clicking the Trigger Inventory Sync button.

    1. The vRSLCM inventory sync will not complete if vIDM cannot be logged into. If the remediation task failed because a login attempt as the root or SSH user could not be completed, proceed to the "Remediate cluster health" section.
  2. For the notification to be accurate, vRSLCM should be able to reach all the vIDM nodes and also be able to log in to all nodes to perform health checks.

    1. Make sure all the vIDM nodes in the cluster are powered ON and running. If any of the nodes are powered OFF, then the health notification in vRSLCM would show a description message similar to
      Host(s) <IPs> not reachable from vRSLCM.
    2. Ensure vRSLCM is able to SSH into each vIDM node. If it cannot, a notification is raised with a description message similar to
      Unable to get postgres status from any of the nodes in the cluster.
      or
      Unable to get pgpool status
      if postgres is unavailable from any of the nodes in the cluster.
    3. Ensure vRSLCM has the correct root passwords, otherwise the following notification will be displayed
      root passwords are expired

    In any of these cases, ensure that vRSLCM can reach the vIDM nodes and that the credentials for all nodes are updated in vRSLCM's inventory.

    Note: vRSLCM does a ping to all the vIDM nodes to check if they are available or not.
  3. Once the inventory is updated, use cURL to request the current cluster health status

    curl -H "Authorization: Basic token" -k https://vRSLCMFQDN/lcm/authzn/api/vidmcluserhealth

    Syntax Help:

    vRSLCMFQDN: The hostname / IP of vRSLCM managing the vIDM cluster.

    token: Run the following command to get the Base64 encoded value of username:password. Here username is admin@local, and password is admin@local user's password.

    echo -n 'admin@local:password' | base64
    Note: In VCF mode, replace admin@local with vcfadmin@local and its respective password.

    Note: The API will trigger a request to re-calculate the cluster health which will generate a notification on the current overall status in vRSLCM.

    Note: The automatic health check triggered by vRSLCM against a vIDM cluster runs at a default interval of 1 hour.
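
    The token generation and the health-check call can be combined into one short snippet. A minimal sketch, assuming the admin@local account and using vrslcm.example.com as a hypothetical placeholder for your vRSLCM FQDN (substitute your own values):

    # Build the Basic auth token, then request the vIDM cluster health status
    TOKEN=$(echo -n 'admin@local:password' | base64)
    curl -H "Authorization: Basic ${TOKEN}" -k https://vrslcm.example.com/lcm/authzn/api/vidmcluserhealth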

Remediate cluster health

Note: If an install, upgrade, or scale-out request is in an IN PROGRESS or FAILED state in vRSLCM, open a support case for assistance with the upgrade or install.

Open an SSH session to each of the three vIDM appliances and run the following command:

cat /usr/local/etc/pgpool.pwd

If this file returns a value, use that value as the password in the steps that follow. If no value is returned, the password will be "password" for every step.
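
To avoid retyping the password in later steps, it can be captured in a shell variable. A minimal sketch, assuming a bash shell on the vIDM appliance (the PGPOOL_PW variable name is introduced here only for illustration):

# Use the contents of pgpool.pwd if present, otherwise fall back to "password"
PGPOOL_PW=$(cat /usr/local/etc/pgpool.pwd 2>/dev/null)
PGPOOL_PW=${PGPOOL_PW:-password}
echo "Using pgpool password: ${PGPOOL_PW}"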

  1. Ensure the pgpool service is running on all the nodes in the cluster

    /etc/init.d/pgService status

    Expected Response

    Getting pgpool service status
    pgpool service is running
    Note: This is critical since the pgpool service is responsible for orchestrating the postgres fail-over. Without the pgpool service running on a given node, quorum will not be maintained and the node will not be recognized as part of the cluster.

    If the response is not as expected and pgpool service is not running, you will receive an error similar to

    Pgpool service not running on the node(s) <IPs> in the cluster.

    Start the service if it is not running

    /etc/init.d/pgService start
  2. To find the pgpool MASTER, run the below command on any of the vIDM nodes

    su root -c "echo -e 'password'|/usr/local/bin/pcp_watchdog_info -p 9898 -h localhost -U pgpool"
    Note: vIDM versions 3.3.2 and below use the literal string password in echo -e 'password'. For vIDM versions 3.3.3 and above, if /usr/local/etc/pgpool.pwd exists, substitute its value for password in echo -e 'password'.

    Command parameters help

    -h : The host against which the command is run, here it is 'localhost'.
    -p : The port on which the PCP process accepts connections, which is 9898
    -U : The Pgpool health check and replication delay check user, which is pgpool

    Expected Response

    3 YES <Host1>:9999 Linux <Host1> <Host1>
    
    <Host1>:9999 Linux <Host1> <Host1> 9999 9000 4 MASTER
    <Host2>:9999 Linux <Host2> <Host2> 9999 9000 7 STANDBY
    <Host3>:9999 Linux <Host3> <Host3> 9999 9000 7 STANDBY

    In the response, there must be a MASTER node present, otherwise vRSLCM will raise a notification with a description message similar to

    Could not find a pgpool master node, as all nodes are in standby state.
  3. To check postgres status on all the vIDM nodes

    Run the command on the pgpool MASTER node found in Step #2; this will generate a list of all configured nodes with the corresponding postgres status.

    su root -c "echo -e 'password'|/opt/vmware/vpostgres/current/bin/psql -h localhost -p 9999 -U pgpool postgres -c \"show pool_nodes\""

    Command parameters help

    -h : The host against which the command is run, here it is 'localhost'
    -p : The port on which Pgpool accepts connections, here it is 9999
    -U : The Pgpool user, which is pgpool
    -c : The command to run, which is 'show pool_nodes'

    Expected response

     node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | last_status_change
    ---------+---------------+------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
     0 | Host1 | 5432 | up | 0.333333 | primary | 0 | false | 0 | 2019-10-14 06:05:42
     1 | Host2 | 5432 | up | 0.333333 | standby | 0 | false | 0 | 2019-10-14 06:05:42
     2 | Host3 | 5432 | up | 0.333333 | standby | 0 | true | 0 | 2019-10-14 06:05:42
    (3 rows)

    Role column - Indicates the postgres master node as known to the pgpool service

    Status column - Indicates the postgres status as known to the pgpool service

    Note: In a healthy cluster, there must be at least one postgres master node; if there is, proceed to the next step.

    Note: Both the master node returned by the pcp_watchdog_info command and the primary returned by the show pool_nodes command should be the same node. If they are not, use /etc/init.d/vpostgres stop on the two non-master nodes to force the master to be the primary. Once the show pool_nodes command shows the primary role is on the master you can use /etc/init.d/vpostgres start to start postgres on the other two nodes again.

    Note: If none of the nodes are marked primary in the role column, vRSLCM will provide a notification with a message similar to Could not find a postgres master node, as all nodes are in standby state.

    Warning: This is a special case. Go to the Special cases section. Once complete, return to Step #4.
  4. Ensure the delegate IP is properly assigned. This is the free IP that the user provided in the delegate IP field while clustering vIDM via vRSLCM.

    vRSLCM's postgres clustering solution assigns the delegate IP to the postgres master node, so at any point in time the node which is marked master should hold the delegate IP. Check the role column in the show pool_nodes response from Step #3 for the postgres master node and ensure the delegateIP is assigned on that same node.

    Run the following command on the postgres master node, to make sure delegate IP is still assigned

    ifconfig eth0:0 | grep 'inet addr:' | cut -d: -f2

    Expected response

    delegateIP Bcast

    delegateIP: Make sure this is the same IP provided in the delegate IP field while clustering vIDM via vRSLCM. If so, skip the rest of this step and continue to Step #5.

    If the above command does not return any output, the delegateIP is not assigned on the postgres master node. In that event, the health notification in vRSLCM shows a description message similar to

    DelegateIP IP is not assigned to any nodes in the cluster.

    Make sure the delegateIP is not held by any of the non-master nodes by running the above ifconfig command on the other nodes. If any non-master node still holds the delegateIP, run the following command on that node first to detach it

    ifconfig eth0:0 down

    Run the below command on the master node to re-assign the delegateIP

    ifconfig eth0:0 inet delegateIP netmask Netmask

    delegateIP - This is a keyword and need not be substituted with the actual IP

    Netmask - Netmask currently configured for this node

    Once the above command is successful, ensure the delegateIP is assigned to eth0:0 by running

    ifconfig -a

    and make sure eth0:0 is holding the expected delegateIP.

    Note: Explicitly assigning the delegateIP requires a horizon-workspace service restart on all the nodes. Run the following command on all vIDM nodes to restart the horizon service
    service horizon-workspace restart
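
    The check and re-assignment above can be consolidated into a single pass on the postgres master node. A minimal sketch, assuming the appliance uses eth0 for its primary address and that the delegateIP keyword resolves through /etc/hosts as described in this article (adjust the interface name if your appliance differs):

    # On the postgres master node: derive the netmask from eth0, then assign eth0:0 only if it has no address
    NETMASK=$(ifconfig eth0 | grep 'inet addr:' | sed 's/.*Mask://')
    if ! ifconfig eth0:0 | grep -q 'inet addr:'; then
        ifconfig eth0:0 inet delegateIP netmask "$NETMASK"
    fi
    ifconfig eth0:0
    # If a re-assignment was made, restart horizon-workspace on every vIDM node as noted above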
  5. From Step # 3's show pool_nodes response

    1. If any of the vIDM nodes in the cluster are marked as down in the 'status' column, vRSLCM will raise a notification with a description message similar to 'Postgres status of node(s) <IPs> marked as down in the cluster.'

    This generally happens when postgres on the node was down, the node was rebooted, the node lost network connectivity, or there was a transient network glitch.

    2. If the replication_delay column has a value greater than 0, the node is not able to sync its data from the current postgres master. In such a case, vRSLCM will raise a notification with a description message similar to

      Node(s) <IPs> have a replication delay with respect to the postgres master node.

    In both of the above cases, the affected nodes are not syncing their data from the master node and will not be able to participate in a future master election in case of failover. Run the commands below to recover such nodes and catch them up with the master. This will bring the standby nodes in sync with the postgres master node and mark them as up in the cluster.

    Follow the steps below for the affected node (the node that has a replication_delay or is showing as down):

    1. SSH to the affected Node
    2. Stop postgres on the affected node with the command below:

      To stop postgres for VMware Identity Manager 3.3.2

      service vpostgres stop

      To stop postgres for VMware Identity Manager 3.3.3 or later

      /etc/init.d/vpostgres stop
    3. Then run the below command on the primary node:

      /usr/local/bin/pcp_recovery_node -h delegateIP -p 9898 -U pgpool -n node_id

      Command parameter help

      -h : The host to connect to; use delegateIP as is (delegateIP is a keyword and does not need to be replaced with the actual IP)
      -p : Port on which the PCP process accepts connections, which is 9898
      -U : The Pgpool user, which is pgpool; use as is
      -n : The node id that needs to be recovered. <node_id> is the node that is being corrected and can be obtained from the 'node_id' column of the show pool_nodes command.

      The above command will prompt for a password. Enter the value from /usr/local/etc/pgpool.pwd; if that password fails to connect, use "password".

      Expected response

      pcp_recovery_node -- Command Successful

      Run the show pool_nodes command from Step #3 again to confirm that every node shows up in the status column and 0 in the replication_delay column.

      Also, check the postgres status on the node with /etc/init.d/vpostgres status (for vIDM 3.3.3 or later). If postgres is stopped, start it with /etc/init.d/vpostgres start.

      Note: If the above command does not complete successfully, follow Step #6.
  6. Check whether the /etc/hosts file on all the vIDM cluster nodes has the same master host entry (a quick consistency check is sketched after this list).

    If the master host entry is missing on any of the nodes or is inconsistent across the nodes, follow the steps below

    1. Stop pgpool service on all the nodes (replica and then the master)

      /etc/init.d/pgService stop
    2. Check whether pgpool service is stopped, run the command

      /etc/init.d/pgService status
    3. Correct the master host entry in the file /etc/hosts.

    Note: The master host can be found by executing the command in Step #2. Ensure the master host entry is the same on all the vIDM nodes in the cluster.
    4. Start the pgpool service on all the nodes (master and then the replicas), run the command

      /etc/init.d/pgService start
    5. Check the status of the pgpool service, run the command

      /etc/init.d/pgService status

    Repeat Step # 5 to bring up the nodes in the cluster.

  7. Once all the above checks and corrections are done, re-trigger the health check API from Step #3 at the beginning of the Resolution section to ensure the cluster health notification turns GREEN, displaying a message similar to

    vIDM postgres cluster health status is ok
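
For Step #6, the master host entry can be compared across nodes in one pass. A minimal sketch, assuming SSH access as root, using vidm1/vidm2/vidm3.example.com as hypothetical node names, and with MASTER_HOST set to the master hostname found in Step #2 (substitute your own values):

# Print the master host entry from /etc/hosts on each vIDM node so the entries can be compared
MASTER_HOST=<master hostname from Step #2>
for NODE in vidm1.example.com vidm2.example.com vidm3.example.com; do
    echo "=== ${NODE} ==="
    ssh root@"${NODE}" "grep -i '${MASTER_HOST}' /etc/hosts"
done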

Special cases

When all the nodes are marked standby in the role column of the show pool_nodes response in Step #3, it means there is no postgres master elected and the cluster is in a CRITICAL state; vIDM will not function. Choose one of the nodes as the postgres master and the other nodes as replicas.

Remediation depends on the value in the status column. Choose any one node that is marked as up in the status column; this will be the new postgres master.

To promote the chosen node as postgres master, run the below command on that node

/usr/local/etc/failover.sh <Old master node_id> <Old master node_hostname> 5432 /db/data <chosen node_id> <chosen node_hostname> <Old master node_id> <Old master node_hostname> 5432 /db/data

If the horizon.log shows an UnknownHostException, check /etc/hosts and make sure the delegateIP hostname entry is set to the delegate IP address

Caused by: java.net.UnknownHostException: delegateIP
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184) ~[?:1.8.0_352]
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_352]
    at java.net.Socket.connect(Socket.java:607) ~[?:1.8.0_352]
    at org.postgresql.core.PGStream.createSocket(PGStream.java:231) ~[postgresql-42.2.26.jar:42.2.26]
    at org.postgresql.core.PGStream.<init>(PGStream.java:95) ~[postgresql-42.2.26.jar:42.2.26]

Command help

node_id and hostname can be obtained from the 'show pool_nodes' response in Step #3
<chosen node_id> - The node_id of the chosen node, which is marked as 'up' in the 'status' column.
<chosen node_hostname> - The hostname of the chosen node
<Old master node_id> - The node_id of the previous postgres master node. If not known, this can be any one of the unavailable nodes.
<Old master node_hostname> - The hostname of the previous postgres master node. If not known, this can be any one of the unavailable nodes.

Once the command completes, stop the pgpool service on all the nodes. Then sequentially start the pgpool service on the new postgres master chosen above, followed by the other nodes

/etc/init.d/pgService stop
/etc/init.d/pgService start

Bring up the other nodes and follow the instructions in Steps #4, #5, and #6 to sync them against the new postgres master node.
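
The restart order matters here. A minimal sketch of the sequence, assuming three nodes where node1 is the newly promoted postgres master (node names are placeholders; run each command on the node indicated in the comment):

# 1. Stop pgpool on every node
/etc/init.d/pgService stop      # run on node2, node3, then node1
# 2. Start pgpool on the new postgres master first, as described above
/etc/init.d/pgService start     # run on node1 (new postgres master)
# 3. Start pgpool on the remaining nodes
/etc/init.d/pgService start     # run on node2, then node3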

Running remediation commands using script

Note: Remediation scripts work only if VMware Identity Manager is installed using vRealize Suite Lifecycle Manager 8.2 and above.
  1. Download the utility script KB_75080_vidm_postgres_cluster_remediation.sh from the attachments to /root partition on the VIDM node.
  2. Change file permission of the script by executing the below command
    Command: chmod 777 KB_75080_vidm_postgres_cluster_remediation.sh
  3. Execute the below command to find the command options which are executable using the utility script
    Command: ./KB_75080_vidm_postgres_cluster_remediation.sh --help
  4. To see the pgpool status of vIDM cluster nodes or to find the pgpool master execute the below command
    Command: ./KB_75080_vidm_postgres_cluster_remediation.sh --show_pgpool_status
  5. To see the postgres status of vIDM cluster nodes execute the below command
    Command: ./KB_75080_vidm_postgres_cluster_remediation.sh --show_postgres_status
  6. To recover/ bring up the vIDM nodes in the cluster execute the below command
    Command: ./KB_75080_vidm_postgres_cluster_remediation.sh --pcp_recovery <node_id>
    Command parameters help :
    <node_id> will be the node that is being corrected. This can be obtained from the 'node_id' column of the show pool_nodes command.
 

Attachments

KB_75080_vidm_postgres_cluster_remediation.sh