Automatic database failover fails to work more than once in vRO/vRA cluster deployments

Article ID: 321238

Products

VMware Aria Suite

Issue/Introduction

Symptoms:

  • The automatic database failover fails.
  • Neither of the two remaining nodes is elected as the new primary database node.
  • In the postgres logs, you see entries similar to:

    2021-01-13 11:31:19 +0000 UTC [repmgrd] unable to reconnect to node 101 after 1 attempts
    2021-01-13 11:31:19 +0000 UTC [repmgrd] repmgrd on this node is paused
    2021-01-13 11:31:19 +0000 UTC [repmgrd] monitoring upstream node 101 in degraded state for 0 seconds
    2021-01-13 11:48:45 +0000 UTC [repmgrd] unable to ping "host=hostname.cluster.local 
    dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10"


    Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.



Environment

VMware vRealize Orchestrator 8.x

Cause

This issue occurs because the configuration setting that pauses database failover during vRA/vRO service startup is not properly cleaned up afterwards. The setting is later synchronized to newly added nodes in the cluster, so after the first failover event, subsequent failover attempts fail.

Resolution

This is a known issue affecting VMware vRealize Orchestrator 8.2.x and VMware vRealize Automation 8.2.x.

Currently, there is no resolution.

Workaround:

  1. Back up all vRO/vRA nodes.
  2. Ensure all vRO/vRA nodes are up and running by running the following commands:

    kubectl get nodes
    vracli status services

    For example:

    root@hostname6-157 [ ~ ]# kubectl get nodes
    NAME                                          STATUS     ROLES    AGE    VERSION
    hostname5.example.com                         Ready      master   88m    v1.18.5+vmware.1
    hostname6.example.com                         Ready      master   127m   v1.18.5+vmware.1
    hostname7.example.com                         Ready      master   100m   v1.18.5+vmware.1

    root@hostname6-157 [ ~ ]# vracli status services
    Ready

     
  3. On each node, modify the /opt/scripts/deploy.sh file by adding the following two lines right before the line containing #runhelm postgres:

    # Delete repmgrd state, otherwise one node will start with failover paused.
    vracli cluster exec -- rm -f /data/db/live/pg_stat/repmgrd_state.txt


    Example:

    # Delete repmgrd state, otherwise one node will start with failover paused.
    vracli cluster exec -- rm -f /data/db/live/pg_stat/repmgrd_state.txt

    #runhelm postgres

     
  4. Execute /opt/scripts/deploy.sh on one of the nodes in the vRO/vRA cluster.
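The edit in step 3 can also be scripted. The following is a minimal sketch, not part of the official procedure: a hypothetical helper function (`add_repmgrd_cleanup` is an illustrative name) that inserts the two cleanup lines right before the line containing #runhelm postgres, and skips the edit if the cleanup command is already present so re-running it is safe. It assumes GNU sed, as shipped on the vRO/vRA appliances.

```shell
#!/bin/sh
# Hypothetical helper (illustration only): patch a deploy.sh-style script by
# inserting the repmgrd state cleanup lines before "#runhelm postgres".
add_repmgrd_cleanup() {
    script="$1"   # path to the deploy.sh file to patch

    # Skip if the cleanup command is already present (idempotent).
    if grep -q 'rm -f /data/db/live/pg_stat/repmgrd_state.txt' "$script"; then
        echo "cleanup lines already present in $script"
        return 0
    fi

    # GNU sed: insert the two lines before each line matching "#runhelm postgres".
    sed -i '/#runhelm postgres/i\
# Delete repmgrd state, otherwise one node will start with failover paused.\
vracli cluster exec -- rm -f /data/db/live/pg_stat/repmgrd_state.txt' "$script"
    echo "cleanup lines inserted into $script"
}
```

For example, run `add_repmgrd_cleanup /opt/scripts/deploy.sh` on each node, then confirm the placement with `grep -n -B2 '#runhelm postgres' /opt/scripts/deploy.sh` before proceeding to step 4.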