Automatic database failover fails to work more than once in vRO/vRA cluster deployments
Article ID: 321238
Products
VMware Aria Suite
Issue/Introduction
Symptoms:
The automatic database failover fails.
Neither of the other two available nodes is elected as the new primary database node.
In the postgres logs, you see entries similar to:
2021-01-13 11:31:19 +0000 UTC [repmgrd] unable to reconnect to node 101 after 1 attempts
2021-01-13 11:31:19 +0000 UTC [repmgrd] repmgrd on this node is paused
2021-01-13 11:31:19 +0000 UTC [repmgrd] monitoring upstream node 101 in degraded state for 0 seconds
2021-01-13 11:48:45 +0000 UTC [repmgrd] unable to ping "host=hostname.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10"
Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
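The check for the paused-failover message can be scripted. The following is a minimal sketch, not a supported tool: count_paused_repmgrd is a hypothetical helper name, and the log file path is an assumption that you must supply yourself, because this article does not state where the postgres log is stored.

```shell
#!/bin/sh
# Hypothetical helper: count log lines showing that repmgrd is paused.
# $1 is the path to a postgres/repmgrd log file (supplied by the caller;
# the article does not name the file).
count_paused_repmgrd() {
    log_file="$1"
    # grep -c prints the number of matching lines. It exits non-zero
    # when that number is 0 or the file cannot be read.
    grep -c 'repmgrd on this node is paused' "$log_file"
}
```

A non-zero count on a node suggests that repmgrd on that node is running with failover paused, which matches the symptom described above.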
Environment
VMware vRealize Orchestrator 8.x
Cause
This issue occurs because the configuration setting that pauses database failover during vRA/vRO service startup is not properly cleaned up once startup completes. The setting is later synchronized to newly added nodes in the cluster, and after the first failover event, further failover attempts fail.
Resolution
This is a known issue affecting VMware vRealize Orchestrator 8.2.x and VMware vRealize Automation 8.2.x.
Currently, there is no resolution.
Workaround: To work around this issue:
Back up all vRO/vRA nodes.
Ensure all vRO/vRA nodes are up and running by running this command:
kubectl get nodes
For example:
root@hostname6-157 [ ~ ]# kubectl get nodes
NAME                    STATUS   ROLES    AGE    VERSION
hostname5.example.com   Ready    master   88m    v1.18.5+vmware.1
hostname6.example.com   Ready    master   127m   v1.18.5+vmware.1
hostname7.example.com   Ready    master   100m   v1.18.5+vmware.1
root@hostname6-157 [ ~ ]# vracli status services
Ready
On each node, modify the /opt/scripts/deploy.sh file by adding the following 2 lines:
# Delete repmgrd state, otherwise one node will start with failover paused.
vracli cluster exec -- rm -f /data/db/live/pg_stat/repmgrd_state.txt
Place these lines immediately before the line containing: #runhelm postgres
Example:
# Delete repmgrd state, otherwise one node will start with failover paused.
vracli cluster exec -- rm -f /data/db/live/pg_stat/repmgrd_state.txt
#runhelm postgres
Execute /opt/scripts/deploy.sh on one of the nodes in the vRO/vRA cluster.
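The manual edit to /opt/scripts/deploy.sh described in the steps above could also be applied with a small script, which may help when repeating it on each node. This is a sketch under stated assumptions rather than a supported procedure: patch_deploy_script is a hypothetical helper name, the .bak backup copy is an addition not mentioned in this article, and the helper assumes that the first line containing #runhelm postgres is the intended insertion point.

```shell
#!/bin/sh
# Hypothetical helper: insert the two workaround lines immediately before
# the first line containing "#runhelm postgres" in the given script.
# $1 is the script to patch, for example /opt/scripts/deploy.sh.
patch_deploy_script() {
    script="$1"
    state_cmd='vracli cluster exec -- rm -f /data/db/live/pg_stat/repmgrd_state.txt'

    # Do nothing if the workaround is already present, so reruns are harmless.
    if grep -qF -- "$state_cmd" "$script"; then
        echo "already patched: $script"
        return 0
    fi

    # Refuse to modify a file that lacks the anchor line.
    if ! grep -qF -- '#runhelm postgres' "$script"; then
        echo "anchor line not found: $script" >&2
        return 1
    fi

    # Keep a copy of the original (assumption: a .bak file next to it is acceptable).
    cp -p "$script" "$script.bak" || return 1

    # Print the two new lines once, just before the anchor, then every original line.
    awk -v cmd="$state_cmd" '
        index($0, "#runhelm postgres") && !done {
            print "# Delete repmgrd state, otherwise one node will start with failover paused."
            print cmd
            done = 1
        }
        { print }
    ' "$script.bak" > "$script"
}
```

Calling patch_deploy_script /opt/scripts/deploy.sh on each node would make the same change as the manual edit. Review the result against the example above before executing deploy.sh.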