Automatic database failover fails to work more than once in vRO/vRA cluster deployments

Article ID: 321238

Products

VMware Aria Suite

Issue/Introduction

Symptoms:

Automatic database failover fails after the first failover event.
Neither of the two remaining nodes is elected as the new primary database node.
In the postgres logs, you see entries similar to:

2021-01-13 11:31:19 +0000 UTC [repmgrd] unable to reconnect to node 101 after 1 attempts
2021-01-13 11:31:19 +0000 UTC [repmgrd] repmgrd on this node is paused
2021-01-13 11:31:19 +0000 UTC [repmgrd] monitoring upstream node 101 in degraded state for 0 seconds
2021-01-13 11:48:45 +0000 UTC [repmgrd] unable to ping "host=hostname.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10"

Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
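
To check a saved log for the symptom shown above, a small helper along these lines may be useful. It is only a sketch: the function name count_repmgrd_paused is made up for illustration, and the log file location is not stated in this article, so the path must be supplied by the caller.

```shell
#!/usr/bin/env bash
# Sketch only: count log lines that show repmgrd in the paused state.
# The log file path is an assumption supplied by the caller; this article
# does not say where the postgres/repmgrd log is stored.

count_repmgrd_paused() {
    local log_file="$1"
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so "|| true" keeps the function usable under "set -e".
    grep -c -e 'repmgrd on this node is paused' -- "$log_file" || true
}
```

For example, `count_repmgrd_paused /path/to/saved/postgres.log` prints the number of matching lines. A non-zero count matches the symptom described here but does not by itself prove this issue is the cause.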

Environment

VMware vRealize Orchestrator 8.x

Cause

This issue occurs because the configuration setting that pauses database failover during vRA/vRO service startup is not properly cleaned up once startup completes. The setting is later synchronized to nodes that are newly added to the cluster, and after the first failover event, further failover attempts fail.

Resolution

This is a known issue affecting VMware vRealize Orchestrator 8.2.x and VMware vRealize Automation 8.2.x.

Currently, there is no resolution.

Workaround:
To work around this issue:

Back up all vRO/vRA nodes.
Ensure all vRO/vRA nodes are up and running by running this command:

kubectl get nodes

For example:

root@hostname6-157 [ ~ ]# kubectl get nodes
NAME                    STATUS   ROLES    AGE    VERSION
hostname5.example.com   Ready    master   88m    v1.18.5+vmware.1
hostname6.example.com   Ready    master   127m   v1.18.5+vmware.1
hostname7.example.com   Ready    master   100m   v1.18.5+vmware.1

root@hostname6-157 [ ~ ]# vracli status services
Ready
On each node, edit the /opt/scripts/deploy.sh file and add the following two lines immediately before the line containing #runhelm postgres:

# Delete repmgrd state, otherwise one node will start with failover paused.
vracli cluster exec -- rm -f /data/db/live/pg_stat/repmgrd_state.txt

Example:

# Delete repmgrd state, otherwise one node will start with failover paused.
vracli cluster exec -- rm -f /data/db/live/pg_stat/repmgrd_state.txt

#runhelm postgres
Execute /opt/scripts/deploy.sh on one of the nodes in the vRO/vRA cluster.
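
The node check and the deploy.sh edit described above could be scripted roughly as follows. This is a hedged sketch, not a supported tool: the function names all_nodes_ready and patch_deploy_script are hypothetical, the ".bak" backup file is an added precaution that this article does not call for, and the marker text is taken literally from the workaround steps.

```shell
#!/usr/bin/env bash
# Sketch only: helpers for the workaround steps. Function names and the
# ".bak" backup are assumptions made for illustration.

# Reads "kubectl get nodes" output on stdin. Succeeds only if at least one
# node is listed and every node has STATUS exactly "Ready".
all_nodes_ready() {
    awk 'NR > 1 && NF > 0 { seen = 1; if ($2 != "Ready") bad = 1 }
         END { if (seen && !bad) exit 0; exit 1 }'
}

# Inserts the two workaround lines before the first line containing the
# marker. Does nothing if the fix is already present.
patch_deploy_script() {
    local script="$1"
    local marker='#runhelm postgres'
    local fix='vracli cluster exec -- rm -f /data/db/live/pg_stat/repmgrd_state.txt'

    # Idempotent: skip when the fix line already exists.
    if grep -qF -- "$fix" "$script"; then
        return 0
    fi
    # Refuse to guess a location when the marker line is missing.
    if ! grep -qF -- "$marker" "$script"; then
        echo "marker line not found in $script" >&2
        return 1
    fi
    # Keep a copy of the original, then rewrite the script in place.
    cp -p -- "$script" "$script.bak"
    awk -v marker="$marker" -v fix="$fix" '
        !done && index($0, marker) {
            print "# Delete repmgrd state, otherwise one node will start with failover paused."
            print fix
            done = 1
        }
        { print }
    ' "$script.bak" > "$script"
}
```

A possible invocation on each node is `kubectl get nodes | all_nodes_ready && patch_deploy_script /opt/scripts/deploy.sh`. The readiness check uses an exact match, so a node with a status such as "Ready,SchedulingDisabled" is treated as not ready. Running /opt/scripts/deploy.sh on one node afterward remains a manual step.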
