
Network isolation causes split-brain scenario in a 3 node cluster: Resetting vPostgres clustering


Article ID: 317721


Updated On:

Products

VMware Aria Suite

Issue/Introduction

This article provides instructions on how to monitor and restore a 3 node vPostgres cluster running within Kubernetes containers.

Symptoms:
  • The command vracli status shows multiple primary database nodes.
  • vPostgres is unable to elect a single master node.
  • The prelude-noop-intnet-netcheck.log files within the pods/kube-system/prelude-noop-intnet-ds-***** directories contain entries similar to the following:
2019/12/31 08:27:04 Failed ping for 10.244.2.2, packet loss is 100.000000
2019/12/31 08:27:04 Failed ping for 10.244.1.5, packet loss is 100.000000
2019/12/31 08:27:04 Pinging the majority of nodes failed.
  • The 3 node vRealize Automation 8.0 / 8.0.1 cluster does not have redundant network pathing.
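
To confirm the condition, the following is a minimal sketch, assuming the commands are run on a vRealize Automation appliance; the log path is an assumption taken from the symptom above and should be adjusted to wherever these logs are stored in your environment (for example, the root of an extracted log bundle).

    # In a split-brain state, more than one node is reported as a primary database node.
    vracli status

    # Scan the netcheck logs for failed pings between nodes (path is an assumption; adjust as needed).
    grep -E "Failed ping|Pinging the majority" \
        pods/kube-system/prelude-noop-intnet-ds-*/prelude-noop-intnet-netcheck.log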


Environment

Aria Automation 8.x

Cause

3 node vPostgres clustering can break down due to network isolation or connectivity loss, creating a split-brain scenario in which all three nodes run as master databases.
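
One way to observe the split-brain from the Kubernetes side is to look at what repmgrd reports inside each database pod. The loop below is a minimal sketch, assuming the standard postgres-0/1/2 pod names in the prelude namespace used in the workaround steps; in a split-brain state more than one pod may report itself as the primary.

    # Minimal sketch: show recent repmgrd messages from each database pod.
    for p in postgres-0 postgres-1 postgres-2; do
        echo "--- $p ---"
        kubectl -n prelude logs "$p" --tail=200 | grep -i repmgrd | tail -n 3
    done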

Resolution

Resiliency improvements will be introduced in vRealize Automation 8.1 to prevent this scenario from occurring.

 


Workaround:
Ensure that valid snapshots have been taken prior to performing any actions. Do not create live snapshots. For vRealize Automation 8.x, ensure cold (powered-down) snapshots are taken as described in: vRealize Automation 8.x Preparations for Backing Up

It is highly encouraged to maintain a stringent backup procedure on a daily schedule.

It is recommended to have redundant network pathing between the ESXi hosts that host the vRealize Automation appliance nodes.


  1. On any of the vRealize Automation virtual appliances, run the following command once:
    vracli cluster exec -- touch /data/db/live/debug
Note: This creates a flag file on all cluster nodes that pauses the database pods when they start, so they can be worked with manually.
  2. Restart the postgres-1 and postgres-2 pods:
    kubectl delete pod -n prelude postgres-1; kubectl delete pod -n prelude postgres-2;
Note: This restarts the postgres-1 and postgres-2 pods. Because of the debug flag, they will wait instead of starting vPostgres.
  3. Identify the nodes on which the postgres-1 and postgres-2 pods are now running:
    kubectl get pods -n prelude -l name=postgres -o wide
  4. On the node where postgres-1 is running, execute the following command to remove the debug flag:
    rm /data/db/live/debug
  5. Run:
    kubectl -n prelude logs -f postgres-1
  6. Monitor the logs and ensure that postgres-1 discovers postgres-0 as the primary, re-syncs from it, and starts working. A message similar to the following is reported in the postgres-1 log if successful:
    '[repmgrd] monitoring primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state'
  7. Repeat the same procedure for postgres-2.
  8. Finally, remove the /data/db/live/debug file on the node where postgres-0 is running. A consolidated sketch of the full sequence is shown below.
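
The following is a consolidated reference sketch of the workaround above, assuming shell access to each vRealize Automation appliance (the Kubernetes nodes) so the debug flag can be removed on the node hosting each pod. It is not a script to run unattended; the per-node steps must be repeated for each pod while monitoring the logs.

    # Create the debug flag on every node so database pods pause on start (run once, on any appliance).
    vracli cluster exec -- touch /data/db/live/debug

    # Restart the replica pods; because of the flag they will wait instead of starting vPostgres.
    kubectl delete pod -n prelude postgres-1
    kubectl delete pod -n prelude postgres-2

    # Identify which node each postgres pod is now running on.
    kubectl get pods -n prelude -l name=postgres -o wide

    # On the node running postgres-1, remove the flag and watch the pod re-sync from postgres-0.
    rm /data/db/live/debug
    kubectl -n prelude logs -f postgres-1   # wait for: [repmgrd] monitoring primary node ... (ID: 100)

    # Repeat the previous two commands on the node running postgres-2, then
    # remove /data/db/live/debug on the node running postgres-0.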