
Network isolation causes split-brain scenario in a 3 node cluster: Resetting vPostgres clustering


Article ID: 317721


Updated On:

Products

VMware Aria Suite

Issue/Introduction

This article provides instructions on how to monitor and restore a 3 node vPostgres cluster running within Kubernetes containers.

Symptoms:
  • The command vracli status shows multiple primary database nodes.
  • vPostgres is unable to elect a single master node.
  • The prelude-noop-intnet-netcheck.log files within the pods/kube-system/prelude-noop-intnet-ds-***** directories contain entries similar to the following:
2019/12/31 08:27:04 Failed ping for 10.244.2.2, packet loss is 100.000000
2019/12/31 08:27:04 Failed ping for 10.244.1.5, packet loss is 100.000000
2019/12/31 08:27:04 Pinging the majority of nodes failed.
  • The 3 node vRealize Automation 8.0 / 8.0.1 cluster does not have redundant network pathing.
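
To confirm the condition, the following is a minimal sketch, assuming the commands are run on a vRealize Automation appliance; the log path is an assumption taken from the symptom above and should be adjusted to wherever these logs are stored in your environment (for example, the root of an extracted log bundle).

    # In a split-brain state, more than one node is reported as a primary database node.
    vracli status

    # Scan the netcheck logs for failed pings between nodes (path is an assumption; adjust as needed).
    grep -E "Failed ping|Pinging the majority" \
        pods/kube-system/prelude-noop-intnet-ds-*/prelude-noop-intnet-netcheck.log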


Environment

Aria Automation 8.x

Cause

3 node vPostgres clustering can break down due to network isolation or connectivity loss, creating a split-brain scenario in which all three nodes run as master databases.
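
One way to observe the split-brain from the Kubernetes side is to look at what repmgrd reports inside each database pod. The loop below is a minimal sketch, assuming the standard postgres-0/1/2 pod names in the prelude namespace used in the workaround steps; in a split-brain state more than one pod may report itself as the primary.

    # Minimal sketch: show recent repmgrd messages from each database pod.
    for p in postgres-0 postgres-1 postgres-2; do
        echo "--- $p ---"
        kubectl -n prelude logs "$p" --tail=200 | grep -i repmgrd | tail -n 3
    done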

Resolution

Resiliency improvements will be introduced in vRealize Automation 8.1 to prevent this scenario from occurring.

 


Workaround:
Ensure that valid snapshots have been taken prior to performing any actions. Do not create live snapshots. For vRealize Automation 8.x, ensure cold (powered-down) snapshots are taken as described in: vRealize Automation 8.x Preparations for Backing Up

It is highly encouraged to maintain a stringent backup procedure on a daily schedule.

It is recommended to have redundant network pathing between the ESXi hosts that host the vRealize Automation appliance nodes.


  1. On any of the vRealize Automation virtual appliances, run the following command once:
    vracli cluster exec -- touch /data/db/live/debug
Note: This creates a flag file on all cluster nodes that pauses the database pods when they start, so they can be worked with manually.
  2. Restart the postgres-1 and postgres-2 pods:
    kubectl delete pod -n prelude postgres-1; kubectl delete pod -n prelude postgres-2;
Note: This restarts the postgres-1 and postgres-2 pods. Because of the debug flag, they will wait instead of starting vPostgres.
  3. Identify the nodes on which the postgres-1 and postgres-2 pods are now running:
    kubectl get pods -n prelude -l name=postgres -o wide
  4. On the node where postgres-1 is running, execute the following command to remove the debug flag:
    rm /data/db/live/debug
  5. Run:
    kubectl -n prelude logs -f postgres-1
  6. Monitor the logs and ensure that postgres-1 discovers postgres-0 as the primary, re-syncs from it, and starts working. A message similar to the following is reported in the postgres-1 log if successful:
    '[repmgrd] monitoring primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state'
  7. Repeat the same procedure for postgres-2.
  8. Finally, remove the /data/db/live/debug file on the node where postgres-0 is running. A consolidated sketch of the full sequence is shown below.
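
The following is a consolidated reference sketch of the workaround above, assuming shell access to each vRealize Automation appliance (the Kubernetes nodes) so the debug flag can be removed on the node hosting each pod. It is not a script to run unattended; the per-node steps must be repeated for each pod while monitoring the logs.

    # Create the debug flag on every node so database pods pause on start (run once, on any appliance).
    vracli cluster exec -- touch /data/db/live/debug

    # Restart the replica pods; because of the flag they will wait instead of starting vPostgres.
    kubectl delete pod -n prelude postgres-1
    kubectl delete pod -n prelude postgres-2

    # Identify which node each postgres pod is now running on.
    kubectl get pods -n prelude -l name=postgres -o wide

    # On the node running postgres-1, remove the flag and watch the pod re-sync from postgres-0.
    rm /data/db/live/debug
    kubectl -n prelude logs -f postgres-1   # wait for: [repmgrd] monitoring primary node ... (ID: 100)

    # Repeat the previous two commands on the node running postgres-2, then
    # remove /data/db/live/debug on the node running postgres-0.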