Network isolation causes split-brain scenario in a 3 node cluster: Resetting vPostgres clustering


Article ID: 317721


Updated On:

Products

VMware Aria Suite

Issue/Introduction

This article provides instructions for monitoring and restoring a 3-node vPostgres cluster running within Kubernetes containers.

Symptoms:
  • The command vracli status shows multiple primary database nodes (see the example checks after this list).
  • vPostgres is unable to elect a single master node.
  • The prelude-noop-intnet-netcheck.log files within the pods/kube-system/prelude-noop-intnet-ds-***** directories contain entries similar to the following:
2019/12/31 08:27:04 Failed ping for 10.244.2.2, packet loss is 100.000000
2019/12/31 08:27:04 Failed ping for 10.244.1.5, packet loss is 100.000000
2019/12/31 08:27:04 Pinging the majority of nodes failed.
  • The 3-node Aria Automation 8.x cluster does not have redundant network paths.
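
For example, the first and third symptoms can be checked from any appliance node. This is a hedged sketch only; the netcheck log location varies by deployment (for instance inside a log bundle), so it is located with find here:

    # Check whether more than one node reports as a primary database
    vracli status

    # Locate the netcheck logs referenced above and search them for failed pings
    find / -name "prelude-noop-intnet-netcheck.log" -exec grep -H "Failed ping" {} \; 2>/dev/null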


Environment

Aria Automation 8.x

Cause

A 3-node vPostgres cluster can break down due to network isolation or connectivity loss between nodes, creating a split-brain scenario in which all three database instances run as primary (master).

Resolution

Ensure that valid snapshots have been taken prior to performing any actions. Do not create live snapshots. For vRealize Automation 8.x, ensure cold (powered-down) snapshots are taken.

It is highly encouraged to have a stringent backup procedure in place on a daily schedule.

It is recommended to have redundant network paths between the ESXi hosts that run the Aria Automation appliance nodes.
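
Purely as an illustrative sketch (not product tooling), cold snapshots could be scripted with the open-source govc CLI against vCenter. The VM name below is a placeholder, and the appliances should be shut down gracefully and in the order recommended by the product documentation:

    # Shut down the guest, wait for power-off, snapshot while cold, then power back on (repeat per appliance VM)
    govc vm.power -s vra-appliance-1
    # ...wait until the VM reports powered off...
    govc snapshot.create -vm vra-appliance-1 "cold-snapshot-before-vpostgres-reset"
    govc vm.power -on vra-appliance-1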

Workaround:

  1. On any one of the vRealize Automation virtual appliances, run the following command once:
    vracli cluster exec -- touch /data/db/live/debug
Note: This creates a flag file on all cluster nodes that pauses the database pods when they start, so that they can be worked with manually.
  2. Restart the postgres-1 and postgres-2 pods:
    kubectl delete pod -n prelude postgres-1; kubectl delete pod -n prelude postgres-2;
Note: This restarts the postgres-1 and postgres-2 pods. Due to the debug flag files, they will pause and wait instead of starting vPostgres.
  3. Identify the nodes on which the postgres-1 and postgres-2 pods are now running:
    kubectl get pods -n prelude -l name=postgres -o wide
  4. On the node where postgres-1 is running, execute the following command to remove the debug flag:
    rm /data/db/live/debug
  5. Run:
    kubectl -n prelude logs -f postgres-1
  6. Monitor the logs and ensure that postgres-1 discovers postgres-0 as the primary, re-syncs from it, and starts working. A message similar to the following is reported in the postgres-1 log if successful:
    '[repmgrd] monitoring primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state'
  7. Repeat the same steps for postgres-2 (see the sketch after this list).
  8. Finally, remove the /data/db/live/debug file on the node where postgres-0 is running.
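
The repeat for postgres-2 and the final cleanup reuse the same commands shown above. As a hedged sketch (run the rm commands on the nodes identified in step 3, for example over SSH):

    # On the node where postgres-2 is running, remove the debug flag and watch the pod recover
    rm /data/db/live/debug
    kubectl -n prelude logs -f postgres-2

    # On the node where postgres-0 is running, remove the remaining debug flag
    rm /data/db/live/debug

    # Optionally, confirm that the cluster now reports a single primary database node
    vracli status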