How to remove and rejoin a faulty node in Aria Automation 8.x Cluster

Article ID: 345933

Products

VMware Aria Suite

Issue/Introduction

How to remove and rejoin a faulty node in Aria Automation 8.x Cluster.

Environment

 VMware Aria Automation 8.x

Resolution

If a node is determined to be faulty and must be removed from the cluster and rejoined, follow these steps.

  1. In vCenter, take non-memory backup snapshots of every appliance in the VMware Aria Automation HA configuration.
  2. From a root command line on any healthy node, run the following command to identify the node that hosts the primary database:
kubectl get pod `vracli status | jq -r '.databaseNodes[] | select(.["Role"] == "primary") | .["Node name"]' | cut -d '.' -f 1` -n prelude -o wide --no-headers=true
Example output:
postgres-0 1/1 Running 0 39h ##.###.#.## healthy_node-fqdn-xxx-xx.company.com <none> <none>
Important: The primary database node must be one of the healthy nodes. If the primary database node is faulty, contact technical support instead of proceeding.
  3. From the root command line of the healthy node, remove the faulty node.
vracli cluster remove faulty-node-FQDN
  4. From the faulty node, join the Aria Automation cluster.
vracli cluster join primary-DB-node-FQDN
  5. Log in as root to the command line of the primary database node.
  6. Deploy services on the cluster by running the following script.
/opt/scripts/deploy.sh
  7. Verify that the node has joined and is in the "Ready" state by running the following command:
kubectl get nodes
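
The final verification step can be scripted rather than read by eye. The following is a minimal sketch, not part of the official procedure: the function name not_ready_nodes is hypothetical, and it assumes the usual `kubectl get nodes --no-headers` layout in which the first column is the node name and the second is its status. Any status other than exactly "Ready" (for example "NotReady" or "Ready,SchedulingDisabled") is reported.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the appliance): reads the output of
# `kubectl get nodes --no-headers` on stdin and prints the name of every
# node whose STATUS column is not exactly "Ready".
not_ready_nodes() {
    awk 'NF >= 2 && $2 != "Ready" { print $1 }'
}

# Example use on an appliance (commented out so the sketch has no side effects):
#   pending=$(kubectl get nodes --no-headers | not_ready_nodes)
#   if [ -z "$pending" ]; then echo "all nodes Ready"; else echo "not Ready: $pending"; fi
```

An empty result means every listed node reports "Ready". The helper only filters text, so it does not confirm that the expected number of nodes is present.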

Additional Information

If the faulty node still has a damaged etcd database or other damaged Kubernetes elements after it has been removed from the cluster, you can reset Kubernetes on that node by running this command on the faulty node:

  • vracli cluster leave

This can allow the faulty node to join the cluster in cases where the vracli cluster join command above hangs indefinitely (no output after 10-15 minutes).
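
To make a hung join easier to spot, the command can be run under a time limit. This is a minimal sketch under stated assumptions: the function name run_with_limit is hypothetical, it relies on the coreutils `timeout` utility being available on the node, and whether it is safe to interrupt a join in a given environment is not confirmed by this article.

```shell
#!/bin/sh
# Hypothetical helper: runs a command under a time limit using coreutils
# `timeout`. Returns the command's own exit status, or 124 if the limit
# was reached (124 is the status `timeout` uses for a timed-out command).
run_with_limit() {
    limit="$1"
    shift
    status=0
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "no completion within $limit: $*" >&2
    fi
    return "$status"
}

# Example use on the faulty node (commented out; the FQDN is a placeholder):
#   run_with_limit 15m vracli cluster join primary-DB-node-FQDN \
#       || echo "join did not complete; consider 'vracli cluster leave' and retry"
```

A status of 124 corresponds to the hang described above, after which `vracli cluster leave` followed by a fresh join attempt is the path this article suggests.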