How to remove and rejoin a faulty node in Aria Automation 8.x Cluster

Products

VMware Aria Suite

Issue/Introduction

This article describes how to remove a faulty node from a VMware Aria Automation 8.x cluster and rejoin it.

Environment

VMware Aria Automation 8.x

Resolution

Note:

This procedure is intended for replacing a faulty node and is not designed for reusing the same node afterward. As part of leaving the Kubernetes cluster, the procedure below cleans up local volume data on the faulty node. Most Orchestrator configuration data is stored in the database and synchronized from there; however, files that were added manually, such as Kerberos configuration files or custom database drivers, must be reapplied after the faulty node is replaced.

There is no official list of files impacted when a node leaves and rejoins a cluster, but the following files are known to be affected and need to be copied manually.

Kerberos configuration - the configuration file described here is affected: https://techdocs.broadcom.com/us/en/vmware-cis/aria/aria-automation/8-18/vro-using-plug-ins-8-18/manage-the-orchestrator-plug-ins/configure-kerberos-authentication.html
Custom database drivers - database drivers deployed to the locations specified here must be redeployed: https://techdocs.broadcom.com/us/en/vmware-cis/aria/aria-automation/8-18/vro-using-plug-ins-8-18/using-the-sql-plug-in/adding-a-mysql-connector-jar-file-to-vrealize-orchestrator.html
JavaScript file system access - changes to server file system access for workflows and JavaScript might be lost and reverted to defaults: https://techdocs.broadcom.com/us/en/vmware-cis/aria/aria-automation/8-18/vco-installing-and-configuring-8-18/setting-system-properties/setting-server-file-system-access-from-workflows-and-javascript/set-server-file-system-access-for-workflows.html

Procedure:

If a node has been determined to be faulty and must be removed from the cluster and rejoined, take the following steps:

In vCenter, take backup snapshots (without memory) of every appliance in the VMware Aria Automation HA configuration.
From a root command line on any healthy node, run the following command to identify the node that hosts the primary database:

kubectl get pod `vracli status | jq -r '.databaseNodes[] | select(.["Role"] == "primary") | .["Node name"]' | cut -d '.' -f 1` -n prelude -o wide --no-headers=true

Example output:
postgres-0 1/1 Running 0 39h ##.###.#.## healthy_node-fqdn-xxx-xx.company.com <none> <none>

Important: The primary database node must be one of the healthy nodes. If the primary database node is faulty, contact technical support instead of proceeding.
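The check described above can be scripted. The following is a minimal sketch, not part of the official procedure: the function names `primary_db_node` and `primary_is_healthy` are hypothetical, and it assumes the Kubernetes node name in the output matches the faulty node's FQDN exactly (including case). It reads the node from the third column from the end of the line, which is where the NODE column sits in `-o wide` output.

```shell
#!/bin/sh
# Sketch only: function names are illustrative, not part of vracli or kubectl.

# Print the node hosting the pod, given one line of
# `kubectl get pod ... -o wide --no-headers=true` output.
# The last three columns are NODE, NOMINATED NODE, READINESS GATES,
# so NODE is the third field from the end.
primary_db_node() {
    printf '%s\n' "$1" | awk 'NF >= 3 { print $(NF-2) }'
}

# Return 0 only when a node name was found and it differs from the faulty node.
# Assumption: node names are compared as exact strings.
primary_is_healthy() {
    node=$(primary_db_node "$1")
    [ -n "$node" ] && [ "$node" != "$2" ]
}

# Example usage on an appliance (commented out):
#   line=$(<the kubectl get pod command from the step above>)
#   if primary_is_healthy "$line" "faulty-node-FQDN"; then
#       echo "Primary database is on a healthy node; continue."
#   else
#       echo "Primary database is on the faulty node; contact support." >&2
#   fi
```

An empty or unparseable line is treated as a failure, so the sketch errs toward stopping rather than proceeding.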

From the root command line of the healthy node, remove the faulty node.

vracli cluster remove faulty-node-FQDN

From the faulty node, join the VMware Aria Automation cluster.

vracli cluster join primary-DB-node-FQDN

Log in as root to the command line of the primary database node.
Deploy services on the cluster by running the following script.

/opt/scripts/deploy.sh

Verify that the node has joined the cluster and is in the "Ready" state by running the following command:

kubectl get nodes
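If you want to script this verification, the following is a minimal sketch. The function name `report_not_ready` is hypothetical. It assumes the input is the output of `kubectl get nodes --no-headers`, where the first column is the node name and the second is the status, and it treats any status other than exactly `Ready` as not ready.

```shell
#!/bin/sh
# Sketch only: report_not_ready is an illustrative helper, not a product command.

# Read `kubectl get nodes --no-headers` output on stdin.
# Print the name of every node whose STATUS column is not exactly "Ready".
# Exit non-zero if any such node exists or if no nodes were listed at all.
report_not_ready() {
    awk '
        NF { total++ }
        NF && $2 != "Ready" { bad++; print $1 }
        END { if (total == 0 || bad > 0) exit 1 }
    '
}

# Example usage on an appliance (commented out):
#   if kubectl get nodes --no-headers | report_not_ready; then
#       echo "All nodes are Ready."
#   else
#       echo "One or more nodes are not Ready (listed above)." >&2
#   fi
```

A combined status such as `Ready,SchedulingDisabled` is reported as not ready by this sketch, which is the cautious choice here.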

Additional Information

If the faulty node still has a damaged etcd database or other damaged Kubernetes components after being removed from the cluster, you can reset Kubernetes on that node by running this command on the faulty node:

vracli cluster leave

This can allow the faulty node to join the cluster in cases where the vracli cluster join command above hangs indefinitely (giving no output after 10-15 minutes).