VMware Aria Automation 8.x
This procedure is intended for replacing a faulty node and is not designed for reusing the same node afterward. When the faulty node leaves the Kubernetes cluster, the action plan below also cleans up the local volume data on that node. Most Orchestrator configuration data is stored in the database and synchronized from there; however, files that were added manually, such as Kerberos configuration files or custom database drivers, must be reapplied after the faulty node is replaced.
There is no official list of the files impacted when a node leaves and rejoins the cluster, but manually added files such as the Kerberos configuration files and custom database drivers mentioned above are known to be affected and must be copied back manually.
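As an illustration only, the sketch below shows one way to stage such files off the faulty node before it is removed, so they can be copied back once the replacement node has rejoined. The node name, backup directory, and file paths (the krb5.conf and database driver locations) are hypothetical placeholders, not official locations; substitute whatever was added manually in your environment.

#!/bin/bash
# Sketch only: stage manually added Orchestrator files from the faulty node
# so they can be restored after the node is replaced and rejoined.
# All paths below are placeholders; adjust them to your environment.

FAULTY_NODE="faulty-node-FQDN"            # node being replaced
BACKUP_DIR="/tmp/node-backup"             # local staging directory
FILES_TO_SAVE=(
  "/path/to/krb5.conf"                    # placeholder: Kerberos configuration
  "/path/to/custom-db-driver.jar"         # placeholder: custom database driver
)

mkdir -p "$BACKUP_DIR"
for f in "${FILES_TO_SAVE[@]}"; do
  # Copy each file off the node; a failure is reported but does not stop the loop
  scp "root@${FAULTY_NODE}:${f}" "$BACKUP_DIR/" || echo "Could not copy ${f}"
done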
If a node is determined to be faulty and needs to be removed from and rejoined to the cluster, take the following steps:
1. From a healthy node, identify which node hosts the primary database:

kubectl get pod `vracli status | jq -r '.databaseNodes[] | select(.["Role"] == "primary") | .["Node name"]' | cut -d '.' -f 1` -n prelude -o wide --no-headers=true

Example output:
postgres-0 1/1 Running 0 39h ##.###.#.## healthy_node-fqdn-xxx-xx.company.com <none> <none>

2. From a healthy node, remove the faulty node from the cluster:

vracli cluster remove faulty-node-FQDN

3. On the node being joined, rejoin the cluster, pointing at the primary database node identified in step 1:

vracli cluster join primary-DB-node-FQDN

4. Redeploy the services by running the deployment script:

/opt/scripts/deploy.sh

5. Verify that all nodes are present and in the Ready state (see the health-check sketch after this list):

kubectl get nodes
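A minimal post-rejoin health check, assuming only the standard kubectl output columns (node STATUS in column 2, pod STATUS in column 3); it flags any node that is not Ready and any pod in the prelude namespace that is not Running or Completed:

# Sketch only: confirm the cluster recovered after deploy.sh finishes.
# Flag nodes that are not in the Ready state.
kubectl get nodes --no-headers | awk '$2 != "Ready" {print "Node not Ready: " $1}'

# Flag pods in the prelude namespace that are not Running or Completed.
kubectl get pods -n prelude --no-headers | awk '$3 != "Running" && $3 != "Completed" {print "Pod not healthy: " $1 " (" $3 ")"}'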
If the faulty node has a damaged etcd database or other damaged Kubernetes elements even after being removed from the cluster, you can reset the Kubernetes system by running this command on the faulty node:
vracli cluster leave
This can allow the faulty node to join the cluster in cases where the vracli cluster join command above hangs indefinitely (gives no output after 10-15 minutes).
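For reference, a short sketch of that recovery path, run on the faulty node itself; it only chains the commands already shown above, and assumes you substitute the primary database node FQDN identified in step 1 and then continue with the remaining steps of the main sequence (deploy.sh and verification):

# Sketch only: run on the faulty node if the initial join hangs.
PRIMARY_DB_NODE="primary-DB-node-FQDN"    # substitute the FQDN from step 1

vracli cluster leave                      # reset the local Kubernetes state
vracli cluster join "$PRIMARY_DB_NODE"    # retry joining the cluster
# Then continue with /opt/scripts/deploy.sh and kubectl get nodes as above.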