vRealize Automation does not provide a means to remove an appliance node from an existing cluster, which may be required for business reasons or for correcting an issue.
This article provides steps to remove the vRealize Automation appliance from a cluster.
VMware vRealize Automation 7.x
Note:
- This procedure has been tested on vRealize Automation 7.3
- The steps below can significantly impact the health of the vRealize Automation environment. It is strongly suggested that you back up and snapshot your environment so that the changes can be rolled back if issues are encountered.
- It is assumed that the node to be removed is a replica and does not have the primary postgres instance.
- If you are using vRealize Automation 7.5, skip steps 2, 3, 5, 9, and 10.
- IMPORTANT: If the environment this KB is being executed against has been hot patched with cumulative updates in 7.4 or 7.5, additional updates in PostgreSQL are required. Perform Step #4 if the environment has hot patches installed, otherwise skip this step.
To remove the node:
On the replica node to be removed, find the RabbitMQ node name in the NODENAME variable in the following file:
/etc/rabbitmq/rabbitmq-env.conf
If the node is unavailable, the default name would be the following assuming the FQDN is node-short-domain-name:
rabbit@node-short-domain-name
Alternatively, the RabbitMQ node name is also displayed in the vRA Settings > Messaging tab of the VAMI (i.e., the web management interface found on https://vra_appliance_node_fqdn:5480) of the other healthy nodes.
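The lookup in the file above can be scripted. The following is a minimal sketch, assuming the file contains a plain NODENAME=value assignment; the function name get_rabbit_nodename is hypothetical and not part of the appliance.

```shell
# Print the value of NODENAME from a rabbitmq-env.conf style file.
# Usage: get_rabbit_nodename [path]
# The path defaults to the file named in this article.
get_rabbit_nodename() {
    conf="${1:-/etc/rabbitmq/rabbitmq-env.conf}"
    # Keep only uncommented NODENAME= lines, take the last one,
    # and strip any surrounding quotes from the value.
    sed -n 's/^[[:space:]]*NODENAME=//p' "$conf" | tail -n 1 | tr -d "\"'"
}
```

Running get_rabbit_nodename with no argument on the replica reads the default file. If nothing is printed, fall back to the default name or the VAMI Messaging tab as described above.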
If SYNCHRONOUS is configured for automatic failover, swap to ASYNC under the database tab before removing the clustered nodes.
Modify the constraints on the hf_execution_cmd and hf_patch_nodes tables to allow for cascading deletes:
su postgres
psql -d vcac
ALTER TABLE hf_patch_nodes DROP CONSTRAINT hf_patch_nodes_node_Id_fkey;
ALTER TABLE hf_patch_nodes ADD CONSTRAINT hf_patch_nodes_node_Id_fkey FOREIGN KEY (node_id) REFERENCES public.cluster_nodes (node_id) ON DELETE CASCADE;
ALTER TABLE hf_execution_cmd DROP CONSTRAINT hf_execution_cmd_cmd_id_fkey;
ALTER TABLE hf_execution_cmd ADD CONSTRAINT hf_execution_cmd_cmd_id_fkey FOREIGN KEY (cmd_id) REFERENCES public.cluster_commands (cmd_id) ON DELETE CASCADE;
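To confirm that the recreated constraints now cascade, the PostgreSQL system catalog pg_constraint can be queried. The following is a sketch only: the function name fk_cascade_check_sql is hypothetical, and it assumes the constraint names are stored in lowercase, which is how PostgreSQL folds unquoted identifiers such as hf_patch_nodes_node_Id_fkey.

```shell
# Emit a query that lists the ON DELETE action of the two constraints.
# In pg_constraint, confdeltype 'c' means CASCADE and 'a' means NO ACTION.
fk_cascade_check_sql() {
    cat <<'SQL'
SELECT conname, confdeltype
FROM pg_constraint
WHERE contype = 'f'
  AND conname IN ('hf_patch_nodes_node_id_fkey', 'hf_execution_cmd_cmd_id_fkey');
SQL
}
```

As the postgres user, the output can be piped into the same client used above, for example fk_cascade_check_sql | psql -d vcac. Both rows should show c in the confdeltype column.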
Example: https://<vra_appliance_node_fqdn>:5480
rabbitmqctl cluster_status
Note! The node name may be in FQDN format. Ensure the correct name is used during the next step.
rabbitmqctl forget_cluster_node rabbit@node-short-domain-name
rabbit@node-short-domain-name is the value extracted from the replica in step 1 above.
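Whether the node is still listed can be checked by searching the cluster_status output for its name. The following is a minimal sketch; the function name node_still_listed is hypothetical, and it performs a plain substring match, so a name that is a prefix of another node's name can produce a false positive.

```shell
# Succeed (exit 0) if the node name given as $1 appears in the
# cluster_status text read from standard input, fail otherwise.
node_still_listed() {
    # -F treats the name as a fixed string, so dots in an FQDN are
    # not interpreted as regular-expression wildcards.
    grep -qF -- "$1"
}
```

For example, rabbitmqctl cluster_status | node_still_listed 'rabbit@node-short-domain-name' && echo "node still present" prints the message only while the node remains in the cluster.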
sed -i "/failed-node-fqdn/d" "/etc/haproxy/conf.d/10-psql.cfg" "/etc/haproxy/conf.d/20-vcac.cfg"
service haproxy restart
/usr/sbin/vcac-config cluster-config-ping-nodes --services haproxy
The value failed-node-fqdn will be the FQDN of the replica node being removed.
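Because sed -i edits the HAProxy files in place, it can help to preview the lines that would be deleted first. The following is a sketch under that assumption; the function name preview_haproxy_removal is hypothetical.

```shell
# Print, with file name and line number, every line that the sed
# command would delete for the given FQDN.
# Usage: preview_haproxy_removal <failed-node-fqdn> <file>...
preview_haproxy_removal() {
    fqdn="$1"
    shift
    # grep uses the same basic regular expression as the sed address,
    # so it selects the same lines; it returns non-zero if none match.
    grep -Hn -- "$fqdn" "$@"
}
```

For example, preview_haproxy_removal failed-node-fqdn /etc/haproxy/conf.d/10-psql.cfg /etc/haproxy/conf.d/20-vcac.cfg lists the affected entries without changing either file.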
Ensure that there is a local user in the vsphere.local domain who has Tenant Admin permissions on each tenant. This is needed to verify step 11.
echo "Delete from \"saas\".\"Connector\" where host like '%failed-node-fqdn%';" | su - postgres /opt/vmware/vpostgres/current/bin/psql vcac
echo "Delete from \"saas\".\"OAuth2Client\" where \"OAuth2Client\".\"redirectUri\" LIKE '%failed-node-fqdn%';" | su - postgres /opt/vmware/vpostgres/current/bin/psql vcac
echo "Delete from \"saas\".\"FederationArtifacts\" where \"FederationArtifacts\".\"strData\" LIKE '%failed-node-fqdn%';" | su - postgres /opt/vmware/vpostgres/current/bin/psql vcac
echo "Delete from \"saas\".\"ServiceInstance\" where \"ServiceInstance\".\"hostName\" LIKE '%failed-node-fqdn%';" | su - postgres /opt/vmware/vpostgres/current/bin/psql vcac
The value of failed-node-fqdn is the FQDN of the failed vRealize Automation appliance.
Note: Some of the above commands may print a result DELETE 0 depending on the current configuration.
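Before running the deletes, the number of matching rows in each table can be counted with equivalent SELECT statements. The following is a sketch that only generates the SQL text; the function name saas_preview_sql is hypothetical, while the table and column names are taken from the commands above.

```shell
# Emit one SELECT count(*) per table, using the same LIKE filter as the
# corresponding Delete command.
# Usage: saas_preview_sql <failed-node-fqdn>
saas_preview_sql() {
    fqdn="$1"
    cat <<SQL
SELECT count(*) FROM "saas"."Connector" WHERE host LIKE '%${fqdn}%';
SELECT count(*) FROM "saas"."OAuth2Client" WHERE "redirectUri" LIKE '%${fqdn}%';
SELECT count(*) FROM "saas"."FederationArtifacts" WHERE "strData" LIKE '%${fqdn}%';
SELECT count(*) FROM "saas"."ServiceInstance" WHERE "hostName" LIKE '%${fqdn}%';
SQL
}
```

The output can be piped to psql in the same way as the Delete commands. A count of 0 for a table corresponds to the DELETE 0 result mentioned in the note.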
service elasticsearch restart
If curl -XGET 'http://localhost:9200/_nodes' executed on the current vRA primary node still returns "error" : "MasterNotDiscoveredException{waited for {30s}}", "status" : "503", run the following:
echo "Select * from \"saas\".\"ServiceInstance\" ;" | su - postgres /opt/vmware/vpostgres/current/bin/psql vcac
The result should not contain any records where hostName is failed-node-fqdn. For the primary node, if more than one record exists, keep only the one with the most recent createDate and delete the others using:
echo "Delete from \"saas\".\"ServiceInstance\" where \"ServiceInstance\".\"id\" = <idNum>;" | su - postgres /opt/vmware/vpostgres/current/bin/psql vcac
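The Elasticsearch check can be repeated after the cleanup by scanning the response for the exception name. The following is a minimal sketch; the function name es_master_missing is hypothetical, and it tests only for the error string quoted in this article.

```shell
# Succeed (exit 0) if the response read from standard input still
# reports MasterNotDiscoveredException, fail otherwise.
es_master_missing() {
    grep -q 'MasterNotDiscoveredException'
}
```

For example, curl -s -XGET 'http://localhost:9200/_nodes' | es_master_missing && echo "master still not discovered" prints the message only while the error persists.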
Verify that isDirectorySyncEnabled is set to true (t):
select * from "saas"."Connector";
Note: For 3-node clusters, ensure that one connector has isDirectorySyncEnabled set to true. If the remaining connector shows f, run:
echo "update \"saas\".\"Connector\" set \"isDirectorySyncEnabled\" = 't' where \"name\" = 'connector_name';" | su - postgres /opt/vmware/vpostgres/current/bin/psql vcac
https://<vra_appliance_node_fqdn>:5480
service rabbitmq-server status.
rabbitmqctl cluster_status result does not include the failed node.
curl -XGET 'http://localhost:9200/_nodes'.