How to remove a vRealize Automation appliance from a cluster

Article ID: 343888

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

vRealize Automation does not provide a means to remove an appliance node from an existing cluster, which may be required for business reasons or for correcting an issue.

This article provides steps to remove the vRealize Automation appliance from a cluster.

Environment

VMware vRealize Automation 7.x

Resolution

Note:

  • This procedure has been tested on vRealize Automation 7.3.
  • The following steps can significantly impact the health of the vRealize Automation environment. It is strongly recommended that you back up and snapshot your environment so that the changes can be rolled back if issues are encountered.
  • It is assumed that the node to be removed is a replica and does not host the primary Postgres instance.
  • If you are using vRealize Automation 7.5, skip steps 2, 3, 5, 9, and 10.
  • IMPORTANT: If the environment this KB is being executed against has been hot patched with cumulative updates in 7.4 or 7.5, additional updates in PostgreSQL are required. Perform step 4 only if the environment has hot patches installed; otherwise, skip it.

To remove the node:

  1. Go to all directories in each tenant and verify that their connectors do not point to the failing node. Change them if necessary.
  2. Connect through SSH or the console to the replica node and extract the name of the node from RabbitMQ under the NODENAME variable in the following file:
    /etc/rabbitmq/rabbitmq-env.conf

    If the node is unavailable, the default name is rabbit@ followed by the node's short host name (node-short-domain-name in the example below):

    rabbit@node-short-domain-name

    Alternatively, the RabbitMQ node name is also displayed on the vRA Settings > Messaging tab of the VAMI (the web management interface at https://vra_appliance_node_fqdn:5480) on any of the remaining healthy nodes.
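
    The value can also be read directly on the replica with a quick check such as the following (assuming the default file location shown above):

    grep NODENAME /etc/rabbitmq/rabbitmq-env.conf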

  3. If SYNCHRONOUS replication is configured for automatic failover, switch it to ASYNC on the vRA Settings > Database tab before removing the clustered node.
  4. Modify hf_execution_cmd and hf_patch_nodes tables to allow for cascading deletes:
    1. SSH into primary appliance
    2. su postgres
    3. psql -d vcac
      ALTER TABLE hf_patch_nodes DROP CONSTRAINT hf_patch_nodes_node_Id_fkey;
      ALTER TABLE hf_patch_nodes ADD CONSTRAINT hf_patch_nodes_node_Id_fkey FOREIGN KEY (node_id) REFERENCES public.cluster_nodes (node_id) ON DELETE CASCADE;
      ALTER TABLE hf_execution_cmd DROP CONSTRAINT hf_execution_cmd_cmd_id_fkey;
      ALTER TABLE hf_execution_cmd ADD CONSTRAINT hf_execution_cmd_cmd_id_fkey FOREIGN KEY (cmd_id) REFERENCES public.cluster_commands (cmd_id) ON DELETE CASCADE;
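
      To confirm that both constraints were re-created with ON DELETE CASCADE before continuing, the table definitions can be inspected from the same psql session (an optional sanity check, not part of the original procedure):
      \d hf_patch_nodes
      \d hf_execution_cmd
      \q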
  5. Power down the Replica node.
  6. Log into the primary vRealize Automation Virtual Appliance Management Interface (VAMI)

    Example: https://<vra_appliance_node_fqdn>:5480

  7. Navigate to vRA Settings > Cluster tab.
  8. Remove the failing node from the cluster using the Delete button.
  9. In an SSH or console session on the primary vRA node, run the command to capture the registered node name:
    rabbitmqctl cluster_status

    Note: The node name may be in FQDN format. Ensure the correct name is used in the next command.

    rabbitmqctl forget_cluster_node rabbit@node-domain-name

    Replace rabbit@node-domain-name with the node name extracted from the replica in step 2 above (rabbit@node-short-domain-name in the earlier example).
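
    Optionally, confirm the removal before continuing (the removed node should no longer appear in the output):

    rabbitmqctl cluster_status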

  10. In an SSH or console session on the primary vRealize Automation appliance, run the commands:
    sed -i "/failed-node-fqdn/d" "/etc/haproxy/conf.d/10-psql.cfg" "/etc/haproxy/conf.d/20-vcac.cfg"
    service haproxy restart
    /usr/sbin/vcac-config cluster-config-ping-nodes --services haproxy

    The value failed-node-fqdn will be the FQDN of the replica node being removed.
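
    To confirm that the haproxy configuration no longer references the removed node, a check such as the following can be used; it should return no output (failed-node-fqdn is the same placeholder as above):

    grep -i "failed-node-fqdn" /etc/haproxy/conf.d/10-psql.cfg /etc/haproxy/conf.d/20-vcac.cfg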

  11. Log in to the vRA UI with a user from the vsphere.local domain who has Tenant Admin permissions on each tenant. This is needed to complete step 12.
  12. If there are directories whose connector was still pointing to the failing node (that is, directories that were not updated during step 1), delete and re-create them.
  13. In an SSH or console session on the primary vRealize Automation appliance, run these commands:
    echo "Delete from \"saas\".\"Connector\" where host like '%failed-node-fqdn%';" | su - postgres /opt/vmware/vpostgres/current/bin/psql vcac
    echo "Delete from \"saas\".\"OAuth2Client\" where \"OAuth2Client\".\"redirectUri\" LIKE '%failed-node-fqdn%';" | su - postgres /opt/vmware/vpostgres/current/bin/psql vcac
    echo "Delete from \"saas\".\"FederationArtifacts\" where \"FederationArtifacts\".\"strData\" LIKE '%failed-node-fqdn%';" | su - postgres /opt/vmware/vpostgres/current/bin/psql vcac
    echo "Delete from \"saas\".\"ServiceInstance\" where \"ServiceInstance\".\"hostName\" LIKE '%failed-node-fqdn%';" | su - postgres /opt/vmware/vpostgres/current/bin/psql vcac

    The value of failed-node-fqdn is the FQDN of the failed vRealize Automation appliance.

    Note: Some of the above commands may print a result DELETE 0 depending on the current configuration.

  14. In an SSH or console session on the primary vRealize Automation appliance, run the command:
    service elasticsearch restart
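
    Optionally, check cluster health after the restart before continuing (assuming the bundled Elasticsearch exposes the standard _cluster/health endpoint):

    curl -XGET 'http://localhost:9200/_cluster/health?pretty'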
  15. If curl -XGET 'http://localhost:9200/_nodes' executed on the current vRA primary node still returns "error" : "MasterNotDiscoveredException{waited for {30s}}", "status" : "503", run the following:
    echo "Select * from \"saas\".\"ServiceInstance\" ;" | su - postgres /opt/vmware/vpostgres/current/bin/psql vcac

    The result should not contain any records where hostName is failed-node-fqdn. For the primary node, if more than one record exists, keep only the one with the most recent createDate and delete the others using:

    echo "Delete from \"saas\".\"ServiceInstance\" where \"ServiceInstance\".\"id\" = <idNum>;" | su - postgres /opt/vmware/vpostgres/current/bin/psql vcac
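
    To make duplicate records easier to compare, a narrowed query such as the following can be used first (column names as referenced above):

    echo "Select \"id\", \"hostName\", \"createDate\" from \"saas\".\"ServiceInstance\" order by \"createDate\" desc;" | su - postgres /opt/vmware/vpostgres/current/bin/psql vcac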
  16. Validate that the deleted node was not the primary synchronization connector by confirming that isDirectorySyncEnabled is set to true (t) on one of the remaining connectors:
    echo "Select * from \"saas\".\"Connector\";" | su - postgres /opt/vmware/vpostgres/current/bin/psql vcac

    Note: For 3-node clusters, ensure that exactly one connector has isDirectorySyncEnabled set to t (true). If the remaining connectors all show f (false), run the following, replacing connector_name with the name of a remaining connector:

    echo "update \"saas\".\"Connector\" set \"isDirectorySyncEnabled\" = 't' where \"name\" = 'connector_name';" | su - postgres /opt/vmware/vpostgres/current/bin/psql vcac
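
    Re-running the query afterwards should show isDirectorySyncEnabled set to t for exactly one remaining connector, for example:

    echo "Select \"name\", \"isDirectorySyncEnabled\" from \"saas\".\"Connector\";" | su - postgres /opt/vmware/vpostgres/current/bin/psql vcac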
  17. (Optional) If you are using embedded vRO (version 7.1 or later), access the advanced Orchestrator Cluster Management page in Control Center to remove leftover records.

System Health Verification

  1. Verify that all services are registered in the VAMI: https://<vra_appliance_node_fqdn>:5480
  2. Verify RabbitMQ service status: service rabbitmq-server status.
  3. Verify rabbitmqctl cluster_status result does not include the failed node.
  4. Verify Elasticsearch cluster status: curl -XGET 'http://localhost:9200/_nodes'.
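
A minimal shell sketch that runs the command-line checks from steps 2 to 4 above in one pass on the primary appliance (for convenience only; interpret the output as described in the steps above):

  #!/bin/bash
  # Quick post-removal health check for the remaining vRealize Automation cluster
  service rabbitmq-server status                # RabbitMQ service status (step 2)
  rabbitmqctl cluster_status                    # removed node should not be listed (step 3)
  curl -XGET 'http://localhost:9200/_nodes'     # Elasticsearch node list (step 4)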