Troubleshooting vRealize Automation 7.x Manager Service Failover

Products

VMware Aria Suite

Issue/Introduction

Symptoms:

Deployments are stuck at 99% with no progress after hours.

Environment

VMware vRealize Automation 7.x

Cause

Known issues around CNAME usage vs load balancing etc. See load balancing best practices here: Load balancing requirements for VMware vRealize Automation (multiple versions)

Resolution

Numerous situations can lead to a degraded Manager service failover state.

Troubleshooting:

Enable Automatic Manager Service Failover

Automatic Manager service failover is turned on by default if you've installed the environment in an HA config and not if you've extended it post install with the Windows installer, this may dictate what configuration we later, see above.

Could you RDP into the IaaS Manager Service nodes, I assume there is 2, and confirm the current failover configuration by reviewing the status of the vCloud Automation Center Service and validate what the Startup Type is set to? If we are configured to allow for automatic service failover, both services should be running and set to Automatic. If we are configured with Manager service failover disabled, one of the nodes will have a Startup Type of Manual.
When reviewing the log file for the Manager Service, "C:\Program Files (x86)\VMware\vCAC\Server\Logs\All.log", we expect to ONLY see 1 manager node responding to ping reports similar to the screen shot below:
We then want to trace the your DNS and FQDNs + the load balancer entries. We expect to have 3 DNS entries in the environment to support automatic failover:

VIP A Record in DNS pointing to virtual server in your network load balancer (NSX / F5 / Netscaler etc)
Node 01 A record in DNS pointing to 1st of the Windows Servers hosting the Manager service component
Node 02 A record in DNS pointing to 2nd of the Windows servers hosting the Manager service component

Once we have this information we can leverage the below health checks to determine the status of the virtual machine provisioning service URLs VMPS that deliver and process workitems for the other IaaS components:

- https://FQDN-IaaS-Manager/VMPSProvision
- https://FQDN-IaaS-Manager/VMPS2

Note: There are some additional failover parameters and URLs located within the Manager.Service.exe.config file located under C:\Program Files (x86)\VMware\vCAC\Server\ or in whatever installation folder the component was directed to install to which can be leveraged in reference for the URLs we test with above.

Using these health URLs formatted for each DNS entry in your environment, you can test their response in your browser or cURL and/or openssl s_client -connect (performed from vRA appliance or you must download the openssl utility to the Windows nodes)
When testing we expect only one node behind the VIP to be hosting the website for the manager. If both web services start up, you will begin to see intermittent and sporadic behavior in provisioning a day 2 operations (most likely what you are seeing now).

With this information we will most likely need to investigate the load balancer configuration should the issue not be identified within steps #1-5.