See the Workaround below for additional information.
Workaround:
Troubleshooting Steps:
- Validate that the issue does not match existing known issues:
- Validate the health of the environment:
- Troubleshoot Common Issues:
Procedure: Reinitialize each vco-app pod using Kubernetes Delete command
- SSH into the Automation Orchestrator 8.x appliance
- Verify pods using kubectl get pods -n prelude
- Run kubectl delete pod -n prelude vco-app-<UUID> for each vco-app pod instance
- Wait for each pod to be recreated and return to the Running state
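The delete step above can be sketched as a small shell snippet. The listing below is a simulated `kubectl get pods -n prelude` output (the pod names and column layout are assumptions for illustration); on a live appliance you would pipe the real `kubectl` output instead:

```shell
# Simulated `kubectl get pods -n prelude` output (names are hypothetical)
pods='NAME                      READY   STATUS    RESTARTS   AGE
vco-app-7d9f8b6c5-abcde   3/3     Running   0          2d
vco-app-7d9f8b6c5-fghij   3/3     Running   0          2d
orchestration-ui-app-1a   1/1     Running   0          2d'

# Select only the vco-app pod names; each would then be passed to
# `kubectl delete pod -n prelude <name>` so Kubernetes rebuilds it
vco_pods=$(printf '%s\n' "$pods" | awk '/^vco-app/ {print $1}')
printf '%s\n' "$vco_pods"
```

On a live system the equivalent one-liner would be `kubectl get pods -n prelude | awk '/^vco-app/ {print $1}' | xargs -r -n1 kubectl delete pod -n prelude`.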
Procedure: Reinitialize each vco-app pod using Kubernetes SCALE/UP commands
- SSH into the Automation Orchestrator 8.x appliance
- Verify how many vco-app instances are running using kubectl get pods -n prelude
- Run the following commands to scale the replicas down to zero, then wait two minutes:
kubectl scale deployment orchestration-ui-app --replicas=0 -n prelude
kubectl scale deployment vco-app --replicas=0 -n prelude
sleep 120
- Run the following commands to scale the replicas back up, using --replicas=1 for a single deployment or --replicas=3 for a clustered deployment:
kubectl scale deployment orchestration-ui-app --replicas=1 -n prelude
kubectl scale deployment vco-app --replicas=1 -n prelude
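Since the replica count differs between single and clustered deployments, the scale-up step can be parameterized. A minimal dry-run sketch (it only prints the commands rather than executing them; `NODE_COUNT` is a placeholder you would set to your actual node count):

```shell
NODE_COUNT=3   # 1 for a single-node deployment, 3 for a cluster

# Pick the replica count the procedure prescribes for each topology
if [ "$NODE_COUNT" -ge 3 ]; then
  REPLICAS=3
else
  REPLICAS=1
fi

# Dry run: print the commands instead of executing them
echo "kubectl scale deployment orchestration-ui-app --replicas=$REPLICAS -n prelude"
echo "kubectl scale deployment vco-app --replicas=$REPLICAS -n prelude"
```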
Procedure: Increasing Kubernetes health probe timeouts
Note: Improvements to these values were introduced in 8.12.x. Do not implement these instructions on versions 8.12.x and above.
- SSH into the Automation Orchestrator 8.x appliance.
- Use vi or vim to edit /opt/charts/vco/templates/deployment.yaml on each node in the cluster
- Edit the liveness and readiness probe sections for the vco-server-app container. An example is shown below:
livenessProbe:
failureThreshold: 3
httpGet:
path: /vco/api/health/liveness
port: 8280
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 30
name: vco-server-app
ports:
- containerPort: 8280
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /vco/api/health/readiness
port: 8280
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 5
successThreshold: 1
timeoutSeconds: 30
- Attempt to restart services by running:
/opt/scripts/deploy.sh
Note: To decrease the load on these timeouts:
- Ensure all web client tabs connecting to Automation Orchestrator are closed.
- If running on a single node, scale to a 3-node cluster.
- Upgrade to the latest version as this behavior is improved upon.
Analyzing *.hprof files (Java heap dumps from Automation Orchestrator)
Symptoms
- *.hprof files fill up a large amount of disk space.
- After the files are deleted to free disk space, a new *.hprof file is generated and the cycle repeats before a user can log in to the Automation Orchestrator service.
Cause
- Custom workflows and actions may be consuming more Java heap than the application can sustain, causing the heap to be written to disk in *.hprof format and crashing the Orchestrator service.
Workaround
- Contact your workflow developer. The following instructions are considered a development task when writing workflows for Automation Orchestrator.
- Download VisualVM to a system external to the Automation Orchestrator appliance.
- Extract the zip file.
- Generate a heap dump by following the instructions in "Error! 500 when attempting to generate a heap dump in Aria Orchestrator control center interface".
- Copy the *.hprof file to a location accessible by this external system.
- Start VisualVM. The executable is located in the ./bin directory.
- In the VisualVM explorer window, right-click on the Heap Dumps node and choose Load Heap Dump.
- Navigate to the location of your *.hprof file, select it, and click Open.
- Once the heap dump is loaded, it will appear as a node under Heap Dumps. Click on it to analyze the heap dump.
- Triage the issue by isolating threads or workflows consuming a large amount of memory.
- Refactor your code to be considerate of Java heap.
- If the issue persists, try enabling Safe Mode by setting ch.dunes.safe-mode = true in Control Center under System Properties.
Note: In 8.18 and later, the Control Center has been removed and the property must be set using the "vracli vro" commands as described in the documentation: Additional command line interface configuration options
- Monitor for the service to restart then try accessing Automation Orchestrator again.
NOTE: If the pods keep going into CrashLoopBackOff and generating heap dumps, it is likely that Orchestrator is automatically retrying the failed workflow when it restarts. In this situation, you will need to cancel all executions:
- Run this command on one node:
vracli vro cancel executions
- Then remove any new hprof files and restart the pods if necessary.
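The cleanup step can be sketched with `find`. The snippet below demonstrates the pattern on a throwaway directory (the real dump location varies by deployment, so the path here is an assumption; substitute wherever your *.hprof files accumulate):

```shell
# Demonstrate locating and deleting *.hprof files in a scratch directory
dumpdir=$(mktemp -d)
touch "$dumpdir/java_pid1234.hprof" "$dumpdir/server.log"

# Delete only the heap dumps, leaving other files alone
find "$dumpdir" -name '*.hprof' -type f -delete

ls "$dumpdir"
```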
- Scaling the heap memory of the vRealize Orchestrator Appliance is only applicable to standalone vRealize Orchestrator instances; it is not supported for vRealize Orchestrator embedded in vRealize Automation.
- Increase the RAM of the virtual machine on which vRealize Orchestrator is deployed to the next suitable increment. Because enough memory must remain available for the rest of the services, the vRealize Orchestrator Appliance resources must be scaled up first. For example, if the desired heap memory is 7G, the appliance RAM should be increased by 4G, because the difference between the desired heap memory and the default heap value of 3G is 4G.
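The sizing rule above is simple arithmetic: the RAM increase equals the desired heap minus the default 3G heap. A quick check:

```shell
DEFAULT_HEAP_G=3    # default heap value from the example above
DESIRED_HEAP_G=7    # target heap memory

# RAM must grow by the same amount the heap grows
RAM_INCREASE_G=$((DESIRED_HEAP_G - DEFAULT_HEAP_G))
echo "Increase appliance RAM by ${RAM_INCREASE_G}G"
```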
- Log in to the vRealize Orchestrator Appliance command line over SSH as root.
- To create the custom profile directory and the required directory tree that is used when the profile is active, run the following script:
vracli cluster exec -- bash -c 'base64 -d <<< IyBDcmVhdGUgY3VzdG9tIHByb2ZpbGUgZGlyZWN0b3J5Cm1rZGlyIC1wIC9ldGMvdm13YXJlLXByZWx1ZGUvcHJvZmlsZXMvY3VzdG9tLXByb2ZpbGUvCgojIENyZWF0ZSB0aGUgcmVxdWlyZWQgZGlyZWN0b3J5IHRyZWUgdGhhdCB3aWxsIGJlIHVzZWQgd2hlbiB0aGUgcHJvZmlsZSBpcyBhY3RpdmUKbWtkaXIgLXAgL2V0Yy92bXdhcmUtcHJlbHVkZS9wcm9maWxlcy9jdXN0b20tcHJvZmlsZS9oZWxtL3ByZWx1ZGVfdmNvLwoKIyBDcmVhdGUgImNoZWNrIiBmaWxlIHRoYXQgaXMgYW4gZXhlY3V0YWJsZSBmaWxlIHJ1biBieSBkZXBsb3kgc2NyaXB0LgpjYXQgPDxFT0YgPiAvZXRjL3Ztd2FyZS1wcmVsdWRlL3Byb2ZpbGVzL2N1c3RvbS1wcm9maWxlL2NoZWNrCiMhL2Jpbi9iYXNoCmV4aXQgMApFT0YKY2htb2QgNzU1IC9ldGMvdm13YXJlLXByZWx1ZGUvcHJvZmlsZXMvY3VzdG9tLXByb2ZpbGUvY2hlY2sKCiMgQ29weSB2Uk8gcmVzb3VyY2UgbWV0cmljcyBmaWxlIHRvIHlvdXIgY3VzdG9tIHByb2ZpbGUKY2F0IDw8RU9GID4gL2V0Yy92bXdhcmUtcHJlbHVkZS9wcm9maWxlcy9jdXN0b20tcHJvZmlsZS9oZWxtL3ByZWx1ZGVfdmNvLzkwLXJlc291cmNlcy55YW1sCnBvbHlnbG90UnVubmVyTWVtb3J5TGltaXQ6IDYwMDBNCnBvbHlnbG90UnVubmVyTWVtb3J5UmVxdWVzdDogMTAwME0KcG9seWdsb3RSdW5uZXJNZW1vcnlMaW1pdFZjbzogNTYwME0KCnNlcnZlck1lbW9yeUxpbWl0OiA2RwpzZXJ2ZXJNZW1vcnlSZXF1ZXN0OiA1RwpzZXJ2ZXJKdm1IZWFwTWF4OiA0RwoKY29udHJvbENlbnRlck1lbW9yeUxpbWl0OiAxLjVHCmNvbnRyb2xDZW50ZXJNZW1vcnlSZXF1ZXN0OiA3MDBtCkVPRgpjaG1vZCA2NDQgL2V0Yy92bXdhcmUtcHJlbHVkZS9wcm9maWxlcy9jdXN0b20tcHJvZmlsZS9oZWxtL3ByZWx1ZGVfdmNvLzkwLXJlc291cmNlcy55YW1sCg== | bash'
- Edit the resource metrics file in your custom profile with the desired memory values.
vi /etc/vmware-prelude/profiles/custom-profile/helm/prelude_vco/90-resources.yaml
- The 90-resources.yaml file should contain the following default properties:
polyglotRunnerMemoryRequest: 1000M
polyglotRunnerMemoryLimit: 6000M
polyglotRunnerMemoryLimitVco: 5600M
serverMemoryLimit: 6G
serverMemoryRequest: 5G
serverJvmHeapMax: 4G
controlCenterMemoryLimit: 1.5G
controlCenterMemoryRequest: 700m
- Modify the 90-resources.yaml with the following properties (in a clustered deployment, on all 3 nodes):
polyglotRunnerMemoryRequest: 1000M
polyglotRunnerMemoryLimit: 7000M
polyglotRunnerMemoryLimitVco: 6700M
serverMemoryLimit: 9G
serverMemoryRequest: 8G
serverJvmHeapMax: 7G
controlCenterMemoryLimit: 1.5G
controlCenterMemoryRequest: 700m
- Save the changes to the resource metrics file and run the /opt/scripts/deploy.sh script.
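Before running deploy.sh, it is worth confirming the edited values stay internally consistent: the JVM heap must fit inside the container memory limit with headroom, and the request should not exceed the limit. A sketch of such a check (the parsing is an assumption; it relies on the simple `key: valueG` layout shown above):

```shell
# Write the example values from this article to a scratch copy of the file
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
serverMemoryLimit: 9G
serverMemoryRequest: 8G
serverJvmHeapMax: 7G
EOF

# Pull the numeric part of each G-suffixed value
get() { awk -F'[: G]+' -v k="$1" '$1 == k {print $2}' "$cfg"; }
limit=$(get serverMemoryLimit)
request=$(get serverMemoryRequest)
heap=$(get serverJvmHeapMax)

# Heap must be below the limit, and the request must not exceed the limit
[ "$heap" -lt "$limit" ] && [ "$request" -le "$limit" ] && echo "resource values consistent"
```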