Troubleshooting VMware Aria Automation and Aria Orchestrator 8.x application start issues

Article ID: 322724

Products

VMware Aria Suite

Issue/Introduction

Symptoms

  • When starting VMware Aria Automation or Aria Automation Orchestrator 8.x with the deploy.sh script, the vco-server-app service fails to start.
  • Running the command kubectl get pods -n prelude shows the vco-app pod with a high restart count and a status of CrashLoopBackOff.

  • When describing the failing pod with kubectl describe pod <vco-app-pod-name> -n prelude, the event logs show a "Back-off restarting failed container" message.

    Representative output from the pod's Events section (ages, counts, and pod names will vary):
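      Events:
        Type     Reason   Age                  From     Message
        ----     ------   ----                 ----     -------
        Warning  BackOff  2m (x120 over 40m)   kubelet  Back-off restarting failed container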

Environment

VMware Aria Automation Orchestrator 8.x
VMware Aria Automation 8.x

Resolution

Troubleshooting and Workarounds

This document provides a collection of troubleshooting steps and workarounds for common issues related to VMware Aria Automation Orchestrator pods failing to start.


1. Initial Troubleshooting

Before attempting manual workarounds, validate that the issue you are experiencing does not match an existing known issue documented for your product version.


2. Reinitializing Orchestrator Pods

If the pods are stuck, you can force a reinitialization using one of the following methods.

Method A: Delete Pods Individually

  1. SSH into the Automation Orchestrator appliance.
  2. List the pods to get their full names:
    kubectl get pods -n prelude
  3. For each vco-app pod instance, run the delete command:
    kubectl delete pod -n prelude <vco-app-pod-name>
  4. Kubernetes will automatically recreate the pods. Monitor their status until they return to a Running state. A combined sketch of these steps follows below.
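The per-pod deletes in steps 2 and 3 can be combined into a single loop. This is a convenience sketch; it assumes the names of all affected pods contain the string "vco-app":

    # Delete every vco-app pod; the deployment controller recreates them
    for pod in $(kubectl get pods -n prelude -o name | grep vco-app); do
        kubectl delete -n prelude "$pod"
    done
    # Watch the replacement pods until they reach a Running status
    kubectl get pods -n prelude -w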

Method B: Scale Deployments Down and Up

  1. SSH into the Automation Orchestrator appliance.
  2. Scale down the vco-app and UI deployments to zero replicas:
    kubectl scale deployment vco-app --replicas=0 -n prelude
    kubectl scale deployment orchestration-ui-app --replicas=0 -n prelude
  3. Wait about two minutes for the pods to terminate completely.
  4. Scale the deployments back up, using 1 replica for a single-node deployment or 3 for a clustered deployment. You can then wait for the rollouts to finish, as shown after these steps.
    kubectl scale deployment vco-app --replicas=1 -n prelude
    kubectl scale deployment orchestration-ui-app --replicas=1 -n prelude
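Rather than polling kubectl get pods manually, you can wait for the scaled-up deployments to become available (a sketch; adjust the timeout to your environment):

    kubectl -n prelude rollout status deployment/vco-app --timeout=10m
    kubectl -n prelude rollout status deployment/orchestration-ui-app --timeout=10m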

3. Increasing Kubernetes Health Probe Timeouts

On older versions of Orchestrator, slow startup times can cause the pods to be terminated prematurely. You can increase the health probe timeouts to allow more time for services to initialize.

Note: The default probe values were improved in version 8.12.x. Do not apply these changes to versions 8.12.x or later.

  1. SSH into each Automation Orchestrator appliance in the cluster.
  2. Use a text editor (like vi) to open the deployment configuration file:
    vi /opt/charts/vco/templates/deployment.yaml
  3. Locate the livenessProbe and readinessProbe sections for the vco-server-app container.
  4. Modify the values for initialDelaySeconds, periodSeconds, and failureThreshold to increase the timeout period. For example:
           livenessProbe:
             failureThreshold: 20
             httpGet:
               path: /vco/api/health/liveness
               port: 8280
               scheme: HTTP
             initialDelaySeconds: 180
             periodSeconds: 30
             successThreshold: 1
             timeoutSeconds: 10
           readinessProbe:
             failureThreshold: 20
             httpGet:
               path: /vco/api/health/readiness
               port: 8280
               scheme: HTTP
             initialDelaySeconds: 180
             periodSeconds: 30
             successThreshold: 1
             timeoutSeconds: 10
  5. After saving the file on all nodes, run the deploy script to apply the changes (a verification command is shown after these steps):
    /opt/scripts/deploy.sh
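Once the deploy script completes, you can confirm that the running deployment picked up the new probe settings. This is a sketch; the jsonpath filter assumes the container is named vco-server-app, matching the deployment file above:

    kubectl -n prelude get deployment vco-app \
      -o jsonpath='{.spec.template.spec.containers[?(@.name=="vco-server-app")].livenessProbe}'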

4. Analyzing Java Heap Dumps (.hprof files)

Symptoms

  • Large *.hprof files are filling up the disk on the appliance.
  • After deleting the files, they are quickly regenerated, and the Orchestrator service crashes again.

Cause

A custom workflow or action may be consuming excessive Java heap memory, causing the Orchestrator service to crash and write its memory content to a .hprof file.

Procedure

  1. Download and install VisualVM on your local machine.
  2. Generate a heap dump on the Orchestrator appliance by following KB 94420.
  3. Copy the generated *.hprof file from the appliance to your local machine (see the sketch after this list).
  4. Open the .hprof file in VisualVM to identify which threads or workflows are consuming the most memory. This analysis is best performed by the workflow developer, who can then refactor the code for better memory management.
  5. As a temporary workaround, you can enable Safe Mode to prevent workflows from running automatically at startup. This is set under System Properties in Control Center.
    • Note: In version 8.18 and later, Control Center is removed. This property must be set using vracli vro commands as described in the official documentation.
  6. If pods continue to crash and generate heap dumps, it may be because Orchestrator is automatically retrying a failed workflow on restart. Cancel all running executions with the following command:
    vracli vro cancel executions
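For steps 2 and 3, the sketch below locates heap dumps on the appliance and retrieves one for analysis. The dump path, file name, and appliance host name are placeholders; the actual values vary by environment and version:

    # On the appliance: list any heap dumps together with their sizes
    find / -type f -name '*.hprof' -exec ls -lh {} \; 2>/dev/null

    # From your workstation: copy a dump for analysis in VisualVM
    scp root@<orchestrator-fqdn>:/<path-found-above>/<dump-file>.hprof .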

5. Increasing Java Heap Memory (Standalone Orchestrator Only)

Note: This procedure is only applicable for standalone VMware Aria Orchestrator instances and is not supported for the embedded Orchestrator in an Aria Automation deployment.

  1. First, increase the RAM of the Orchestrator virtual machine(s) in vCenter.
  2. Log in to the Orchestrator appliance command line over SSH as root.
  3. Run the following script to create a custom profile and resource definition file. Enter the command as a single line; its decoded contents are listed after these steps for reference.
    vracli cluster exec -- bash -c 'base64 -d <<< IyBDcmVhdGUgY3VzdG9tIHByb2ZpbGUgZGlyZWN0b3J5Cm1rZGlyIC1wIC9ldGMvdm13YXJlLXByZWx1ZGUvcHJvZmlsZXMvY3VzdG9tLXByb2ZpbGUvCgojIENyZWF0ZSB0aGUgcmVxdWlyZWQgZGlyZWN0b3J5IHRyZWUgdGhhdCB3aWxsIGJlIHVzZWQgd2hlbiB0aGUgcHJvZmlsZSBpcyBhY3RpdmUKbWtkaXIgLXAgL2V0Yy92bXdhcmUtcHJlbHVkZS9wcm9maWxlcy9jdXN0b20tcHJvZmlsZS9oZWxtL3ByZWx1ZGVfdmNvLwoKIyBDcmVhdGUgImNoZWNrIiBmaWxlIHRoYXQgaXMgYW4gZXhlY3V0YWJsZSBmaWxlIHJ1biBieSBkZXBsb3kgc2NyaXB0LgpjYXQgPDxFT0YgPiAvZXRjL3Ztd2FyZS1wcmVsdWRlL3Byb2ZpbGVzL2N1c3RvbS1wcm9maWxlL2NoZWNrCiMhL2Jpbi9iYXNoCmV4aXQgMApFT0YKY2htb2QgNzU1IC9ldGMvdm13YXJlLXByZWx1ZGUvcHJvZmlsZXMvY3VzdG9tLXByb2ZpbGUvY2hlY2sKCiMgQ29weSB2Uk8gcmVzb3VyY2UgbWV0cmljcyBmaWxlIHRvIHlvdXIgY3VzdG9tIHByb2ZpbGUKY2F0IDw8RU9GID4gL2V0Yy92bXdhcmUtcHJlbHVkZS9wcm9maWxlcy9jdXN0b20tcHJvZmlsZS9oZWxtL3ByZWx1ZGVfdmNvLzkwLXJlc291cmNlcy55YW1sCnBvbHlnbG90UnVubmVyTWVtb3J5TGltaXQ6IDYwMDBNCnBvbHlnbG90UnVubmVyTWVtb3J5UmVxdWVzdDogMTAwME0KcG9seWdsb3RSdW5uZXJNZW1vcnlMaW1pdFZjbzogNTYwME0KCnNlcnZlck1lbW9yeUxpbWl0OiA2RwpzZXJ2ZXJNZW1vcnlSZXF1ZXN0OiA1RwpzZXJ2ZXJKdm1IZWFwTWF4OiA0RwoKY29udHJvbENlbnRlck1lbW9yeUxpbWl0OiAxLjVHCmNvbnRyb2xDZW50ZXJNZW1vcnlSZXF1ZXN0OiA3MDBtCkVPRgpjaG1vZCA2NDQgL2V0Yy92bXdhcmUtcHJlbHVkZS9wcm9maWxlcy9jdXN0b20tcHJvZmlsZS9oZWxtL3ByZWx1ZGVfdmNvLzkwLXJlc291cmNlcy55YW1sCg== | bash'
  4. Edit the newly created resource metrics file:
    vi /etc/vmware-prelude/profiles/custom-profile/helm/prelude_vco/90-resources.yaml
  5. Modify the memory and heap values as needed. For example, to increase the heap to 7G, you would set the following values (this requires a corresponding increase in the VM's total RAM):
    serverMemoryLimit: 9G
    serverMemoryRequest: 8G
    serverJvmHeapMax: 7G
  6. Save the changes and apply them by running the deploy script:
    /opt/scripts/deploy.sh
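For reference, the base64 payload in step 3 decodes to the following script. It creates the custom profile directory, an executable check file run by the deploy script, and the 90-resources.yaml file seeded with default resource values (the same file edited in steps 4 and 5):

    # Create custom profile directory
    mkdir -p /etc/vmware-prelude/profiles/custom-profile/

    # Create the required directory tree that will be used when the profile is active
    mkdir -p /etc/vmware-prelude/profiles/custom-profile/helm/prelude_vco/

    # Create "check" file that is an executable file run by deploy script.
    cat <<EOF > /etc/vmware-prelude/profiles/custom-profile/check
    #!/bin/bash
    exit 0
    EOF
    chmod 755 /etc/vmware-prelude/profiles/custom-profile/check

    # Copy vRO resource metrics file to your custom profile
    cat <<EOF > /etc/vmware-prelude/profiles/custom-profile/helm/prelude_vco/90-resources.yaml
    polyglotRunnerMemoryLimit: 6000M
    polyglotRunnerMemoryRequest: 1000M
    polyglotRunnerMemoryLimitVco: 5600M

    serverMemoryLimit: 6G
    serverMemoryRequest: 5G
    serverJvmHeapMax: 4G

    controlCenterMemoryLimit: 1.5G
    controlCenterMemoryRequest: 700m
    EOF
    chmod 644 /etc/vmware-prelude/profiles/custom-profile/helm/prelude_vco/90-resources.yaml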

Additional Information

Impact/Risks:
VMware Aria Automation or Aria Automation Orchestrator fails to start properly. Workflows cannot run until the issue is resolved.