vRO workflows fail with RemoteDisconnected error in VMware Aria Automation

Article ID: 439418

Products

VCF Automation

Issue/Introduction

In a multi-node, highly available VMware Aria Automation environment fronted by a load balancer, calls to the VMware Aria Automation (vRA) API can fail intermittently. Specifically, vRealize Orchestrator (vRO) workflows that run Python scripts against the IaaS API may fail and report the following error in the workflow logs or in the calling client:

Unable to pull project information...
('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
ebs.error.code=10040

Additionally, you may observe the load balancer health monitor intermittently flapping or marking a specific node as DOWN for brief periods (e.g., 30 seconds to 8 minutes).

Environment

VMware Aria Automation 8.18.1 (Multi-node HA cluster)
External Load Balancer (e.g., Citrix NetScaler)

Cause

This issue occurs when the provisioning-service-app pod on a specific node begins consuming excessive CPU cycles. This localized CPU spike starves the health-reporting-app pod of resources, causing latency on the port 8008 /health endpoint to drift past the load balancer's configured response timeout (typically 6 seconds).
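
To see whether a node's health endpoint is drifting toward the load balancer's timeout, its latency can be sampled directly. The following is a minimal sketch, not a supported tool: the plain-HTTP URL form `http://<node>:8008/health`, the example hostname, and the 75% warning threshold are assumptions made for illustration. Only the port, the /health path, and the 6-second timeout come from this article.

```python
import time
import urllib.error
import urllib.request

# Typical load balancer response timeout quoted in this article.
LB_RESPONSE_TIMEOUT_S = 6.0


def classify(latency_s, timeout_s=LB_RESPONSE_TIMEOUT_S):
    """Compare a measured latency with the health monitor timeout.

    The 75% "near-timeout" threshold is an arbitrary illustrative choice.
    """
    if latency_s >= timeout_s:
        return "timeout"
    if latency_s >= 0.75 * timeout_s:
        return "near-timeout"
    return "ok"


def probe(url, timeout_s=LB_RESPONSE_TIMEOUT_S):
    """Time one GET against a health URL and return (status, latency, verdict)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            resp.read()
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code   # the node answered, but with an error status
    except OSError:
        status = None       # refused, timed out, or closed without a response
    latency_s = time.monotonic() - start
    verdict = "unreachable" if status is None else classify(latency_s, timeout_s)
    return status, latency_s, verdict


# Hypothetical usage, run from a host that can reach the node directly:
#   print(probe("http://vra-node1.example.com:8008/health"))
```

Sampling each node separately, rather than through the load balancer's virtual IP, is what makes a single slow node visible.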

Until the load balancer has accumulated enough failed health probes to mark the node DOWN (the detection gap), it continues to route API traffic to the unhealthy node. The node accepts the TCP connection at the SSL bridge layer but closes it without responding, which triggers the RemoteDisconnected error in the calling workflow or client.
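
The client-side symptom can be reproduced locally with the Python standard library. This is an illustrative sketch only: the throwaway local server below merely stands in for a peer that accepts a connection and closes it without replying, and the function names are hypothetical.

```python
import http.client
import socket
import threading


def start_silent_server():
    """Listen on a free local port; accept one connection and close it unanswered."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]

    def serve_once():
        conn, _ = listener.accept()
        conn.recv(65536)  # read the request so the close is orderly, not a reset
        conn.close()
        listener.close()

    threading.Thread(target=serve_once, daemon=True).start()
    return port


def request_against_silent_server():
    """Send one GET to the silent server and return the resulting error text."""
    port = start_silent_server()
    client = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        client.request("GET", "/")
        client.getresponse()
    except http.client.RemoteDisconnected as exc:
        return str(exc)
    finally:
        client.close()
    return None
```

`http.client.RemoteDisconnected` is a subclass of the built-in `ConnectionResetError`, so handlers written for `ConnectionError` also catch it when the standard library is used directly.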

Resolution

To resolve the CPU starvation and stabilize the affected node, you must force Kubernetes to recreate the provisioning pod. You should also adjust your load balancer and workflow configurations to prevent future connection drops.
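
Locating the provisioning-service-app pod on the affected node, which the steps below start with, can be sketched by filtering the JSON output of `kubectl get pods -n prelude -o json`. This sketch assumes that Python 3 and `kubectl` are available in the shell where it runs. The node names and pod name suffixes in the usage comment are hypothetical.

```python
import json
import subprocess


def pods_on_node(pod_list, node_name, name_prefix="provisioning-service-app"):
    """Return names of pods with the given prefix that are scheduled on node_name.

    pod_list is the parsed output of `kubectl get pods -n prelude -o json`.
    """
    return [
        item["metadata"]["name"]
        for item in pod_list.get("items", [])
        if item["metadata"]["name"].startswith(name_prefix)
        and item.get("spec", {}).get("nodeName") == node_name
    ]


def fetch_prelude_pods():
    """Run kubectl and parse the pod list for the prelude namespace."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", "prelude", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Hypothetical usage on the appliance:
#   print(pods_on_node(fetch_prelude_pods(), "vra-node2.example.com"))
```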

1. Identify the unhealthy node and locate the name of the provisioning-service-app pod running on it.
2. SSH into your VMware Aria Automation appliance and run the following command to delete the affected pod. Kubernetes automatically recreates it, which clears any hung processes:
```
kubectl delete pod -n prelude <provisioning-service-pod-name>
```
3. Access your external load balancer configuration.
4. Increase the response timeout (resptimeout) on your VMware Aria Automation health monitor from 6 seconds to 15 seconds to avoid false negatives during heavy service load.
5. Update the custom Python code in your vRO workflows to include basic retry logic (e.g., 1 to 2 retries) for connection errors, so that the workflow can fail over to a healthy node if a single connection is dropped.
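
The retry logic for connection errors could look like the following minimal sketch. The helper name `call_with_retry` and its parameters are hypothetical, and it assumes the wrapped call is safe to repeat (for example, a read-only request such as pulling project information). By default it retries on the built-in `ConnectionError`, which covers `http.client.RemoteDisconnected`. Scripts that use the `requests` library would pass `requests.exceptions.ConnectionError` in `retry_on` instead.

```python
import http.client
import time


def call_with_retry(func, retries=2, delay_s=1.0, retry_on=(ConnectionError,)):
    """Call func(); on a connection error, wait and try again up to `retries` times.

    The last error is re-raised once the retries are exhausted.
    """
    attempt = 0
    while True:
        try:
            return func()
        except retry_on:
            if attempt >= retries:
                raise
            attempt += 1
            time.sleep(delay_s)


# Hypothetical usage inside a workflow script, where get_projects() is the
# existing function that calls the IaaS API:
#   projects = call_with_retry(get_projects, retries=2, delay_s=1.0)
```

A retry is more likely to reach a different node if each attempt opens a new connection rather than reusing a pooled one. Whether that happens depends on the HTTP client and on the load balancer's persistence settings.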

Additional Information

For an established example of resolving stuck processes by restarting the provisioning pod, refer to Broadcom KB 326017.
VMware Aria Automation Health API fails with 500 error. — Guidance on increasing timeouts for the Aria Automation health API when response times exceed 10 seconds.
How to obtain the VRA services status via API — Details on using the port 8008 health endpoint for cluster monitoring.
