VMware Aria Automation builds fail with EBS event and workflow run errors after a network outage

Article ID: 433154

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

Machine builds and deployments fail in VMware Aria Automation. A new deployment fails to start and returns an event broker service (EBS) event error. An error similar to the following appears in the UI or in the deployment execution logs:

Failed to process ebs event: SubscriberID: vro-gateway-<UUID>, RunnableID: <RunnableID UUID> and SubscriptionID: <UUID>-sub_<UUID> failed with the following error: Workflow run [<workflow run UUID>] completed with error [Start workflow failed with client error! Possible reason is missing or invalid required parameter!]

Additionally, the internal pod logs show the following exception in ebs-app.log:

Cause: : : com.vmware.automation.spring.webflux.platform.client.service.exception.WebClientServiceResponseException: ClientResponse has erroneous status code: 500 Internal Server Error.
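
To confirm that a saved copy of the log contains this exception, a minimal sketch follows. It assumes ebs-app.log has already been copied to a file you can read; the function name count_ebs_500 and the file path argument are hypothetical and are not part of the product.

```shell
#!/bin/sh
# count_ebs_500 FILE
# Prints the number of lines in FILE that contain the WebClient HTTP 500 exception.
# FILE is wherever you saved ebs-app.log (hypothetical; this article does not give a path).
count_ebs_500() {
  grep -c 'WebClientServiceResponseException: ClientResponse has erroneous status code: 500' "$1"
}
```

A count greater than zero suggests the symptom described here; a count of zero means this exception is not present in that file.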

Environment

VMware Aria Automation 8.18.1

Cause

This issue occurs when a network outage interrupts communication between internal Kubernetes services. The network disruption causes the event broker service (ebs-app) and related messaging queues (such as rabbitmq) to enter a stale state where they can no longer process events or trigger workflow runs correctly, resulting in HTTP 500 Internal Server Errors.

Resolution

To resolve this issue, restart the kube-system pods and the Aria Automation application pods so that they re-establish healthy network connections:

  1. Log in to the VMware Aria Automation appliance via SSH.

  2. Restart the kube-system services by deleting their pods, which Kubernetes then recreates. Run the following command:

    kubectl -n kube-system delete pods --all
    
  3. Restart the prelude services by executing the deployment script:

    /opt/scripts/deploy.sh
    
  4. Monitor the pod startup process. Once all pods return to a Running state, attempt to deploy a new machine to confirm the issue is resolved.
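
The monitoring in step 4 can be scripted. The following is a minimal sketch, not an official procedure. It assumes the application pods run in a namespace named prelude and that kubectl prints its default columns (NAME, READY, STATUS, RESTARTS, AGE). The function names count_unhealthy and wait_for_pods are hypothetical.

```shell
#!/bin/sh
# count_unhealthy: reads `kubectl get pods --no-headers` output on stdin and
# prints the number of pods that are not fully ready.
# Assumed columns: NAME READY STATUS RESTARTS AGE.
count_unhealthy() {
  awk '
    NF == 0 { next }                          # skip blank lines
    $3 == "Completed" { next }                # finished job pods are expected
    {
      split($2, ready, "/")                   # READY looks like "1/1"
      if ($3 != "Running" || ready[1] != ready[2]) n++
    }
    END { print n + 0 }
  '
}

# wait_for_pods NAMESPACE [INTERVAL_SECONDS]
# Polls until at least one pod exists in NAMESPACE and none are unhealthy.
wait_for_pods() {
  ns=$1
  interval=${2:-30}
  while :; do
    pods=$(kubectl -n "$ns" get pods --no-headers 2>/dev/null)
    pending=$(printf '%s\n' "$pods" | count_unhealthy)
    if [ -n "$pods" ] && [ "$pending" -eq 0 ]; then
      echo "All pods in $ns are Running"
      return 0
    fi
    echo "$pending pod(s) in $ns not ready; retrying in ${interval}s"
    sleep "$interval"
  done
}
```

For example, run wait_for_pods kube-system after step 2 and wait_for_pods prelude after step 3. The loop has no timeout, so interrupt it with Ctrl+C if the pods never become ready.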

Additional Information

The ebs-app and rabbitmq service communication errors observed during this outage are consistent with the behaviors documented in KB 319575.