Resolve RabbitMQ cluster issues in vRA 8.x deployment

Article ID: 319575


Updated On:

Products

VMware Aria Suite

Issue/Introduction

  • The following symptoms are observed:
    • Failed to publish event to topic: Deployment resource action requested
    • Failed to publish event to topic: Deployment completed
    • Failed to publish event to topic: Deployment requested
    • Deployment requests do not proceed, or are stuck in different life-cycle states for a long time until a time-out is reached.
    • All deployment requests start failing and a restart of the node(s) is necessary to bring the environment back.
    • When you navigate to "Extensibility" in Assembler, the following error is shown: "Http failure response for https://FQDN/event-broker/api/subscriptions?page=0&size=20&%24filter=type%20eq%20%27RUNNABLE%27: 503 OK"
    • An alert is received every 10-14 days from vROps: Description: Aria Automation is Down. Object Name: ebs
    • The following log details are seen in /var/log/services-logs/prelude/ebs-app/file-logs/ebs-app.log:

      The mapper [reactor.rabbitmq.Receiver$ChannelCreationFunction] returned a null value.
            computing metrics in newChannel: null
            [timestamp] DEBUG #####-###### [host='ebs-app-##########-#####' thread='####-####-##' user='' org='' trace='1######-######-######-####5-a#########c' #######-#####=''] c.v.a.e.b.s.EventBrokerConfiguration.lambda$initialize$0:123 -Operator Error:(NullPointerException) 
            The mapper [reactor.rabbitmq.Receiver$ChannelCreationFunction] returned a null value.
               java.lang.NullPointerException: The mapper [reactor.rabbitmq.Receiver$ChannelCreationFunction] returned a null value.
               at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onNext(FluxMapFuseable.java:115)
    • You might also see the below error in /var/log/services-logs/prelude/ebs-app/file-logs/ebs-app.log (a quick way to search for these signatures is sketched after this list):

      Cause: : : com.vmware.automation.spring.webflux.platform.client.service.exception.WebClientServiceResponseException: ClientResponse has erroneous status code: 500 Internal Server Error. WebClientServiceResponseException.ErrorDetails(timestamp=##-##-##, path=/#####-######/api/events, type=reactor.core.Exceptions$RetryExhaustedException, errorCode=0, messageKey=null, messageArguments=null, message=Retries exhausted: 10/10, causeMessage=null, status=500, error=Internal Server Error, exception=null, additional={requestId=########-########, cause={message=Could not open RabbitMQ connection, @type=reactor.rabbitmq.RabbitFluxException, cause={message=Connection refused, @type=connect_error}}})
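To quickly check whether a node is affected, the log signatures above can be searched for directly. This is a minimal sketch only (the log path comes from the symptoms above; adjust the pattern as needed):

grep -iE "returned a null value|Could not open RabbitMQ connection|Retries exhausted" /var/log/services-logs/prelude/ebs-app/file-logs/ebs-app.log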

Environment

  • VMware Aria Automation 8.x

Cause

  • Suspending a vRA node, or network partitioning between the vRA nodes in clustered deployments, results in connectivity issues between the RabbitMQ cluster members, which can leave RabbitMQ in a de-clustered state.
    OR
  • The issue may also be caused by a channel leak: once the per-connection channel count exceeds 2047, the channel creation function returns null (see the check sketched below).
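If a channel leak is suspected, the per-connection channel count can be inspected. This is a sketch only, assuming the rabbitmq-ha-0 pod name used in the workaround below; any connection whose channel count approaches the 2047 default limit points to a leak:

kubectl exec -n prelude rabbitmq-ha-0 -- rabbitmqctl list_connections name channels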

Resolution

  • There is no permanent resolution for the issue at the moment, as this depends on RabbitMQ cluster resilience.
  • Workaround:

To work around the issue, reset the RabbitMQ cluster:

  1. SSH login to one of the nodes in the vRA cluster.
  2. Check the status of the rabbitmq-ha pods:
root@<hostname> [ ~ ]# kubectl -n prelude get pods --selector=app=rabbitmq-ha
NAME            READY   STATUS    RESTARTS   AGE
rabbitmq-ha-0   1/1     Running   0          3d16h
rabbitmq-ha-1   1/1     Running   0          3d16h
rabbitmq-ha-2   1/1     Running   0          3d16h
  3. If all rabbitmq-ha pods are healthy, check the RabbitMQ cluster status for each of them:
seq 0 2 | xargs -n 1 -I {} kubectl exec -n prelude rabbitmq-ha-{} -- bash -c "rabbitmqctl cluster_status"

NOTE: Analyze the command output for each RabbitMQ node and verify that the "running_nodes" list contains all cluster members from the "nodes > disc" list (a filtering sketch follows the examples below):
[{nodes,
     [{disc,
          ['rabbit@rabbitmq-ha-0.rabbitmq-ha-discovery.prelude.svc.cluster.local',
           'rabbit@rabbitmq-ha-1.rabbitmq-ha-discovery.prelude.svc.cluster.local',
           'rabbit@rabbitmq-ha-2.rabbitmq-ha-discovery.prelude.svc.cluster.local']}]},
 {running_nodes,
     ['rabbit@rabbitmq-ha-2.rabbitmq-ha-discovery.prelude.svc.cluster.local',
      'rabbit@rabbitmq-ha-1.rabbitmq-ha-discovery.prelude.svc.cluster.local',
      'rabbit@rabbitmq-ha-0.rabbitmq-ha-discovery.prelude.svc.cluster.local']}
 ...]
  4. If the "running_nodes" list does not contain all RabbitMQ cluster members, RabbitMQ is in a de-clustered state and needs to be manually reconfigured. For example:
[{nodes,
     [{disc,
          ['rabbit@rabbitmq-ha-0.rabbitmq-ha-discovery.prelude.svc.cluster.local',
           'rabbit@rabbitmq-ha-1.rabbitmq-ha-discovery.prelude.svc.cluster.local',
           'rabbit@rabbitmq-ha-2.rabbitmq-ha-discovery.prelude.svc.cluster.local']}]},
 {running_nodes,
     ['rabbit@rabbitmq-ha-0.rabbitmq-ha-discovery.prelude.svc.cluster.local']}
 ..]
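As referenced in the NOTE above, the "running_nodes" section can be extracted for all three nodes at once. This is a sketch that assumes the default rabbitmq-ha-0/1/2 pod names; a node whose "running_nodes" list is shorter than the three-member "disc" list is in the de-clustered state shown above:

for i in 0 1 2; do echo "== rabbitmq-ha-$i =="; kubectl exec -n prelude rabbitmq-ha-$i -- rabbitmqctl cluster_status | grep -A 4 running_nodes; done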

To reconfigure the RabbitMQ cluster, complete the steps below:
  1. SSH login to one of the vRA nodes
  2. Reconfigure the RabbitMQ cluster: "vracli reset rabbitmq"
root@<hostname> [ ~ ]# vracli reset rabbitmq
'reset rabbitmq' is a destructive command. Type 'yes' if you want to continue, or 'no' to stop: yes
  3. Wait until all rabbitmq-ha pods are re-created and healthy (a scripted wait is sketched after this list): "kubectl -n prelude get pods --selector=app=rabbitmq-ha"
NAME            READY   STATUS    RESTARTS   AGE
rabbitmq-ha-0   1/1     Running   0          9m53s
rabbitmq-ha-1   1/1     Running   0          9m35s
rabbitmq-ha-2   1/1     Running   0          9m14s
  4. Delete the ebs pods: "kubectl -n prelude delete pods --selector=app=ebs-app".
  5. Wait until all ebs pods are re-created and ready: "kubectl -n prelude get pods --selector=app=ebs-app".
NAME                 READY   STATUS    RESTARTS   AGE
ebs-app-######-###   1/1     Running   0          2m55s
ebs-app-######-###   1/1     Running   0          2m55s
ebs-app-######-###   1/1     Running   0          2m55s
  6. The RabbitMQ cluster is now reconfigured. Request a new Deployment to verify that it completes successfully.
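Optionally, the waits in steps 3 and 5 can be scripted and the cluster state re-checked after the reset. This is a sketch only, under the assumption that the pods have already been re-created; the 10-minute timeout is an arbitrary value and can be adjusted:

kubectl -n prelude wait --for=condition=Ready pod --selector=app=rabbitmq-ha --timeout=10m
kubectl -n prelude wait --for=condition=Ready pod --selector=app=ebs-app --timeout=10m
seq 0 2 | xargs -n 1 -I {} kubectl exec -n prelude rabbitmq-ha-{} -- bash -c "rabbitmqctl cluster_status"

All three cluster members should again appear under "running_nodes" on every node.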