Resolve RabbitMQ cluster issues in vRA 8.x deployment

Article ID: 319575


Updated On: 01-29-2025

Products

VMware Aria Suite

Issue/Introduction

  • The following symptoms are noticed:
    • Deployment requests fail with errors such as:
      • Failed to publish event to topic: Deployment resource action requested
      • Failed to publish event to topic: Deployment requested
    • Requests do not proceed and remain stuck in different life-cycle states for a long time until a time-out is reached.
    • All deployment requests start failing and a restart of the node(s) is necessary to bring the environment back.
    • An alert is raised every 10-14 days from VMware Aria Operations (vROps): Description: Aria Automation is Down. Object Name: ebs
    • The following entries appear in the EBS app-server logs (a log search sketch follows the excerpt below):

       The mapper [reactor.rabbitmq.Receiver$ChannelCreationFunction] returned a null value.

      computing metrics in newChannel: null
      2023-11-01T09:14:03.038Z DEBUG event-broker [host='ebs-app-5c66ffc6df-cmtjb' thread='main-pool-35' user='' org='' trace='1XXXXX8-9XXX6-4XXa-bXX5-aXXXXXXXXXXXc' request-trace=''] c.v.a.e.b.s.EventBrokerConfiguration.lambda$initialize$0:123 - Operator Error: (NullPointerException) The mapper [reactor.rabbitmq.Receiver$ChannelCreationFunction] returned a null value.
         java.lang.NullPointerException: The mapper [reactor.rabbitmq.Receiver$ChannelCreationFunction] returned a null value.
         at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onNext(FluxMapFuseable.java:115)
         at reactor.core.publisher.Operators$ScalarSubscription.request(Operators.java:2400)
         at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.request(FluxMapFuseable.java:171)
         at io.opentracing.contrib.reactor.TracedSubscrib
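
      To confirm the symptom, you can search the event-broker (ebs-app) pod logs for this error. A minimal sketch, assuming the prelude namespace and the app=ebs-app label used in the workaround below (the grep pattern is only an illustration):

      kubectl -n prelude logs --selector=app=ebs-app --tail=-1 | grep -i "returned a null value"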

     



Environment

  • VMware Aria Automation 8.x

Cause

  • Suspending a vRA node, or network partitioning between the vRA nodes in a clustered deployment, results in connectivity issues between the RabbitMQ cluster members, which can leave RabbitMQ in a de-clustered state.
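
A quick way to check whether the cluster currently reports a partition, as a sketch only (it reuses the prelude namespace and the rabbitmq-ha-0 pod name from the workaround below; the grep window is just an illustration):

kubectl -n prelude exec rabbitmq-ha-0 -- rabbitmqctl cluster_status | grep -i -A3 partitions

An empty partitions list, together with all cluster members listed under "running_nodes", indicates a healthy cluster.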

Resolution

  • There is no resolution for this issue at the moment, as it depends on the RabbitMQ cluster resilience.
  • Workaround:

To work around the issue, reset the RabbitMQ cluster:

  1. SSH login to one of the nodes in the vRA cluster.
  2. Check the rabbitmq-ha pods status:
root@vra-node [ ~ ]# kubectl -n prelude get pods --selector=app=rabbitmq-ha
NAME            READY   STATUS    RESTARTS   AGE
rabbitmq-ha-0   1/1     Running   0          3d16h
rabbitmq-ha-1   1/1     Running   0          3d16h
rabbitmq-ha-2   1/1     Running   0          3d16h
  3. If all rabbitmq-ha pods are healthy, check the RabbitMQ cluster status for each of them:
seq 0 2 | xargs -n 1 -I {} kubectl exec -n prelude rabbitmq-ha-{} -- bash -c "rabbitmqctl cluster_status"

NOTE: Analyze the command output for each RabbitMQ node and verify that the "running_nodes" list contains all cluster members from the "nodes > disc" list (an optional scripted check is sketched after this procedure):
[{nodes,
  [{disc,
    ['rabbit@rabbitmq-ha-0.rabbitmq-ha-discovery.prelude.svc.cluster.local',
     'rabbit@rabbitmq-ha-1.rabbitmq-ha-discovery.prelude.svc.cluster.local',
     'rabbit@rabbitmq-ha-2.rabbitmq-ha-discovery.prelude.svc.cluster.local']}]},
 {running_nodes,
  ['rabbit@rabbitmq-ha-2.rabbitmq-ha-discovery.prelude.svc.cluster.local',
   'rabbit@rabbitmq-ha-1.rabbitmq-ha-discovery.prelude.svc.cluster.local',
   'rabbit@rabbitmq-ha-0.rabbitmq-ha-discovery.prelude.svc.cluster.local']}
 ...]
  4. If the "running_nodes" list does not contain all RabbitMQ cluster members, RabbitMQ is in a de-clustered state and needs to be manually reconfigured. For example:
[{nodes,
 [{disc,
['rabbit@rabbitmq-ha-0.rabbitmq-ha-discovery.prelude.svc.cluster.local',
'rabbit@rabbitmq-ha-1.rabbitmq-ha-discovery.prelude.svc.cluster.local',
'rabbit@rabbitmq-ha-2.rabbitmq-ha-discovery.prelude.svc.cluster.local']}]},
{running_nodes,
['rabbit@rabbitmq-ha-0.rabbitmq-ha-discovery.prelude.svc.cluster.local']}
..]
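
As an optional scripted check, a sketch only (it assumes the same three rabbitmq-ha-0 to rabbitmq-ha-2 pod names and the prelude namespace; the grep window is just an illustration), you can print the "running_nodes" section of every member side by side and compare it with the "disc" list:

seq 0 2 | xargs -n 1 -I {} sh -c 'echo "--- rabbitmq-ha-{} ---"; kubectl exec -n prelude rabbitmq-ha-{} -- rabbitmqctl cluster_status | grep -A3 running_nodes'

Any member whose "running_nodes" output is shorter than the "disc" list is out of the cluster.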

To reconfigure the RabbitMQ cluster, complete the steps below:
  1. SSH login to one of the vRA nodes.
  2. Reconfigure the RabbitMQ cluster: "vracli reset rabbitmq"
root@vra_node [ ~ ]# vracli reset rabbitmq
'reset rabbitmq' is a destructive command. Type 'yes' if you want to continue, or 'no' to stop: yes
  3. Wait until all rabbitmq-ha pods are re-created and healthy: "kubectl -n prelude get pods --selector=app=rabbitmq-ha"
NAME            READY   STATUS    RESTARTS   AGE
rabbitmq-ha-0   1/1     Running   0          9m53s
rabbitmq-ha-1   1/1     Running   0          9m35s
rabbitmq-ha-2   1/1     Running   0          9m14s
  4. Delete the ebs pods: "kubectl -n prelude delete pods --selector=app=ebs-app".
  5. Wait until all ebs pods are re-created and ready: "kubectl -n prelude get pods --selector=app=ebs-app".
NAME                       READY   STATUS    RESTARTS   AGE
ebs-app-84dd59f4f4-jvbsf   1/1     Running   0          2m55s
ebs-app-84dd59f4f4-khv75   1/1     Running   0          2m55s
ebs-app-84dd59f4f4-xthfs   1/1     Running   0          2m55s
  6. The RabbitMQ cluster is reconfigured. Request a new Deployment to verify that it completes successfully.
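
Optionally, as a final verification sketch (it reuses the commands from the diagnostics above; the grep pattern is only an illustration), confirm that the cluster again lists all members and that the ebs-app pods no longer log the mapper error:

kubectl -n prelude exec rabbitmq-ha-0 -- rabbitmqctl cluster_status
kubectl -n prelude logs --selector=app=ebs-app --tail=200 | grep -i "returned a null value"

No output from the grep command is expected on a healthy system.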