Resolve RabbitMQ cluster issues in vRA 8.x deployment

Article ID: 319575


Updated On:

Products

VMware Aria Suite

Issue/Introduction

  • The following symptoms are observed:
    • Failed to publish event to topic: Deployment resource action requested
    • Failed to publish event to topic: Deployment completed
    • Failed to publish event to topic: Deployment requested
    • Deployment requests do not proceed, or are stuck in different life-cycle states for a long time until a time-out is reached.
    • All deployment requests start failing and a restart of the node(s) is necessary to bring the environment back.
    • When you navigate to "Extensibility" in Assembler, the following error is shown: "Http failure response for https://FQDN/event-broker/api/subscriptions?page=0&size=20&%24filter=type%20eq%20%27RUNNABLE%27: 503 OK"
    • An alert is received every 10-14 days from vROps: Description: Aria Automation is Down. Object Name: ebs
    • The following log details are seen in /var/log/services-logs/prelude/ebs-app/file-logs/ebs-app.log:

      The mapper [reactor.rabbitmq.Receiver$ChannelCreationFunction] returned a null value.
            computing metrics in newChannel: null
            [timestamp] DEBUG #####-###### [host='ebs-app-##########-#####' thread='####-####-##' user='' org='' trace='1######-######-######-####5-a#########c' #######-#####=''] c.v.a.e.b.s.EventBrokerConfiguration.lambda$initialize$0:123 -Operator Error:(NullPointerException) 
            The mapper [reactor.rabbitmq.Receiver$ChannelCreationFunction] returned a null value.
               java.lang.NullPointerException: The mapper [reactor.rabbitmq.Receiver$ChannelCreationFunction] returned a null value.
               at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onNext(FluxMapFuseable.java:115)
    • You might also see the below error in /var/log/services-logs/prelude/ebs-app/file-logs/ebs-app.log (a quick way to search for these signatures is sketched after this list):

      Cause: : : com.vmware.automation.spring.webflux.platform.client.service.exception.WebClientServiceResponseException: ClientResponse has erroneous status code: 500 Internal Server Error. WebClientServiceResponseException.ErrorDetails(timestamp=##-##-##, path=/#####-######/api/events, type=reactor.core.Exceptions$RetryExhaustedException, errorCode=0, messageKey=null, messageArguments=null, message=Retries exhausted: 10/10, causeMessage=null, status=500, error=Internal Server Error, exception=null, additional={requestId=########-########, cause={message=Could not open RabbitMQ connection, @type=reactor.rabbitmq.RabbitFluxException, cause={message=Connection refused, @type=connect_error}}})
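To quickly check whether a node is affected, the log signatures above can be searched for directly. This is a minimal sketch only (the log path comes from the symptoms above; adjust the pattern as needed):

grep -iE "returned a null value|Could not open RabbitMQ connection|Retries exhausted" /var/log/services-logs/prelude/ebs-app/file-logs/ebs-app.log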

Environment

  • VMware Aria Automation 8.x

Cause

  • Suspending a vRA node, or network partitioning between the vRA nodes in clustered deployments, results in connectivity issues between the RabbitMQ cluster members, which can leave RabbitMQ in a de-clustered state.
    OR
  • The issue may also be caused by a channel leak: once the per-connection channel count exceeds 2047, the channel creation function returns null (see the check sketched below).
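If a channel leak is suspected, the per-connection channel count can be inspected. This is a sketch only, assuming the rabbitmq-ha-0 pod name used in the workaround below; any connection whose channel count approaches the 2047 default limit points to a leak:

kubectl exec -n prelude rabbitmq-ha-0 -- rabbitmqctl list_connections name channels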

Resolution

  • There is no permanent resolution for the issue at the moment, as this depends on RabbitMQ cluster resilience.
  • Workaround:

To work around the issue, reset the RabbitMQ cluster:

  1. SSH login to one of the nodes in the vRA cluster.
  2. Check the status of the rabbitmq-ha pods:
root@<hostname> [ ~ ]# kubectl -n prelude get pods --selector=app=rabbitmq-ha
NAME            READY   STATUS    RESTARTS   AGE
rabbitmq-ha-0   1/1     Running   0          3d16h
rabbitmq-ha-1   1/1     Running   0          3d16h
rabbitmq-ha-2   1/1     Running   0          3d16h
  3. If all rabbitmq-ha pods are healthy, check the RabbitMQ cluster status for each of them:
seq 0 2 | xargs -n 1 -I {} kubectl exec -n prelude rabbitmq-ha-{} -- bash -c "rabbitmqctl cluster_status"

NOTE: Analyze the command output for each RabbitMQ node and verify that the "running_nodes" list contains all cluster members from the "nodes > disc" list (a filtering sketch follows the examples below):
[{nodes,
     [{disc,
          ['rabbit@rabbitmq-ha-0.rabbitmq-ha-discovery.prelude.svc.cluster.local',
           'rabbit@rabbitmq-ha-1.rabbitmq-ha-discovery.prelude.svc.cluster.local',
           'rabbit@rabbitmq-ha-2.rabbitmq-ha-discovery.prelude.svc.cluster.local']}]},
 {running_nodes,
     ['rabbit@rabbitmq-ha-2.rabbitmq-ha-discovery.prelude.svc.cluster.local',
      'rabbit@rabbitmq-ha-1.rabbitmq-ha-discovery.prelude.svc.cluster.local',
      'rabbit@rabbitmq-ha-0.rabbitmq-ha-discovery.prelude.svc.cluster.local']}
 ...]
  4. If the "running_nodes" list does not contain all RabbitMQ cluster members, RabbitMQ is in a de-clustered state and needs to be manually reconfigured. For example:
[{nodes,
     [{disc,
          ['rabbit@rabbitmq-ha-0.rabbitmq-ha-discovery.prelude.svc.cluster.local',
           'rabbit@rabbitmq-ha-1.rabbitmq-ha-discovery.prelude.svc.cluster.local',
           'rabbit@rabbitmq-ha-2.rabbitmq-ha-discovery.prelude.svc.cluster.local']}]},
 {running_nodes,
     ['rabbit@rabbitmq-ha-0.rabbitmq-ha-discovery.prelude.svc.cluster.local']}
 ..]
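As referenced in the NOTE above, the "running_nodes" section can be extracted for all three nodes at once. This is a sketch that assumes the default rabbitmq-ha-0/1/2 pod names; a node whose "running_nodes" list is shorter than the three-member "disc" list is in the de-clustered state shown above:

for i in 0 1 2; do echo "== rabbitmq-ha-$i =="; kubectl exec -n prelude rabbitmq-ha-$i -- rabbitmqctl cluster_status | grep -A 4 running_nodes; done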

To reconfigure the RabbitMQ cluster, complete the steps below:
  1. SSH login to one of the vRA nodes
  2. Reconfigure the RabbitMQ cluster: "vracli reset rabbitmq"
root@<hostname> [ ~ ]# vracli reset rabbitmq
'reset rabbitmq' is a destructive command. Type 'yes' if you want to continue, or 'no' to stop: yes
  3. Wait until all rabbitmq-ha pods are re-created and healthy (a scripted wait is sketched after this list): "kubectl -n prelude get pods --selector=app=rabbitmq-ha"
NAME            READY   STATUS    RESTARTS   AGE
rabbitmq-ha-0   1/1     Running   0          9m53s
rabbitmq-ha-1   1/1     Running   0          9m35s
rabbitmq-ha-2   1/1     Running   0          9m14s
  4. Delete the ebs pods: "kubectl -n prelude delete pods --selector=app=ebs-app".
  5. Wait until all ebs pods are re-created and ready: "kubectl -n prelude get pods --selector=app=ebs-app".
NAME                 READY   STATUS    RESTARTS   AGE
ebs-app-######-###   1/1     Running   0          2m55s
ebs-app-######-###   1/1     Running   0          2m55s
ebs-app-######-###   1/1     Running   0          2m55s
  6. The RabbitMQ cluster is now reconfigured. Request a new Deployment to verify that it completes successfully.
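Optionally, the waits in steps 3 and 5 can be scripted and the cluster state re-checked after the reset. This is a sketch only, under the assumption that the pods have already been re-created; the 10-minute timeout is an arbitrary value and can be adjusted:

kubectl -n prelude wait --for=condition=Ready pod --selector=app=rabbitmq-ha --timeout=10m
kubectl -n prelude wait --for=condition=Ready pod --selector=app=ebs-app --timeout=10m
seq 0 2 | xargs -n 1 -I {} kubectl exec -n prelude rabbitmq-ha-{} -- bash -c "rabbitmqctl cluster_status"

All three cluster members should again appear under "running_nodes" on every node.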