One Aria Automation node from a 3 node cluster is down/unavailable and Provisioning is not functioning

Article ID: 377795

Updated On:

Products

VMware Aria Suite

Issue/Introduction

Symptoms:

  • One Aria Automation node is down or unavailable due to infrastructure issues.
  • The Aria Automation portal is accessible.
  • VM provisioning takes a long time and eventually fails with errors about event topics, e.g.: "Failed to publish event to topic: Deployment requested"
  • Reviewing the Aria Automation services with the command "kubectl -n prelude get pods -o wide" shows that only the pods from one node are down.
  • Reviewing the RabbitMQ status with the command below shows only 1 node as active (Ref: Resolve RabbitMQ cluster issues in vRA 8.x deployment):
    seq 0 2 | xargs -n 1 -I {} kubectl exec -n prelude rabbitmq-ha-{} -- bash -c "rabbitmqctl cluster_status"
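
The pod check described in the symptoms can be narrowed to the pods that look unhealthy. The following is a minimal sketch, not part of the official procedure: the function name unhealthy_pods is hypothetical, and it assumes the default "kubectl get pods" column layout, where STATUS is the third column. It does not inspect the READY column, so a pod that is Running but not ready is not flagged.

```shell
# Hypothetical helper: keep the header row plus every pod whose STATUS
# (assumed to be column 3 of "kubectl get pods -o wide") is neither
# Running nor Completed.
unhealthy_pods() {
  awk 'NR == 1 || ($3 != "Running" && $3 != "Completed")'
}

# Usage (the helper reads the pod listing from stdin):
#   kubectl -n prelude get pods -o wide | unhealthy_pods
```

Pods scheduled on the unavailable node may still be listed as Running for some time, so the NODE column of the full listing remains the more reliable indicator.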

Environment

Aria Automation 8.x

Cause

Due to RabbitMQ isolation, a second RabbitMQ service was stopped. As a result, the last working RabbitMQ service stopped handling any messages.

Resolution

Before proceeding, take a snapshot, including memory, of the 2 available nodes in vCenter.

  1. Identify the currently running RabbitMQ nodes:
    kubectl -n prelude get pods -o wide | grep -Ei "name|rabbitmq"
  2. Identify which pods are currently running the RabbitMQ application:
    seq 0 2 | xargs -n 1 -I {} kubectl exec -n prelude rabbitmq-ha-{} -- bash -c "rabbitmqctl cluster_status"

    E.g.: Only "rabbitmq-ha-0" is active. Depending on which node is down, the RabbitMQ application on the other pod ("rabbitmq-ha-1" or "rabbitmq-ha-2") has to be started.
  3. Try to start the RabbitMQ application on the node that is available but not listed under "Running Nodes" in the output of the above command.

    E.g.: Node 3 of the Aria Automation cluster has the outage; Node 2 is running, but its RabbitMQ application is not reported as running:
    kubectl exec -n prelude rabbitmq-ha-1 -- bash -c "rabbitmqctl start_app"
  4. Validate that 2 nodes are now reported as running, using the same command as in Step 2:
    seq 0 2 | xargs -n 1 -I {} kubectl exec -n prelude rabbitmq-ha-{} -- bash -c "rabbitmqctl cluster_status"
  5. Validate that provisioning now proceeds by creating a new request in the Aria Automation portal.
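
The status check from Steps 2 and 4 and the start command from Step 3 can be wrapped in small shell helpers. This is a sketch under stated assumptions, not an official tool: the function names rabbitmq_cluster_report and rabbitmq_start_app are hypothetical, while the namespace, pod names, and rabbitmqctl commands are the ones used in the steps above. The report labels each pod's output and continues when a pod cannot be reached, which is expected for the pod on the failed node.

```shell
# Namespace used throughout this article.
NAMESPACE="prelude"

# Hypothetical helper for Steps 2 and 4: run cluster_status in each
# RabbitMQ pod, label the output, and keep going if a pod is unreachable.
rabbitmq_cluster_report() {
  local i pod
  for i in 0 1 2; do
    pod="rabbitmq-ha-${i}"
    echo "=== ${pod} ==="
    if ! kubectl exec -n "${NAMESPACE}" "${pod}" -- bash -c "rabbitmqctl cluster_status"; then
      echo "UNREACHABLE: ${pod}"
    fi
  done
}

# Hypothetical helper for Step 3: start the RabbitMQ application in one pod.
# The argument is the pod index (0, 1 or 2); anything else is rejected.
rabbitmq_start_app() {
  case "$1" in
    0|1|2) ;;
    *) echo "usage: rabbitmq_start_app <0|1|2>" >&2; return 2 ;;
  esac
  kubectl exec -n "${NAMESPACE}" "rabbitmq-ha-$1" -- bash -c "rabbitmqctl start_app"
}
```

For the example in Step 3, the sequence would be rabbitmq_cluster_report, then rabbitmq_start_app 1, then rabbitmq_cluster_report again to confirm that 2 nodes are reported as running.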