Resolving RabbitMQ Cluster Join Failure After Replacing Node with a New Host

Products

VMware Tanzu RabbitMQ, VMware RabbitMQ, Pivotal RabbitMQ

Issue/Introduction

In some scenarios, replacing a Docker-based RabbitMQ node with a new host-based node of the same name can cause the new node to fail to join the cluster. The join may fail even after the usual reset and join procedure, although repeating the procedure eventually succeeds.

Cause

The issue occurs due to several factors:

Erlang Cookie Mismatch: The .erlang.cookie file must be identical on every node for the nodes in a RabbitMQ cluster to communicate. If the cookie on the new node does not match the cookie used by the existing cluster nodes, the join operation fails.
DNS Resolution: While the DNS is updated to resolve the old node's name (e.g., Node A) to the new host IP, there might be timing or propagation issues, which prevent proper DNS resolution at the time of node join.
Cluster Syncing Delay: After resetting the node and attempting to join the cluster, there can be delays or issues with the synchronization of state between the nodes, especially if certain configurations are cached or if the cluster is not fully stable.
RabbitMQ Service Initialization: The RabbitMQ service may not be fully initialized or properly configured to allow for a clean reset and join operation, requiring multiple restart attempts.
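
The first two causes can be checked before a join attempt. The following is a minimal sketch, not a definitive procedure. The function names (cookie_fingerprint, cookies_match, resolve_host) are hypothetical helpers introduced for illustration, and the sketch assumes a Linux host where sha256sum and getent are available.

```shell
#!/usr/bin/env bash
# Pre-flight checks before a join attempt (illustrative sketch).
# All function names here are hypothetical helpers, not RabbitMQ tooling.

# Print a SHA-256 fingerprint of a cookie file, so that cookies on two
# nodes can be compared without displaying the cookie itself.
cookie_fingerprint() {
  local cookie_file="$1"
  sha256sum "$cookie_file" | awk '{print $1}'
}

# Succeed only if two local cookie files have identical contents.
cookies_match() {
  [ "$(cookie_fingerprint "$1")" = "$(cookie_fingerprint "$2")" ]
}

# Print the first address the local resolver returns for a host name.
# Returns non-zero if the name does not resolve.
resolve_host() {
  local out
  out="$(getent hosts "$1")" || return 1
  printf '%s\n' "$out" | awk 'NR==1 {print $1}'
}
```

For example, running cookie_fingerprint /var/lib/rabbitmq/.erlang.cookie on Node A and on Node B (as a user that can read the file) should print the same value on both, and resolve_host with Node A's host name, run on Node B, should print the new host IP rather than the old one.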

Resolution

To get the new node to join the cluster, follow these steps:

Stop RabbitMQ on the New Node:

sudo systemctl stop rabbitmq-server
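Reset the RabbitMQ Node: Reset the new node to ensure it starts fresh. Note that rabbitmqctl reset contacts a running Erlang node whose RabbitMQ application has been stopped with rabbitmqctl stop_app. If the command reports that the node is down, start the service and run sudo rabbitmqctl stop_app before resetting.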

sudo rabbitmqctl reset
Force Reset (if necessary): If the reset does not work, attempt a forced reset to ensure the node is completely reset.

sudo rabbitmqctl force_reset
Ensure Erlang Cookie Sync: Ensure that the .erlang.cookie file is consistent across all nodes. Copy the .erlang.cookie from a working node (e.g., Node B) to the new node (Node A) if necessary:

sudo cp /var/lib/rabbitmq/.erlang.cookie /path/to/new/node/.erlang.cookie
sudo chown rabbitmq:rabbitmq /path/to/new/node/.erlang.cookie
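Enable and Start RabbitMQ on the New Node: Make sure RabbitMQ is enabled to start automatically on boot and start the service, then stop the RabbitMQ application (the Erlang node keeps running) so that the node can join the cluster:

sudo systemctl enable --now rabbitmq-server
sudo rabbitmqctl stop_app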
Join the Node to the Cluster: Attempt to join the node to the cluster after ensuring the Erlang cookie is set correctly:

sudo rabbitmqctl join_cluster rabbit@<Cluster Node Name>

Replace <Cluster Node Name> with the hostname of an existing cluster node (e.g., Node B), exactly as it appears in that node's name in the rabbitmqctl cluster_status output.
Restart the Node: Reboot the new node to ensure the RabbitMQ service is fully initialized.

sudo reboot
Check Node Status: Verify the status of the node and ensure it is now part of the cluster.

sudo rabbitmqctl status
sudo rabbitmqctl cluster_status
Start RabbitMQ on the New Node: Once the node successfully joins the cluster, start the RabbitMQ application.

sudo rabbitmqctl start_app

Repeat the above steps as needed, especially the reset and join operations, until the node successfully joins the cluster.
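
Because the procedure may need to be repeated, the reset and join steps can be wrapped in a bounded retry. The sketch below is one possible way to do this under stated assumptions, not the documented method. The function names retry and reset_and_join are hypothetical, and the pass uses the stop_app, reset, join_cluster, start_app sequence, which assumes the rabbitmq-server service is already running on the new node.

```shell
#!/usr/bin/env bash
# Bounded retry around one reset-and-join pass (illustrative sketch).
# retry and reset_and_join are hypothetical helper names.

# Run a command up to max_attempts times, sleeping delay_seconds
# between failed attempts. Returns 0 on the first success.
retry() {
  local max_attempts="$1" delay_seconds="$2"
  shift 2
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay_seconds}s" >&2
    attempt=$((attempt + 1))
    sleep "$delay_seconds"
  done
}

# One reset-and-join pass against an existing cluster node.
# Argument: full node name of a cluster member, e.g. rabbit@<hostname>.
# Assumes the rabbitmq-server service is already running on this host.
reset_and_join() {
  local cluster_node="$1"
  sudo rabbitmqctl stop_app &&
    sudo rabbitmqctl reset &&
    sudo rabbitmqctl join_cluster "$cluster_node" &&
    sudo rabbitmqctl start_app
}
```

A call such as retry 5 10 reset_and_join rabbit@node-b would make up to five attempts, ten seconds apart. The node name rabbit@node-b is a placeholder for an existing cluster member such as Node B. A fixed attempt limit is used so that a persistent problem, such as a cookie mismatch, surfaces as a failure instead of looping indefinitely.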