In a 3-node RabbitMQ cluster deployed as a RabbitMQ for Tanzu VMs on-demand service instance, one BOSH instance loses its persistent disk. However, even after the persistent disk reference is removed from BOSH and the instance is recreated with a new persistent disk, the rabbitmq-server job on the new instance still fails to start, and the following errors are observed in its logs:
2023-04-18 08:53:07.181272+00:00 [error] <0.223.0> BOOT FAILED
2023-04-18 08:53:07.181272+00:00 [error] <0.223.0> ===========
2023-04-18 08:53:07.181272+00:00 [error] <0.223.0> Error during startup: {error,previous_upgrade_failed}
2023-04-18 08:53:07.181272+00:00 [error] <0.223.0>
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> crasher:
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> initial call: application_master:init/4
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> pid: <0.222.0>
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> registered_name: []
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> exception exit: {previous_upgrade_failed,{rabbit,start,[normal,[]]}}
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> in function application_master:init/4 (application_master.erl, line 142)
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> ancestors: [<0.221.0>]
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> message_queue_len: 1
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> messages: [{'EXIT',<0.223.0>,normal}]
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> links: [<0.221.0>,<0.44.0>]
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> dictionary: []
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> trap_exit: true
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> status: running
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> heap_size: 233
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> stack_size: 28
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> reductions: 160
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> neighbours:
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0>
2023-04-18 08:53:08.182940+00:00 [notice] <0.44.0> Application rabbit exited with reason: {previous_upgrade_failed,{rabbit,start,[normal,[]]}}
2023-04-18 08:53:17.506658+00:00 [notice] <0.223.0> Logging: configured log handlers are now ACTIVE
2023-04-18 08:53:19.192713+00:00 [error] <0.223.0> Found lock file at /var/vcap/store/rabbitmq/mnesia/db/schema_upgrade_lock.
2023-04-18 08:53:19.192713+00:00 [error] <0.223.0> Either previous upgrade is in progress or has failed.
2023-04-18 08:53:19.192713+00:00 [error] <0.223.0> Database backup path: /var/vcap/store/rabbitmq/mnesia/db-upgrade-backup
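The key lines are the last ones: RabbitMQ refuses to boot because it finds a stale schema_upgrade_lock file in its Mnesia directory, left over from an upgrade attempt that was interrupted or failed. One way to confirm this on the failing VM, assuming the usual on-demand deployment naming (the deployment name and instance index below are placeholders):

# SSH into the failing RabbitMQ VM through BOSH
bosh -d service-instance_<guid> ssh rabbitmq-server/<failing-index>
sudo -i
# Path taken from the error message above
ls -l /var/vcap/store/rabbitmq/mnesia/db/schema_upgrade_lock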
Neither the upgrade-all-service-instances errand nor the cf upgrade-service command resolves the issue. The solution is to remove the failing node from the cluster and manually add it back, as sketched below.
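The exact recovery steps depend on the environment, but a rough sketch of the remove-and-rejoin procedure follows. The deployment name, instance indexes, and node name are placeholders, and the sketch assumes the BOSH-deployed rabbitmq-server job re-clusters a node with empty state when it starts; verify against the documentation for your tile version before running this in production.

# 1. On a healthy node, remove the failed member from the cluster.
#    (rabbitmqctl may need its full path under /var/vcap/packages on BOSH VMs.)
bosh -d service-instance_<guid> ssh rabbitmq-server/0
sudo -i
rabbitmqctl cluster_status                          # note the failed node's name
rabbitmqctl forget_cluster_node rabbit@<failed-node-name>

# 2. On the failing node, stop the job and clear its local state,
#    which includes the stale schema_upgrade_lock.
bosh -d service-instance_<guid> ssh rabbitmq-server/<failing-index>
sudo -i
monit stop rabbitmq-server
rm -rf /var/vcap/store/rabbitmq/mnesia/*

# 3. Start the job again; the node boots with empty state and rejoins
#    the cluster through the job's configured clustering (assumption above).
monit start rabbitmq-server

# 4. From any node, confirm all three members are listed as running.
rabbitmqctl cluster_status

Wiping the Mnesia directory on the recreated node only discards the partial state written by the failed boot attempts; the node's original data was already lost with the old persistent disk.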