In a 3-node RabbitMQ cluster deployed as a RabbitMQ for Tanzu VMs on-demand service instance, one BOSH instance loses its persistent disk. However, even after the persistent disk reference is removed from BOSH and the instance is recreated with a new persistent disk, the rabbitmq-server job on the new instance still fails to start, and the following errors are observed in its logs:
2023-04-18 08:53:07.181272+00:00 [error] <0.223.0> BOOT FAILED
2023-04-18 08:53:07.181272+00:00 [error] <0.223.0> ===========
2023-04-18 08:53:07.181272+00:00 [error] <0.223.0> Error during startup: {error,previous_upgrade_failed}
2023-04-18 08:53:07.181272+00:00 [error] <0.223.0>
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> crasher:
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> initial call: application_master:init/4
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> pid: <0.222.0>
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> registered_name: []
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> exception exit: {previous_upgrade_failed,{rabbit,start,[normal,[]]}}
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> in function application_master:init/4 (application_master.erl, line 142)
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> ancestors: [<0.221.0>]
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> message_queue_len: 1
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> messages: [{'EXIT',<0.223.0>,normal}]
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> links: [<0.221.0>,<0.44.0>]
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> dictionary: []
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> trap_exit: true
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> status: running
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> heap_size: 233
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> stack_size: 28
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> reductions: 160
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0> neighbours:
2023-04-18 08:53:08.182577+00:00 [error] <0.222.0>
2023-04-18 08:53:08.182940+00:00 [notice] <0.44.0> Application rabbit exited with reason: {previous_upgrade_failed,{rabbit,start,[normal,[]]}}
2023-04-18 08:53:17.506658+00:00 [notice] <0.223.0> Logging: configured log handlers are now ACTIVE
2023-04-18 08:53:19.192713+00:00 [error] <0.223.0> Found lock file at /var/vcap/store/rabbitmq/mnesia/db/schema_upgrade_lock.
2023-04-18 08:53:19.192713+00:00 [error] <0.223.0> Either previous upgrade is in progress or has failed.
2023-04-18 08:53:19.192713+00:00 [error] <0.223.0> Database backup path: /var/vcap/store/rabbitmq/mnesia/db-upgrade-backup
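The key lines are the last ones: RabbitMQ refuses to boot because it finds a stale schema_upgrade_lock file in its Mnesia directory, left over from an upgrade attempt that was interrupted or failed. One way to confirm this on the failing VM, assuming the usual on-demand deployment naming (the deployment name and instance index below are placeholders):

# SSH into the failing RabbitMQ VM through BOSH
bosh -d service-instance_<guid> ssh rabbitmq-server/<failing-index>
sudo -i
# Path taken from the error message above
ls -l /var/vcap/store/rabbitmq/mnesia/db/schema_upgrade_lock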
Neither the upgrade-all-service-instances errand nor the cf upgrade-service command resolves the issue. The solution is to remove the failing node from the cluster and manually add it back, as sketched below.
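The exact recovery steps depend on the environment, but a rough sketch of the remove-and-rejoin procedure follows. The deployment name, instance indexes, and node name are placeholders, and the sketch assumes the BOSH-deployed rabbitmq-server job re-clusters a node with empty state when it starts; verify against the documentation for your tile version before running this in production.

# 1. On a healthy node, remove the failed member from the cluster.
#    (rabbitmqctl may need its full path under /var/vcap/packages on BOSH VMs.)
bosh -d service-instance_<guid> ssh rabbitmq-server/0
sudo -i
rabbitmqctl cluster_status                          # note the failed node's name
rabbitmqctl forget_cluster_node rabbit@<failed-node-name>

# 2. On the failing node, stop the job and clear its local state,
#    which includes the stale schema_upgrade_lock.
bosh -d service-instance_<guid> ssh rabbitmq-server/<failing-index>
sudo -i
monit stop rabbitmq-server
rm -rf /var/vcap/store/rabbitmq/mnesia/*

# 3. Start the job again; the node boots with empty state and rejoins
#    the cluster through the job's configured clustering (assumption above).
monit start rabbitmq-server

# 4. From any node, confirm all three members are listed as running.
rabbitmqctl cluster_status

Wiping the Mnesia directory on the recreated node only discards the partial state written by the failed boot attempts; the node's original data was already lost with the old persistent disk.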