Upgrade to RabbitMQ for PCF tile 1.13 fails with lock against the Mnesia DB

Article ID: 293173


Products

VMware RabbitMQ

Issue/Introduction

Symptoms:
This issue is seen when upgrading the RabbitMQ tile from version 1.12 to 1.13.

The documentation at https://docs.pivotal.io/rabbitmq-cf/1-13/upgrade.html#-upgrade-the-rabbitmq-for-pcf-pre-provisioned-service states that you should use the BOSH CLI to stop all but one of the RabbitMQ server nodes before upgrading.

In some cases, the RabbitMQ server nodes are not stopped prior to upgrading.
As a result, the tile upgrade fails.

In the output of bosh vms, only one node appears to be failing. In fact, all nodes have failed.

bosh -d test vms
Deployment 'test'

Instance                                              Process State  AZ              IPs        VM CID                                   VM Type    Active
rabbitmq-server/3e6f6338-3f28-44d9-8dc4-e19ebabc378a  failing        europe-west1-d  10.0.8.5   vm-373508de-f450-4ac0-7c60-97a0dd21c471  micro.cpu  true
rabbitmq-server/51bd76c2-b1d0-4614-84a2-a61ff5269ec7  running        europe-west1-b  10.0.8.14  vm-8419501d-166a-4008-67da-4af31801a2fc  micro.cpu  true
rabbitmq-server/f5b81c78-a652-4b7a-b4ca-947b606b3463  running        europe-west1-c  10.0.8.15  vm-7ae81aac-dc86-4b2e-6d77-234b121cfb34  micro.cpu  true

Environment


Cause

  • From the RabbitMQ server log on the failing node (/var/vcap/sys/log/rabbitmq-server/rabbit@****.log), there is a lock against the Mnesia DB:

2019-01-30 10:37:05.939 [error] <0.5.0> Cluster upgrade needed but other disc nodes shut down after this one.
Please first start the last disc node to shut down.


Note: if several disc nodes were shut down simultaneously they may all show this message. In which case, remove the lock file on one of them and start that node. The lock file on this node is:
 /var/vcap/store/rabbitmq/mnesia/db/nodes_running_at_shutdown
2019-01-30 10:37:05.940 [error] <0.5.0>
Error description:
    init:do_boot/3
    init:start_em/1
    rabbit:start_it/1 line 461
    rabbit:'-boot/0-fun-0-'/0 line 307
    rabbit_upgrade:run_mnesia_upgrades/2 line 155
    rabbit_upgrade:die/2 line 212
throw:{upgrade_error,"\n\n****\n\nCluster upgrade needed but other disc nodes shut down after this one.\nPlease first start the last disc node to shut down.\n\nNote: if several disc nodes were shut down simultaneously they may all\nshow this message. In which case, remove the lock file on one of them and\nstart that node. The lock file on this node is:\n\n /var/vcap/store/rabbitmq/mnesia/db/nodes_running_at_shutdown \n\n****\n\n\n"}
  • On a node that bosh vms reports as running, the output of rabbitmqctl status shows that the RabbitMQ application is not actually running:
$ rabbitmqctl status
Status of node rabbit@1b1ff0702e388b1535ce74c80b6c3df8
[{pid,10045},
{running_applications,[{ranch,"Socket acceptor pool for TCP protocols.",
                              "1.3.2"},
                       {ssl,"Erlang/OTP SSL application","8.2.6.4"},
                       {public_key,"Public key infrastructure","1.5.2"},
                       {asn1,"The Erlang ASN1 compiler version 5.0.5.2",
                             "5.0.5.2"},
                       {crypto,"CRYPTO","4.2.2.2"},
                       {compiler,"ERTS  CXC 138 10","7.1.5.2"},
                       {recon,"Diagnostic tools for production use","2.3.2"},
                       {xmerl,"XML parser","1.3.16.1"},
                       {inets,"INETS  CXC 138 49","6.5.2.4"},
                       {syntax_tools,"Syntax tools","2.1.4.1"},
                       {sasl,"SASL  CXC 138 11","3.1.2"},
                       {stdlib,"ERTS  CXC 138 10","3.4.5.1"},
                       {kernel,"ERTS  CXC 138 10","5.4.3.2"}]},
The rabbit application is missing from the running_applications list, which confirms that RabbitMQ is down even though the Erlang node is up. A quick check is sketched below.
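As a minimal sketch (the grep pattern and the rabbit:is_running/0 call are assumptions about this RabbitMQ 3.6.x environment, not commands from the original article), the following can confirm whether the broker application is running on a node:

# The rabbit application is absent from running_applications when the broker is down
rabbitmqctl status | grep -c '{rabbit,'        # prints 0 when the broker application is stopped
# Equivalent check via the Erlang node (assumed available on 3.6.x)
rabbitmqctl eval 'rabbit:is_running().'        # returns false when the broker application is stopped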

Resolution

Follow the steps below to recover the RabbitMQ cluster and then continue with the upgrade by following the documentation: https://docs.pivotal.io/rabbitmq-cf/1-13/upgrade.html#-upgrade-the-rabbitmq-for-pcf-pre-provisioned-service

1) SSH to the failing node. The Erlang (beam) process must be stopped on this node. Because it is managed by the rabbitmq-server and service-metrics monit jobs, those jobs must be unmonitored first. A sketch of these commands follows this list.
  • Run monit summary to confirm that the rabbitmq-server process is not running.
  • Run:
    • monit unmonitor rabbitmq-server
    • monit unmonitor service-metrics
  • Run watch pgrep beam. If a beam process still exists, use the kill command to stop it.
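A minimal sketch of step 1 is shown below. The deployment name test and the instance GUID are taken from the bosh vms output above; adjust them for your environment.

# SSH to the failing node and become root
bosh -d test ssh rabbitmq-server/3e6f6338-3f28-44d9-8dc4-e19ebabc378a
sudo -i

# Confirm the process state and stop monit from restarting the jobs
monit summary
monit unmonitor rabbitmq-server
monit unmonitor service-metrics

# Check for a leftover Erlang (beam) process and stop it if one remains
watch pgrep beam
kill <beam-pid>        # only if pgrep still shows a beam process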
2) Go to each running node and start the RabbitMQ application. A sketch of these commands follows this list.
  • If the partition handling strategy is set to pause_minority, run rabbitmqctl start_app on all running nodes at roughly the same time, so that a majority of nodes comes back together.
  • If the partition handling strategy is set to autoheal, rabbitmqctl start_app can be run on the running nodes one after the other.
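A minimal sketch of step 2 follows. Querying the strategy through rabbitmqctl eval is an assumption (any method of checking the cluster_partition_handling setting works); rabbitmqctl start_app is the command referenced above.

# Check which partition handling strategy is configured (run on any node)
rabbitmqctl eval 'application:get_env(rabbit, cluster_partition_handling).'

# On each running node: start the RabbitMQ application.
# pause_minority: run this on all running nodes at roughly the same time.
# autoheal: running it on one node after another is sufficient.
rabbitmqctl start_app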
3) Go to each healthy node and run rabbitmqctl status. It will now show RabbitMQ in a running state:
rabbitmqctl status
Status of node rabbit@1b1ff0702e388b1535ce74c80b6c3df8
[{pid,10045},
 {running_applications,
     [{rabbitmq_federation_management,"RabbitMQ Federation Management",
          "3.6.16"},
      {rabbitmq_shovel_management,
          "Management extension for the Shovel plugin","3.6.16"},
      {rabbitmq_management,"RabbitMQ Management Console","3.6.16"},
      {rabbitmq_management_agent,"RabbitMQ Management Agent","3.6.16"},
   
 
The cluster has now been restored to its previous state, and you can continue with upgrading the tile to 1.13.
Note: The failing node(s) will continue to show in a failed state.
Before upgrading, it is recommended to bosh stop the failing node(s) and all other rabbitmq-server nodes except one, as shown in the example below.
Upgrade documentation is located here: https://docs.pivotal.io/rabbitmq-cf/1-13/upgrade.html#-upgrade-the-rabbitmq-for-pcf-pre-provisioned-service
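For reference, a minimal sketch of the recommended bosh stop step, using the deployment name and instance GUIDs from the bosh vms output above (placeholders for your environment):

# Stop all but one rabbitmq-server node before upgrading the tile
bosh -d test stop rabbitmq-server/3e6f6338-3f28-44d9-8dc4-e19ebabc378a
bosh -d test stop rabbitmq-server/51bd76c2-b1d0-4614-84a2-a61ff5269ec7
# Leave rabbitmq-server/f5b81c78-a652-4b7a-b4ca-947b606b3463 running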