The following error can be seen when upgrading RabbitMQ:
Error 450001: Action Failed get_task: Task 90d658cd-da6c-4e19-40b9-6cc0985ce444 result:
Unmounting persistent disk: Running command: 'umount /dev/sdc1',
stdout: '',
stderr: 'umount: /var/vcap/store: device is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))
': exit status 1
Task 6 error
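The error text itself suggests the first diagnostic step: find out which processes are holding /var/vcap/store open. As root on the affected VM (and assuming lsof and fuser are installed, as the error message implies), you can run:

lsof /var/vcap/store        # list processes with open files under the mount point
fuser -vm /var/vcap/store   # alternative view: processes using the mounted filesystem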
A RabbitMQ node in a cluster can be in a number of states, and some of these states are known to cause upgrades to fail with the "device is busy" error above.
The first thing to do when considering an upgrade of a RabbitMQ deployment is to ascertain the state of each RabbitMQ node in the cluster. To upgrade a Rabbit cluster successfully, the operator should ensure that every node in the cluster is in an upgradeable state. So there are two areas to examine:
1. How to identify which state a given node is in
There are two tools which are invaluable in discerning the state of a given RabbitMQ server node. Both require that you bosh ssh onto the node in question and sudo su - to get system privileges. The incantations are:

watch monit status

PATH=$PATH:/var/vcap/packages/erlang/bin /var/vcap/packages/rabbitmq-server/bin/rabbitmqctl status
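For example, a complete session against a single node might look like the following. The job name rmq_z0/0 is illustrative (it matches the examples later in this article), and the exact bosh ssh syntax depends on your BOSH CLI version:

bosh ssh rmq_z0 0     # log on to the node in question
sudo su -             # get system privileges
watch monit status
PATH=$PATH:/var/vcap/packages/erlang/bin /var/vcap/packages/rabbitmq-server/bin/rabbitmqctl status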
In order to determine the state of a given RabbitMQ server node, run the incantations above and cross-reference the results below. The following two states are the known states from which an upgrade will always succeed. If you see one of these states and the upgrade still fails, please let us know by opening a support request and collecting logs from all RabbitMQ node VMs.
Monit state: Running - RabbitMQ: Up
Monit state: Running - RabbitMQ: Clusterer
The following are known bad states, and upgrades from them will fail. Steps need to be taken to get such nodes into a good state.

Monit state: Trying - RabbitMQ: Up
Monit state: Stopped - RabbitMQ: Down
Monit state: BoshStopped - RabbitMQ: Down
To maximise the chances of a successful upgrade, all rabbitmq-server nodes in a given cluster should be in either state Running/Up or state Running/Clusterer. It is perfectly fine for a cluster to contain some nodes in state Running/Up and others in state Running/Clusterer. The important thing is that no node in the cluster should be in any other state. If you do find that another state is possible, please record it so that engineering can investigate, and then proceed to get that node into a known good state.
2. For each possible state, how to move into an upgradeable state
2.1 Monit state: Trying - RabbitMQ: Up
In this state, Rabbit is running in an Erlang VM, but monit has lost track of it. Monit knows that Rabbit should be running, so it runs Rabbit’s start script. This fails, because Rabbit has already started. Monit tries again, and keeps trying forever.
To move from this state to state Running/Clusterer, all we have to do is stop Rabbit. This will put the node in a state where monit is able to bring Rabbit back up successfully and resume control. To achieve this, run:

/var/vcap/jobs/rabbitmq-server/bin/rabbitmq-server.init stop
This should exit 0, and the node should enter the state Running/Clusterer.
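Putting it together, a minimal remediation session on the affected node (run as root, using the paths above) might be:

/var/vcap/jobs/rabbitmq-server/bin/rabbitmq-server.init stop
echo $?               # expect 0
watch monit status    # wait until: status running / monitoring status monitored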
2.2 Monit state: Stopped - RabbitMQ: Down
In this state monit believes Rabbit to be deliberately down, and Rabbit is, in fact, down.
From here it is possible to move to Running/Clusterer (which is a good state) by running the following from any machine that is targeting the BOSH director:

bosh start $JOB_NAME

where $JOB_NAME is the name of the BOSH job that corresponds to this Rabbit server node, for example rmq_z0/0.
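As a concrete (illustrative) example with the BOSH CLI of this article's era, which accepts the job name and index as separate arguments:

bosh start rmq_z0 0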
2.3 Monit state: BoshStopped - RabbitMQ: Down
This state means that BOSH has attempted to stop all the services and has removed all the monit spec files. The rabbitmq-server service will not be displayed in a monit status. The Erlang VM is running, but it reports neither the clusterer nor Rabbit as running.
From here we can get to Running/Clusterer (which is a good state) by running the following on a machine that is targeting your BOSH director:

bosh start $JOB_NAME

where $JOB_NAME is the name of the BOSH job that corresponds to this Rabbit server node, for example rmq_z0/0.
3. How to determine the state of the RabbitMQ system
3.1 Monit States
To determine the state of a node, the command that needs to be invoked is as follows:

watch monit status
For those who have used monit before: notice that we require the full information from monit status; a monit summary is not enough. The key here is that we need both the status and the monitoring status of our rabbitmq-server process. Sometimes PID-related information can also be reassuring, but it is possible to distinguish between all our states without it. For ease of reference, the status and monitoring status lines are the key lines to look for in each of the outputs below.
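If you want just those two key lines without the rest of the output, a simple grep works. This is a convenience sketch of ours, not part of the official procedure:

monit status | grep -A 2 "Process 'rabbitmq-server'"

Note that in the BOSH Stopped state described below this prints nothing at all, which is itself diagnostic.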
3.1.1 Running
This state means that Monit (and BOSH) believes that it is monitoring RabbitMQ-server and that it is running.
Process 'rabbitmq-server'
  status                            running
  monitoring status                 monitored
  pid                               21077
  parent pid                        21075
  uptime                            8m
  children                          1
  memory kilobytes                  1680
  memory kilobytes total            96864
  memory percent                    0.0%
  memory percent total              1.1%
  cpu percent                       0.0%
  cpu percent total                 0.4%
  data collected                    Thu Jul 7 10:33:34 2016
3.1.2 Trying
This state means that Monit is attempting to monitor RabbitMQ-server and it believes that RabbitMQ-server is not running. It is attempting to launch RabbitMQ-server, but it is failing to do so, and it cycles between these two outputs every 30 seconds or so.
Process 'rabbitmq-server'
  status                            not monitored
  monitoring status                 not monitored
  data collected                    Wed Jul 6 14:29:46 2016

Process 'rabbitmq-server'
  status                            Execution failed
  monitoring status                 monitored
  data collected                    Wed Jul 6 14:29:46 2016
3.1.3 Stopped
This state means that Monit is not monitoring RabbitMQ-server and that it believes that it’s not running.
Process 'rabbitmq-server'
  status                            not monitored
  monitoring status                 not monitored
  data collected                    Wed Jul 6 14:28:26 2016
3.1.4 Failed
This state means that Monit is monitoring RabbitMQ-server but it is not able to discern whether or not it is successfully running.
Process 'rabbitmq-server'
  status                            Execution failed
  monitoring status                 monitored
  pid                               5286
  parent pid                        1
  uptime                            21h 50m
  children                          0
  memory kilobytes                  105400
  memory kilobytes total            105400
  memory percent                    1.2%
  memory percent total              1.2%
  cpu percent                       0.4%
  cpu percent total                 0.4%
  data collected                    Wed Jul 6 14:39:46 2016
3.1.5 BOSH Stopped
This state occurs when the node has been stopped via bosh stop. This removes all traces of the monit spec files related to the rabbitmq-server service, so Monit does not even know of its existence: the process is not monitored, and Monit has no knowledge of the state of rabbitmq-server.
/var/vcap/bosh/etc/monitrc:8: Warning: include files not found '/var/vcap/monit/job/*.monitrc'
The Monit daemon 5.2.5 uptime: 0m

System 'system_localhost'
  status                            running
  monitoring status                 monitored
  load average                      [0.28] [0.26] [0.28]
  cpu                               0.5%us 0.5%sy 0.0%wa
  memory usage                      135736 kB [1.6%]
  swap usage                        0 kB [0.0%]
  data collected                    Thu Jul 7 10:38:50 2016
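Given the warning in that output, a quick way to confirm this state is to check whether the monit include files have been removed; the path comes straight from the monitrc warning above:

ls /var/vcap/monit/job/*.monitrc 2>/dev/null || echo 'no monit job specs: node was stopped via bosh stop'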
3.2 Rabbit States
This is separate from the Monit state because the Erlang VM has its own opinion of what it means for RabbitMQ to be running.
To determine the state of a node, the command that needs to be invoked is as follows:
PATH=$PATH:/var/vcap/packages/erlang/bin /var/vcap/packages/rabbitmq-server/bin/rabbitmqctl status
There are two key lines we are looking for in all these command outputs:
{rabbit,"RabbitMQ","3.6.2"},
[{rabbitmq_clusterer,"Declarative RabbitMQ clustering",[]},
It is never the case that Rabbit runs without the clusterer. So there are three possible states: both Rabbit and the clusterer are running; only the clusterer is running; neither Rabbit nor the clusterer are running.
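Because those two key lines fully determine the Rabbit state, you can classify a node with a small shell helper. This is a sketch of our own (the rabbit_state function name is made up), assuming the same PATH and binary locations as above:

# Classify the Rabbit state of this node: Up, Clusterer, or Down
rabbit_state() {
  # Run rabbitmqctl exactly as above; capture stderr too, so the nodedown
  # diagnostics are included in the output we inspect.
  local out
  out=$(PATH=$PATH:/var/vcap/packages/erlang/bin \
        /var/vcap/packages/rabbitmq-server/bin/rabbitmqctl status 2>&1)
  if echo "$out" | grep -q '{rabbit,"RabbitMQ"'; then
    echo Up          # Rabbit is running (the clusterer always runs alongside it)
  elif echo "$out" | grep -q 'rabbitmq_clusterer'; then
    echo Clusterer   # only the clusterer is running
  else
    echo Down        # neither is running: rabbitmqctl reports nodedown
  fi
}
rabbit_state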
3.2.1 Up
This state means that this node is running both the “RabbitMQ” application and the clusterer. This can be seen in the running_applications section in the following output.
Status of node rabbit@6d627c95e340ee649ae3bc89c00730d1 ...
[{pid,25618},
 {running_applications,
     [{rabbitmq_management,"RabbitMQ Management Console","3.6.2"},
      {rabbitmq_management_agent,"RabbitMQ Management Agent","3.6.2"},
      {rabbit,"RabbitMQ","3.6.2"},
      {mnesia,"MNESIA CXC 138 12","4.13.1"},
      {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.6.2"},
      {webmachine,"webmachine","1.10.3"},
      {mochiweb,"MochiMedia Web Server","2.13.1"},
      {ssl,"Erlang/OTP SSL application","7.1"},
      {public_key,"Public key infrastructure","1.0.1"},
      {asn1,"The Erlang ASN1 compiler version 4.0","4.0"},
      {os_mon,"CPO CXC 138 46","2.4"},
      {compiler,"ERTS CXC 138 10","6.0.1"},
      {syntax_tools,"Syntax tools","1.7"},
      {amqp_client,"RabbitMQ AMQP Client","3.6.2"},
      {xmerl,"XML parser","1.3.8"},
      {inets,"INETS CXC 138 49","6.0.1"},
      {rabbit_common,[],"3.6.2"},
      {ranch,"Socket acceptor pool for TCP protocols.","1.2.1"},
      {crypto,"CRYPTO","3.6.1"},
      {rabbitmq_clusterer,"Declarative RabbitMQ clustering",[]},
      {sasl,"SASL CXC 138 11","2.6"},
      {stdlib,"ERTS CXC 138 10","2.6"},
      {kernel,"ERTS CXC 138 10","4.1"}]},
 {os,{unix,linux}},
 {erlang_version,
     "Erlang/OTP 18 [erts-7.1] [source] [64-bit] [smp:2:2] [async-threads:64] [hipe] [kernel-poll:true]\n"},
 {memory,
     [{total,84945384},
      {connection_readers,0},
      {connection_writers,0},
      {connection_channels,0},
      {connection_other,2808},
      {queue_procs,2808},
      {queue_slave_procs,0},
      {plugins,415792},
      {other_proc,18845480},
      {mnesia,70248},
      {mgmt_db,33752},
      {msg_index,41168},
      {other_ets,1568984},
      {binary,65128},
      {code,27810600},
      {atom,1000601},
      {other_system,35088015}]},
 {alarms,[]},
 {listeners,[{clustering,25672,"::"},{amqp,5672,"::"}]},
 {vm_memory_high_watermark,0.4},
 {vm_memory_limit,3348963328},
 {disk_free_limit,50000000},
 {disk_free,9850142720},
 {file_descriptors,
     [{total_limit,299900},
      {total_used,2},
      {sockets_limit,269908},
      {sockets_used,0}]},
 {processes,[{limit,1048576},{used,201}]},
 {run_queue,0},
 {uptime,362},
 {kernel,{net_ticktime,60}}]
3.2.2 Down
This state means that neither the RabbitMQ application nor the clusterer Erlang application is running. This node would be considered non-functional.
Status of node rabbit@6d627c95e340ee649ae3bc89c00730d1 ...
Error: unable to connect to node rabbit@6d627c95e340ee649ae3bc89c00730d1: nodedown

DIAGNOSTICS
===========

attempted to contact: [rabbit@6d627c95e340ee649ae3bc89c00730d1]

rabbit@6d627c95e340ee649ae3bc89c00730d1:
  * connected to epmd (port 4369) on 6d627c95e340ee649ae3bc89c00730d1
  * epmd reports: node 'rabbit' not running at all
                  other nodes on 6d627c95e340ee649ae3bc89c00730d1: ['rabbitmq-cli-73']
  * suggestion: start the node

current node details:
- node name: 'rabbitmq-cli-73@localhost'
- home dir: /var/vcap/store/rabbitmq
- cookie hash: vKwSjjYjOvnJCZmZWKceSg==
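The diagnostics above mention epmd. To double-check what epmd itself sees, you can ask it to list the Erlang nodes registered on the machine; the binary location is an assumption based on the erlang package path used earlier:

/var/vcap/packages/erlang/bin/epmd -names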
3.2.3 Clusterer
This state means that the RabbitMQ clusterer plugin is active and waiting for the other nodes of the cluster to come online. This node would be considered non-functional because the “RabbitMQ” application is not listed in the running_applications output.
Notice that while the clusterer is clearly running, the RabbitMQ application is entirely missing from the output.
Status of node rabbit@6d627c95e340ee649ae3bc89c00730d1 ...
[{pid,25618},
 {running_applications,
     [{rabbitmq_clusterer,"Declarative RabbitMQ clustering",[]},
      {sasl,"SASL CXC 138 11","2.6"},
      {stdlib,"ERTS CXC 138 10","2.6"},
      {kernel,"ERTS CXC 138 10","4.1"}]},
 {os,{unix,linux}},
 {erlang_version,
     "Erlang/OTP 18 [erts-7.1] [source] [64-bit] [smp:2:2] [async-threads:64] [hipe] [kernel-poll:true]\n"},
 {memory,
     [{total,58852704},
      {connection_readers,0},
      {connection_writers,0},
      {connection_channels,0},
      {connection_other,0},
      {queue_procs,0},
      {queue_slave_procs,0},
      {plugins,54000},
      {other_proc,19251416},
      {mnesia,0},
      {mgmt_db,0},
      {msg_index,0},
      {other_ets,400240},
      {binary,303936},
      {code,5610076},
      {atom,256313},
      {other_system,32976723}]},
 {alarms,[]},
 {listeners,[]},
 {processes,[{limit,1048576},{used,45}]},
 {run_queue,0},
 {uptime,44},
 {kernel,{net_ticktime,60}}]