VIO RabbitMQ not starting: could not write file


Article ID: 321844


Updated On:

Products

VMware Integrated OpenStack

Issue/Introduction

Symptoms:
  • When starting services, RabbitMQ fails to start
  • In the rabbitmq log, you see entries similar to:
root@photon-machine [ ~ ]# oslog rabbitmq1-rabbitmq-0
+ exec /docker-entrypoint.sh rabbitmq-server
2019-11-28 19:20:18.957 [info] <0.33.0> Application lager started on node
'rabbit@rabbitmq1-rabbitmq-0.rabbitmq1-dsv-59862a.openstack.svc.cluster.local'
2019-11-28 19:20:18.983 [error] <0.5.0>
Error description:
    init:do_boot/3
    init:start_em/1
    rabbit:start_it/1 line 478
    rabbit:'-boot/0-fun-0-'/0 line 329
    rabbit_node_monitor:prepare_cluster_status_files/0 line 129
    rabbit_node_monitor:write_cluster_status/1 line 148
throw:{error,{could_not_write_file,"/var/lib/rabbitmq/mnesia/rabbit@rabbitmq1-rabbitmq-0.rabbitmq1-dsv-59862a.openstack.svc.cluster.local/cluster_nodes.config",
                                   enospc}}
Log file(s) (may contain more information):
   <stdout>

BOOT FAILED
===========

Error description:
    init:do_boot/3
    init:start_em/1
    rabbit:start_it/1 line 478
    rabbit:'-boot/0-fun-0-'/0 line 329
    rabbit_node_monitor:prepare_cluster_status_files/0 line 129
    rabbit_node_monitor:write_cluster_status/1 line 148
throw:{error,{could_not_write_file,"/var/lib/rabbitmq/mnesia/rabbit@rabbitmq1-rabbitmq-0.rabbitmq1-dsv-59862a.openstack.svc.cluster.local/cluster_nodes.config",
                                   enospc}}
Log file(s) (may contain more information):
   <stdout>

{"init terminating in
do_boot",{error,{could_not_write_file,"/var/lib/rabbitmq/mnesia/rabbit@rabbitmq1-rabbitmq-0.rabbitmq1-dsv-59862a.openstack.svc.cluster.local/cluster_nodes.config",enospc}}}
init terminating in do_boot
({error,{could_not_write_file,/var/lib/rabbitmq/mnesia/rabbit@rabbitmq1-rabbitmq-0.rabbitmq1-dsv-59862a.openstack.svc.cluster.local/cluster_nodes.config,enospc}})

 
Note: The preceding log excerpts are only examples. Dates, times, and environment-specific values will vary in your environment.
 


Environment

VMware Integrated OpenStack 7.x
VMware Integrated OpenStack 6.x

Cause

  • The root cause is that the 20 GB of disk space allocated to RabbitMQ has been exhausted (enospc: "no space left on device"). Most of the space is consumed by queue index files under:
~/mnesia/rabbit@rabbitmq1-rabbitmq-0.rabbitmq1-dsv-59862a.openstack.svc.cluster.local/msg_stores/vhosts/[xxxxxxxxx]/queues/[yyyyyyyy]/*.idx
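
To confirm where the space is going inside the pod, a check along these lines can help (a sketch assuming du and bash are available in the container image, which the workaround below already relies on):

osctl exec rabbitmq1-rabbitmq-0 -- bash -c 'du -sh /var/lib/rabbitmq/mnesia/*/msg_stores/vhosts/*'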

Resolution

This is a known issue affecting VMware Integrated OpenStack 6.0.

Workaround:
  1. Check disk usage of the rabbitmq pod:
osctl exec rabbitmq1-rabbitmq-0 df
 
Note: Pay attention to the Use% column of the /var/lib/rabbitmq line:

/dev/sdc 20511312 50440 20444488 1% /var/lib/rabbitmq
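
Note: For a human-readable view of the same mount, a command along these lines should also work (df -h is standard):

osctl exec rabbitmq1-rabbitmq-0 -- df -h /var/lib/rabbitmq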
  2. Open an interactive TTY to the rabbitmq pod:
osctl exec -it rabbitmq1-rabbitmq-0 /bin/bash
  3. To set the RabbitMQ message TTL, run the following commands:
for vhost in nova glance keystone neutron heat barbican cinder;do rabbitmqctl set_policy --vhost ${vhost} --priority 0 --apply-to all ha_ttl_${vhost} '(notifications)\.' '{"ha-mode":"all","ha-sync-mode":"automatic","message-ttl":70000}' ; done

rabbitmqctl set_policy TTL ".*" '{"message-ttl":70000}' --apply-to queues
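
To verify the policies took effect, you can list them per vhost (list_policies is a standard rabbitmqctl subcommand; the vhost list mirrors the loop above):

for vhost in nova glance keystone neutron heat barbican cinder; do rabbitmqctl list_policies -p ${vhost}; done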
  4. If the /dev/sdc partition is still full, clear out the *.idx files by purging the largest queues (the sorted listing below helps identify them):
rabbitmqctl list_queues
rabbitmqctl purge_queue -p <vhost> <name of queue>
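
To identify which queues are worth purging, you can sort them by message count, for example (nova is only an example vhost; -q suppresses the informational banner):

rabbitmqctl list_queues -q -p nova name messages | sort -k2 -rn | head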
  5. Restart RabbitMQ if it is stopped:
rabbitmqctl force_boot
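
After the restart, you can confirm the node is up and clustered with:

rabbitmqctl cluster_status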


Additional Information

If you cannot get into the rabbitmq pod before it restarts, follow these instructions:
  1. osctl edit statefulset rabbitmq1-rabbitmq
  2. Change the container command as follows (add 'sleep 3600' so the container stays up, giving you a window to exec in before RabbitMQ starts):
      containers:
      - command:
        - bash
        - -c
        - |
          sleep 3600
          rabbitmqctl force_boot
          /tmp/rabbitmq-start.sh
  3. Change the livenessProbe as follows (increase initialDelaySeconds from 30 to 300 to give yourself time to exec into the pod before the liveness probe can restart it):
        livenessProbe:
          exec:
            command:
            - /tmp/rabbitmq-liveness.sh
          failureThreshold: 3
          initialDelaySeconds: 300
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
  4. Save your changes
  5. osctl delete po rabbitmq1-rabbitmq-2
  6. After the new rabbitmq1-rabbitmq-2 pod comes up, remove the message stores (note: this deletes all persisted messages in those vhosts):
osctl exec -ti rabbitmq1-rabbitmq-2 bash
rm -rf /var/lib/rabbitmq/mnesia/rabbit@rabbitmq1-rabbitmq-2.rabbitmq1-dsv-59862a.openstack.svc.cluster.local/msg_stores/vhosts
exit
  7. Now use 'osctl edit statefulset rabbitmq1-rabbitmq' to revert the changes above.
  8. osctl delete po rabbitmq1-rabbitmq-2
Now the new rabbitmq1-rabbitmq-2 pod should come up and become 1/1 Running.
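
To confirm, something like the following should show the pod with STATUS Running (osctl mirrors kubectl syntax):

osctl get pods | grep rabbitmq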