VIO Launching an instance fails with Error: Timed out waiting for a reply to message ID

Article ID: 321721

Products

VMware Integrated OpenStack

Issue/Introduction

Symptoms:
Launching an instance fails with the following error message:
 
Error: Failed to perform requested operation on instance "Instance name", the instance has an error status: Please try again later [Error: Timed out waiting for a reply to message ID].
 
Clicking on the instance, the Fault section shows:
 
Message: Timed out waiting for a reply to message ID
Details:
  File "/usr/lib/python2.7/dist-packages/nova/conductor/manager.py", line 405, in build_instances
    context, request_spec, filter_properties)
  File "/usr/lib/python2.7/dist-packages/nova/conductor/manager.py", line 449, in _schedule_instances
    hosts = self.scheduler_client.select_destinations(context, spec_obj)
  File "/usr/lib/python2.7/dist-packages/nova/scheduler/utils.py", line 372, in wrapped
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/__init__.py", line 51, in select_destinations
    return self.queryclient.select_destinations(context, spec_obj)
  File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/__init__.py", line 37, in __run_method
    return getattr(self.instance, __name)(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/query.py", line 32, in select_destinations
    return self.scheduler_rpcapi.select_destinations(context, spec_obj)
  File "/usr/lib/python2.7/dist-packages/osprofiler/profiler.py", line 154, in wrapper
    return f(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/nova/scheduler/rpcapi.py", line 123, in select_destinations
    return cctxt.call(ctxt, 'select_destinations', **msg_args)
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 158, in call
    retry=self.retry)
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 90, in _send
    timeout=timeout, retry=retry)
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 470, in send
    retry=retry)
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in _send
    result = self._waiter.wait(msg_id, timeout)
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 342, in wait
    message = self.waiters.get(msg_id, timeout=timeout)
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 244, in get
    'to message ID %s' % msg_id)
 
 


Environment

VMware Integrated OpenStack 5.x

Cause

RabbitMQ has encountered a network partition

Resolution

RabbitMQ clusters are sensitive to network conditions: because of the way cluster nodes communicate, RabbitMQ can declare a network partition even when the underlying network disruption was only transient. This can give the impression that RabbitMQ does not tolerate network partitions robustly. To help ensure optimal network performance, keep all network components, firmware, and drivers as up to date as is practically compatible with the solutions in use.
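 
How RabbitMQ reacts once it declares a partition is governed by its cluster_partition_handling setting (ignore, pause_minority, or autoheal). VMware Integrated OpenStack manages this configuration, so do not change it by hand; purely as an illustrative, read-only check (assuming the standard rabbitmqctl eval subcommand is available in the deployed RabbitMQ version), the active policy can be inspected from a database node:
 
# Read-only: show the active partition-handling policy (run as root on a database node)
rabbitmqctl eval 'application:get_env(rabbit, cluster_partition_handling).'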
 
To determine whether RabbitMQ has detected a network partition, check the RabbitMQ cluster status:
 
1) Using SSH, log in to the VMware Integrated OpenStack manager.
 
2) From the VMware Integrated OpenStack manager, use SSH to log in to one of the database nodes (for example, database01).
 
3) Switch to root user.
 
  • sudo su -
 
4) Run command:
 
  • rabbitmqctl cluster_status
 
5) If the status check returns output similar to the following, RabbitMQ has experienced a network partition (a check across all nodes is sketched after this output):
 
Cluster status of node rabbit@database01 ...
[{nodes,[{disc,[rabbit@database01,rabbit@database02,rabbit@database03]}]},
{running_nodes,[rabbit@database03,rabbit@database01]},
{cluster_name,<<"rabbit@database01">>},
{partitions,[{rabbit@database03,[rabbit@database02]}]}]
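 
In this example, rabbit@database02 is missing from running_nodes and the partitions entry is non-empty, which confirms the partition. All database nodes can be checked in one pass with a loop such as the following. This is a minimal sketch, assuming SSH access as viouser to database nodes named database01 through database03 with passwordless sudo (hypothetical names; adjust to your deployment):
 
# Check every database node for reported partitions (hypothetical node names)
for node in database01 database02 database03; do
  echo "== ${node} =="
  ssh viouser@"${node}" "sudo rabbitmqctl cluster_status | grep partitions"
done
 
A healthy node reports {partitions,[]}.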
 
 
To work around this issue, follow these steps:
 
a) Log in to OpenStack Management Server (OMS) as viouser.
 
b) Switch to root user.
 
  • sudo su -
 
c) Run the following command:
 
  • viocli services stop && viocli services start
 
d) Once all the services have completely restarted, check again whether the network partition has been resolved by repeating the commands from steps 1 through 5 above (a scripted version of this restart-and-verify sequence is sketched after these steps). If the procedure has been successful, the output should resemble:
 
Cluster status of node rabbit@database01 ...
[{nodes,[{disc,[rabbit@database01,rabbit@database02,rabbit@database03]}]},
{running_nodes,[rabbit@database02,rabbit@database03,rabbit@database01]},
{cluster_name,<<"rabbit@database01">>},
{partitions,[]}]
 
e) Try again to launch an instance.
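 
The restart and verification in steps c) and d) can also be combined into a single scripted pass. The following is a minimal sketch, assuming it runs as root on the OMS and that a database node named database01 is reachable over SSH as viouser with passwordless sudo (assumptions not confirmed by this article):
 
# Restart the VIO services, then poll RabbitMQ until no partitions are reported
viocli services stop && viocli services start
until ssh viouser@database01 "sudo rabbitmqctl cluster_status | grep -q '{partitions,\[\]}'"; do
  echo "Waiting for the RabbitMQ partition to clear..."
  sleep 30
done
echo "RabbitMQ cluster reports no partitions."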


Additional Information

How to access the password for VMware Integrated OpenStack components
[INTERNAL]VIO Log locations and descriptions
VIO deployment shutdown fails at 55 percent
[Internal] VIO rabbitmq queue, notifications.info, has no consumers and grows and consumes memory
[Internal] VIO - RabbitMQ Web GUI