RabbitMQ service not starting and showing red in vIDM dashboard

Article ID: 367757

Products

VMware Aria Suite

Issue/Introduction

During vIDM boot, RabbitMQ does not start, and the vIDM dashboard shows the error "There was a problem: Messaging service: Error retrieving RabbitMQ status".

Environment

VMware Identity Manager 3.3.x

Cause

The horizon logs contain messages like the following, with the health check reporting the cause as "Messaging Connection: Messaging connection test failed":

    2019-02-01T17:53:43,781 WARN  (subscriber-thread-285) [;;;] com.vmware.horizon.messaging.channel.http.HttpChannel - Stop resending message to: http://127.0.0.1/AUDIT/API/1.0/REST/audit/consume. Status code: 500
    2019-02-01T17:53:43,781 WARN  (subscriber-thread-285) [;;;] com.vmware.horizon.messaging.provider.rabbitmq.RabbitMQMessageSubscriber - Subscriber [id: -.analytics.06659a45-87c4-4009-bebc-26c0c59284d7] message added back to queue because: Cannot send message to: AnalyticsHttpChannel[callbackUri=http://127.0.0.1/AUDIT/API/1.0/REST/audit/consume,serviceAuthTokenProvider=com.vmware.horizon.components.identity.accesscontrol.ServiceAuthTokenProvider@55a7c5ef,sslUtils=com.vmware.horizon.security.utils.SSLUtils@597aa1d,defaultHttpClient=org.apache.http.impl.client.InternalHttpClient@310fbbe5,authMetadata=,httpPost=] (fail.send.callback.uri). [DeliveryTag:993]
    2019-02-01T17:53:43,781 WARN  (subscriber-thread-285) [;;;] com.vmware.horizon.messaging.provider.rabbitmq.RabbitMQMessageSubscriber - Subscriber [id: -.analytics.06659a45-87c4-4009-bebc-26c0c59284d7] is retrying current message for 3th time
    2019-02-01T17:53:43,781 INFO  (subscriber-thread-285) [;;;] com.vmware.horizon.messaging.provider.rabbitmq.RabbitMQMessageSubscriber - Subscriber [id: -.analytics.06659a45-87c4-4009-bebc-26c0c59284d7] has one message requeued.
    2019-02-01T17:53:43,781 WARN  (subscriber-thread-285) [;;;] com.vmware.horizon.messaging.provider.rabbitmq.RabbitMQMessageSubscriber - Subscriber [id: -.analytics.06659a45-87c4-4009-bebc-26c0c59284d7] reached more than 10 errors in a row,

As more and more messages pile up in RabbitMQ, they can consume all of the disk space available to it. Once free space runs low, RabbitMQ blocks the connection, i.e. it becomes unhealthy. Check the log below:

    2019-02-08T00:44:25,647 INFO  (AMQP Connection 127.0.0.1:5672) [;;;] com.vmware.horizon.messaging.provider.rabbitmq.RabbitMQMessagingProvider - Connection to localhost unblocked by RabbitMQ
    2019-02-08T00:44:25,648 WARN  (subscriber-thread-285) [;;;] com.vmware.horizon.messaging.provider.rabbitmq.RabbitMQMessageSubscriber - Subscriber [id: -.analytics.06659a45-87c4-4009-bebc-26c0c59284d7] reached more than 1000 errors in a row, disabling.
    2019-02-08T00:44:31,808 WARN  (AMQP Connection 127.0.0.1:5672) [;;;] com.vmware.horizon.messaging.provider.rabbitmq.RabbitMQMessagingProvider - Connection to localhost blocked by RabbitMQ: low on disk

root@idm [ ~ ]# rabbitmqctl stop_app
Stopping rabbit application on node rabbitmq@idm ...
Error: unable to perform an operation on node 'rabbitmq@idm'. Please see diagnostics information and suggestions below.

root@idm [ ~ ]# rabbitmqctl force_reset
Error: unable to perform an operation on node 'rabbitmq@idm'. Please see diagnostics information and suggestions below.

root@idm [ ~ ]# rabbitmqctl start_app
Starting node rabbitmq@idm ...
Error: unable to perform an operation on node 'rabbitmq@idm'. Please see diagnostics information and suggestions below.
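
Even when rabbitmqctl cannot reach the node, the underlying cause can be confirmed from the shell; a minimal check:

    # Check whether the RabbitMQ (beam) process is running at all;
    # "unable to perform an operation on node" often means the broker
    # process is down or the Erlang cookie does not match
    ps aux | grep -i "[b]eam"
    # Confirm the disk pressure reported in the log above
    df -h /db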

Resolution

1. Take a snapshot and run the following commands (a queue-depth check to verify the cleanup follows this list):

    rabbitmqctl status
    rabbitmqctl list_queues | grep analytics
    service horizon-workspace stop
    rabbitmqctl reset        (if this fails, continue with the next three commands)
    rabbitmqctl stop_app
    rabbitmqctl force_reset
    rabbitmqctl start_app
    service horizon-workspace start

    Run these on each node, waiting for the workspace to be fully up before moving on to the next node so there is no danger of downtime for users.
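
To verify the cleanup, check the depth of the analytics queue before and after the reset; a minimal sketch, assuming a queue whose name contains "analytics" as in the logs above:

    # The second column is the number of messages waiting in each queue;
    # after a successful reset it should be zero or near zero
    rabbitmqctl list_queues name messages | grep -i analytics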

 

2. Check the free space on /db with "df", as shown below. If there is plenty of space after clearing out RabbitMQ and Elasticsearch (usage should be below 10%), no further action is needed; otherwise, increase the size of the /db filesystem. See also:

    vIDM appliance has no space left on device /db for audit data
    vIDM 3.3.x vPostgres DB OAuth2RefreshToken table consumes most space on the appliance leading to service outages
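
To see what is actually consuming /db before resizing the filesystem, survey the largest directories; a minimal sketch, assuming the standard appliance layout:

    # Overall usage of the /db filesystem
    df -h /db
    # Largest consumers under /db, sorted by size
    du -sh /db/* 2>/dev/null | sort -h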

3. If the messaging connection is still not fixed after this, run the following (a verification check follows):

    rabbitmqctl stop_app
    rabbitmqctl reset
    rabbitmqctl start_app
    rabbitmq-server -detached
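
After the restart, confirm the broker is reachable before moving on; a minimal verification:

    # Should now return node status without the earlier
    # "unable to perform an operation" error
    rabbitmqctl status
    rabbitmqctl cluster_status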


4. If the above commands do not resolve the issue, run the commands below and the RabbitMQ service should come back to a working state (a note on preserving the old data directory follows the commands):

   systemctl stop rabbitmq-server.service
   rm -rf /db/rabbitmq/data
   service horizon-workspace restart
   systemctl start rabbitmq-server.service
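
Since this step deletes the RabbitMQ data directory outright (any queued messages are lost), it can be safer to move the directory aside until the service is confirmed healthy; a minimal sketch using the same path:

    # Preserve the old data directory instead of deleting it;
    # RabbitMQ recreates it on the next service start
    mv /db/rabbitmq/data /db/rabbitmq/data.old.$(date +%Y%m%d)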

 

5. If the above commands do not resolve the RabbitMQ issue, try the following steps (an illustrative before/after of the ExecStart line follows):

    Edit /etc/systemd/system/multi-user.target.wants/rabbitmq-server.service
    and remove " -detached &" from the ExecStart command.
    Save the changes.

    Then run the command below and reboot:

    chown -R rabbitmq:rabbitmq /db/RabbitMQ
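
For illustration, the edit looks like this; the exact ExecStart path is an assumption and may differ between appliance versions:

    # Before (the broker forks and systemd loses track of it):
    #   ExecStart=/usr/sbin/rabbitmq-server -detached &
    # After (the broker runs in the foreground under systemd):
    #   ExecStart=/usr/sbin/rabbitmq-server
    # Make systemd pick up the edited unit before rebooting
    systemctl daemon-reload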

 

6. If horizon.log shows multiple RabbitMQ scheduler messages:

    Running the following command may show that only 2 nodes are in the OpenSearch cluster:

    curl http://localhost:9200/_cluster/state/nodes,master_node?pretty

    To resolve this, perform the following on all 3 appliances:

    /etc/init.d/opensearch stop
    systemctl stop rabbitmq-server.service
    systemctl start rabbitmq-server.service
    rabbitmqctl list_queues | grep -i analytics
    rabbitmq-server -detached &
    /etc/init.d/opensearch start

    Then run curl http://localhost:9200/_cluster/state/nodes,master_node?pretty again; all 3 appliances should be listed along with the master node, confirming OpenSearch is healthy.

    Then run:

    /usr/sbin/hznAdminTool liquibaseOperations -forceReleaseLocks 
    service horizon-workspace restart
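
As a final check that the OpenSearch cluster reformed correctly, the cluster health endpoint can also be queried; "number_of_nodes" should be 3 and "status" should be green:

    curl http://localhost:9200/_cluster/health?pretty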