RabbitMQ tile Smoke Test failed due to timeout in 10.0.2+

Article ID: 396334

Products

VMware Tanzu RabbitMQ

Issue/Introduction

After upgrading Tanzu RabbitMQ on Cloud Foundry to 10.0.2 or later, the on-demand instance smoke tests can fail with a timeout, because creating a multi-node service instance takes longer to finish.

Here is an example deployment failure log:

Summarizing 4 Failures:
  [TIMEDOUT] Smoke tests [It] pushes an app, sends, and reads a message from RabbitMQ over TLS: plan 'multi-node-large-5node-qsync'
  /var/vcap/packages/cf-rabbitmq-smoke-tests/src/rabbitmq-smoke-tests/tests/smoke_tests_test.go:91
  [FAIL] [SynchronizedAfterSuite]
  /var/vcap/packages/cf-rabbitmq-smoke-tests/src/rabbitmq-smoke-tests/vendor/github.com/cloudfoundry/cf-test-helpers/v2/workflowhelpers/test_suite_setup.go:153
  [TIMEDOUT] Smoke tests [It] pushes an app, sends, and reads a message from RabbitMQ over TLS: plan 'multi-node-large-3node-qsync'
  /var/vcap/packages/cf-rabbitmq-smoke-tests/src/rabbitmq-smoke-tests/tests/smoke_tests_test.go:91
  [FAIL] [SynchronizedAfterSuite]
  /var/vcap/packages/cf-rabbitmq-smoke-tests/src/rabbitmq-smoke-tests/vendor/github.com/cloudfoundry/cf-test-helpers/v2/workflowhelpers/test_suite_setup.go:153

Ran 8 of 10 Specs in 3625.763 seconds
FAIL! - Suite Timeout Elapsed -- 6 Passed | 2 Failed | 0 Pending | 2 Skipped

Ginkgo ran 1 suite in 1h0m26.684015185s

Test Suite Failed

On the rabbitmq-service VM of the RabbitMQ service instance deployment, the log at /var/vcap/sys/log/rabbitmq-server/rabbit@xx shows that peer discovery takes longer to finish with RabbitMQ tile 10.0.2+:

20##-##-## 07:04:26.578054+00:00 [info] <0.254.0> DB: virgin node -> run peer discovery

20##-##-## 07:16:32.842035+00:00 [error] <0.254.0> Peer discovery: could not discover and join another node; proceeding as a standalone node
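
The delay between these two entries can be measured directly from the timestamps. The following is a minimal Python sketch, not part of the tile or of RabbitMQ. It assumes both entries were logged on the same day (the dates are redacted above), and the helper name `elapsed_seconds` is hypothetical.

```python
from datetime import datetime

# Time-of-day fields copied from the two log lines above. The dates are
# redacted in the article, so both entries are ASSUMED to be on the same day.
START = "07:04:26.578054"  # "DB: virgin node -> run peer discovery"
END = "07:16:32.842035"    # "Peer discovery: could not discover and join another node"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the seconds between two HH:MM:SS.ffffff timestamps on the same day."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


if __name__ == "__main__":
    seconds = elapsed_seconds(START, END)
    print(f"peer discovery took {seconds:.1f} s ({seconds / 60:.1f} min)")
```

For the entries shown, this works out to roughly 12 minutes spent in peer discovery before the first node proceeds as a standalone node.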

 

Environment

Tanzu RabbitMQ on Cloud Foundry 10

Cause

Tanzu RabbitMQ on Cloud Foundry 10.0.2 is based on RabbitMQ 4.0.7, whereas earlier versions are based on older RabbitMQ releases; for example, Tanzu RabbitMQ on Cloud Foundry 10.0.1 is based on RabbitMQ 4.0.3.

RabbitMQ 4.0.5+ includes a change that makes a node wait longer for other configured peers to show up: nodes now retry when connecting to their peers. Unfortunately, this change conflicts with the way the tile deploys clusters. Currently, only one node is deployed initially, and it is expected to start before the other nodes come online. With this change, the first node keeps retrying against peers that do not exist yet, so its startup takes significantly longer and delays the whole deployment.

The settings that control those retries are `rabbit.cluster_formation.discovery_retry_limit` and `rabbit.cluster_formation.discovery_retry_interval`, which default to 30 (retries) and 1000 (milliseconds, or 1s) respectively.

Resolution

Temporary resolution:

1. In the Ops Manager UI, navigate to the Tanzu RabbitMQ tile and go to the On-Demand Instance Plans section. For each plan, set the following values in the Expert Mode: Override Server Config option:

  • cluster_formation.discovery_retry_limit = 2
  • cluster_formation.discovery_retry_interval = 1000

2. Increase the CPU allocated to the RabbitMQ on-demand broker VM.
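
To see what the override in step 1 changes, the time a node spends waiting between discovery retries can be compared for the default and overridden values. This is a rough Python sketch under a simplifying assumption: it counts only `retry_limit` waits of `retry_interval` milliseconds each and ignores the time each connection attempt itself takes, so real delays can be considerably longer. The function name is hypothetical and this is not RabbitMQ's implementation.

```python
def retry_wait_budget_seconds(retry_limit: int, retry_interval_ms: int) -> float:
    """Total time spent waiting between discovery attempts, in seconds.

    Simplified model (an assumption, not RabbitMQ's actual code):
    retry_limit waits of retry_interval_ms each. The duration of each
    connection attempt is not included.
    """
    if retry_limit < 0 or retry_interval_ms < 0:
        raise ValueError("retry_limit and retry_interval_ms must be non-negative")
    return retry_limit * retry_interval_ms / 1000.0


if __name__ == "__main__":
    # Defaults stated in the Cause section: 30 retries, 1000 ms apart.
    default_budget = retry_wait_budget_seconds(30, 1000)
    # Override from step 1: 2 retries, 1000 ms apart.
    override_budget = retry_wait_budget_seconds(2, 1000)
    print(f"default:  {default_budget:.0f} s of waiting between retries")
    print(f"override: {override_budget:.0f} s of waiting between retries")
```

Under this model the override cuts the waiting between retries from 30 seconds to 2 seconds per discovery run; the interval is unchanged, only the number of retries is reduced.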

 

Improvements in 10.0.3:

RabbitMQ tile 10.0.3 introduces two smoke-test settings: a configurable smoke-test timeout, and the ability to select which plans the smoke tests run against.

[smoke-tests timeout] The default smoke-test timeout is 60 minutes. This threshold can be increased if your environment requires a longer timeout.

[select smoke-tests for particular plans] On-demand plans offer the option to configure whether smoke tests are executed against them.

Refer to: Smoke tests configuration available in tile v10.0.3 and later