RabbitMQ tile Smoke Test failed due to timeout in 10.0.2+

Products

VMware Tanzu RabbitMQ

Issue/Introduction

After upgrading Tanzu RabbitMQ on Cloud Foundry to 10.0.2 or later, the on-demand instance smoke tests can fail with a timeout, because creating a multi-node service instance takes longer to finish.

Here is an example deployment failure log:

Summarizing 4 Failures:

[TIMEDOUT] Smoke tests [It] pushes an app, sends, and reads a message from RabbitMQ over TLS: plan 'multi-node-large-5node-qsync'

/var/vcap/packages/cf-rabbitmq-smoke-tests/src/rabbitmq-smoke-tests/tests/smoke_tests_test.go:91

[FAIL] [SynchronizedAfterSuite]

/var/vcap/packages/cf-rabbitmq-smoke-tests/src/rabbitmq-smoke-tests/vendor/github.com/cloudfoundry/cf-test-helpers/v2/workflowhelpers/test_suite_setup.go:153

[TIMEDOUT] Smoke tests [It] pushes an app, sends, and reads a message from RabbitMQ over TLS: plan 'multi-node-large-3node-qsync'

/var/vcap/packages/cf-rabbitmq-smoke-tests/src/rabbitmq-smoke-tests/tests/smoke_tests_test.go:91

[FAIL] [SynchronizedAfterSuite]

/var/vcap/packages/cf-rabbitmq-smoke-tests/src/rabbitmq-smoke-tests/vendor/github.com/cloudfoundry/cf-test-helpers/v2/workflowhelpers/test_suite_setup.go:153

Ran 8 of 10 Specs in 3625.763 seconds

FAIL! - Suite Timeout Elapsed -- 6 Passed | 2 Failed | 0 Pending | 2 Skipped

Ginkgo ran 1 suite in 1h0m26.684015185s

Test Suite Failed

On the rabbitmq-service VM of the RabbitMQ service instance deployment, the log at /var/vcap/sys/log/rabbitmq-server/rabbit@xx shows that peer discovery takes longer to finish in RabbitMQ tile 10.0.2+:

20##-##-## 07:04:26.578054+00:00 [info] <0.254.0> DB: virgin node -> run peer discovery

20##-##-## 07:16:32.842035+00:00 [error] <0.254.0> Peer discovery: could not discover and join another node; proceeding as a standalone node
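To quantify the delay, the elapsed time between the two log lines above can be computed from their timestamps. The following is a minimal Python sketch, not part of the tile or of RabbitMQ. The helper names `time_of_day` and `peer_discovery_duration` are hypothetical, and the sketch assumes both entries were logged on the same day with a `+00:00` offset, since the dates in the excerpt are masked.

```python
import re
from datetime import datetime, timedelta

# Matches the time-of-day part of a log line, e.g. "07:04:26.578054+00:00".
# Assumption: the offset is always +00:00, as in the excerpt above.
TIME_RE = re.compile(r"(\d{2}:\d{2}:\d{2}\.\d{6})\+00:00")


def time_of_day(line: str) -> datetime:
    """Extract the time of day from one log line (hypothetical helper)."""
    match = TIME_RE.search(line)
    if match is None:
        raise ValueError(f"no timestamp found in line: {line!r}")
    return datetime.strptime(match.group(1), "%H:%M:%S.%f")


def peer_discovery_duration(start_line: str, end_line: str) -> timedelta:
    """Elapsed time between two log lines written on the same day."""
    return time_of_day(end_line) - time_of_day(start_line)


START = ("20##-##-## 07:04:26.578054+00:00 [info] <0.254.0> "
         "DB: virgin node -> run peer discovery")
END = ("20##-##-## 07:16:32.842035+00:00 [error] <0.254.0> "
       "Peer discovery: could not discover and join another node; "
       "proceeding as a standalone node")

if __name__ == "__main__":
    print(peer_discovery_duration(START, END))  # 0:12:06.263981
```

In this example the first node spent about 12 minutes in peer discovery before proceeding as a standalone node.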

Environment

Tanzu RabbitMQ on Cloud Foundry 10

Cause

Tanzu RabbitMQ on Cloud Foundry 10.0.2 is based on RabbitMQ 4.0.7, whereas earlier versions are based on older RabbitMQ releases. For example, Tanzu RabbitMQ on Cloud Foundry 10.0.1 is based on RabbitMQ 4.0.3.

RabbitMQ 4.0.5+ includes a change that makes a node wait longer for other configured peers to show up: the node now retries its connection attempts. This change conflicts with the way the tile deploys clusters. Currently, the tile deploys only one node at first and expects it to start before the other nodes come online. Because that first node now retries peer discovery while no peers are available, it takes significantly longer to start, which delays the deployment.

The settings that control those retries are `rabbit.cluster_formation.discovery_retry_limit` and `rabbit.cluster_formation.discovery_retry_interval`, which default to 30 retries and 1000 milliseconds (1 second) respectively.
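As a rough illustration of how the two settings combine, the Python sketch below multiplies the retry limit by the retry interval. The function name `min_retry_wait_seconds` is hypothetical. The result covers only the pauses between attempts; the time each discovery attempt itself takes is not included, so the delay observed on a real node can be longer.

```python
def min_retry_wait_seconds(retry_limit: int = 30, retry_interval_ms: int = 1000) -> float:
    """Total pause time between peer discovery attempts, in seconds.

    Defaults mirror the values stated above (30 retries, 1000 ms).
    This is a lower bound: it ignores how long each attempt takes.
    """
    if retry_limit < 0 or retry_interval_ms < 0:
        raise ValueError("retry limit and interval must be non-negative")
    return retry_limit * retry_interval_ms / 1000


if __name__ == "__main__":
    print(min_retry_wait_seconds())        # 30.0
    print(min_retry_wait_seconds(5, 500))  # 2.5
```

With the defaults, the pauses alone add at least 30 seconds to the startup of a node that cannot find any peers.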

Resolution

Temporary resolution:

1. In the Ops Manager UI, navigate to the Tanzu RabbitMQ tile and go to the On-Demand Instance Plans section. For each plan, change the Expert Mode: Override Server Config option.