After upgrading Tanzu RabbitMQ on Cloud Foundry to 10.0.2 or later, the on-demand instance smoke tests can fail with a timeout, because creating a multi-node service instance takes longer to finish.
Here is an example deployment failure log:
Summarizing 4 Failures:
[TIMEDOUT] Smoke tests [It] pushes an app, sends, and reads a message from RabbitMQ over TLS: plan 'multi-node-large-5node-qsync'
/var/vcap/packages/cf-rabbitmq-smoke-tests/src/rabbitmq-smoke-tests/tests/smoke_tests_test.go:91
[FAIL] [SynchronizedAfterSuite]
/var/vcap/packages/cf-rabbitmq-smoke-tests/src/rabbitmq-smoke-tests/vendor/github.com/cloudfoundry/cf-test-helpers/v2/workflowhelpers/test_suite_setup.go:153
[TIMEDOUT] Smoke tests [It] pushes an app, sends, and reads a message from RabbitMQ over TLS: plan 'multi-node-large-3node-qsync'
/var/vcap/packages/cf-rabbitmq-smoke-tests/src/rabbitmq-smoke-tests/tests/smoke_tests_test.go:91
[FAIL] [SynchronizedAfterSuite]
/var/vcap/packages/cf-rabbitmq-smoke-tests/src/rabbitmq-smoke-tests/vendor/github.com/cloudfoundry/cf-test-helpers/v2/workflowhelpers/test_suite_setup.go:153
Ran 8 of 10 Specs in 3625.763 seconds
FAIL! - Suite Timeout Elapsed -- 6 Passed | 2 Failed | 0 Pending | 2 Skipped
Ginkgo ran 1 suite in 1h0m26.684015185s
Test Suite Failed
In the RabbitMQ service instance deployment, the log on the rabbitmq-service VM (/var/vcap/sys/log/rabbitmq-server/rabbit@xx) shows that peer discovery takes longer to finish with RabbitMQ tile 10.0.2 and later:
20##-##-## 07:04:26.578054+00:00 [info] <0.254.0> DB: virgin node -> run peer discovery
20##-##-## 07:16:32.842035+00:00 [error] <0.254.0> Peer discovery: could not discover and join another node; proceeding as a standalone node
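To quantify the delay, the gap between the two log lines above can be computed from their time-of-day fields. The following is a minimal sketch, assuming both lines were written on the same day (the dates are masked in the excerpt); the helper name `elapsed_seconds` is hypothetical.

```python
from datetime import datetime


def elapsed_seconds(start: str, end: str) -> float:
    """Return the seconds between two time-of-day stamps such as
    '07:04:26.578054+00:00'. Assumes both stamps fall on the same day."""
    fmt = "%H:%M:%S.%f%z"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


# Time fields copied from the two log lines above.
print(elapsed_seconds("07:04:26.578054+00:00", "07:16:32.842035+00:00"))
```

For this excerpt the result is a little over 726 seconds, so the first node spent roughly 12 minutes in peer discovery before proceeding as a standalone node.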
Tanzu RabbitMQ on Cloud Foundry 10
Tanzu RabbitMQ on Cloud Foundry 10.0.2 is based on RabbitMQ 4.0.7, whereas earlier versions are based on older RabbitMQ releases; for example, Tanzu RabbitMQ on Cloud Foundry 10.0.1 is based on RabbitMQ 4.0.3.
RabbitMQ 4.0.5+ includes a change that makes a node wait longer for other configured peers to show up: nodes now retry connecting to their peers. Unfortunately, this change conflicts with the way the tile deploys clusters. Currently, only one node is deployed initially, and it is expected to start before the other nodes come online. With this change, the first node takes significantly longer to start, which delays the whole deployment.
The settings that control those retries are `rabbit.cluster_formation.discovery_retry_limit` and `rabbit.cluster_formation.discovery_retry_interval`, which default to 30 retries and 1000 milliseconds (1 second), respectively.
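As a rough illustration of how the two retry settings combine, the sketch below multiplies the limit by the interval and renders the settings as `key = value` lines. It assumes the rabbitmq.conf key names (the same settings without the `rabbit.` prefix); the function names are hypothetical, and the format expected by the tile's Override Server Config field is not stated in this article, so check the tile documentation before pasting anything into it.

```python
def retry_pause_seconds(retry_limit: int, retry_interval_ms: int) -> float:
    """Total pause between discovery attempts, in seconds.
    Counts only the waits between attempts, not the time each attempt takes."""
    return retry_limit * retry_interval_ms / 1000


def render_conf(retry_limit: int, retry_interval_ms: int) -> str:
    """Render both settings as rabbitmq.conf-style lines (assumed format)."""
    return (
        f"cluster_formation.discovery_retry_limit = {retry_limit}\n"
        f"cluster_formation.discovery_retry_interval = {retry_interval_ms}\n"
    )


# Values quoted in this article: 30 retries, 1000 ms apart.
print(retry_pause_seconds(30, 1000))
print(render_conf(30, 1000), end="")
```

With the values quoted above, the pauses alone add up to 30 seconds. The delay seen in the log excerpt is much longer, which suggests that the attempts themselves, and not only the pauses between them, account for most of the wait.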
Temporary workaround:
1. In the Ops Manager UI, navigate to the Tanzu RabbitMQ tile and go to the On-Demand Instance Plans section. For each plan, change the Expert Mode: Override Server Config option.
2. Increase the CPU allocated to the RabbitMQ on-demand broker VM.
Improvements in 10.0.3:
RabbitMQ tile 10.0.3 introduces two options: a configurable smoke test timeout, and the ability to select which plans the smoke tests run against.
[smoke-tests timeout] The default smoke test timeout is 60 minutes. It can be increased if your environment needs more time.
[select smoke-tests for particular plans] Each on-demand plan has an option that controls whether smoke tests are run against it.
Refer to: Smoke tests configuration available in tile v10.0.3 and later