Technical Supportability
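Yes. Warm Standby Replication (WSR) is technically supported between clusters of unequal sizes, for example a 5-node upstream (production) cluster replicating to a 3-node downstream (standby) cluster. The replication protocol works on the stream of messages and schema definitions rather than on a physical node-to-node mapping.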
Critical Capacity Considerations
Although replication itself will function, administrators must account for the following when planning for a Disaster Recovery (DR) event:
- Promotion Overload: If the production workload needs the resources of five nodes to maintain performance, promoting a three-node standby site will likely exhaust CPU, memory, or disk I/O almost immediately. This can trigger a cascading failure that leaves the DR site unusable.
- Selective Replication: To mitigate the risk of overloading a smaller DR site, it is recommended to replicate only mission-critical workloads. Non-essential queues or streams should be excluded from the replication set to ensure the DR site can handle the promoted load.
- Manual Promotion Requirement: Promotion remains a manual, operator-driven action. This is intentional to ensure that a human assesses whether the standby site is ready to receive the workload before the role-flip occurs.
- Network Throughput: Ensure that the network link between the sites can carry the aggregate replicated throughput of the 5-node upstream cluster, because all replicated data converges on the smaller downstream cluster.
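The overload and throughput considerations above come down to simple arithmetic. The following is a minimal Python sketch, not a Tanzu RabbitMQ tool: the function names, the 70% utilization ceiling, and the sample figures are illustrative assumptions. It also assumes identical node hardware and evenly spread load, which real queue placement rarely guarantees.

```python
def projected_utilization(upstream_nodes, upstream_util, downstream_nodes,
                          replicated_fraction=1.0):
    """Estimate per-node utilization on the standby site after promotion.

    upstream_util and replicated_fraction are fractions between 0 and 1.
    Assumes identical hardware and evenly distributed load (simplifications).
    """
    if upstream_nodes <= 0 or downstream_nodes <= 0:
        raise ValueError("node counts must be positive")
    total_load = upstream_nodes * upstream_util * replicated_fraction
    return total_load / downstream_nodes


def can_absorb(upstream_nodes, upstream_util, downstream_nodes,
               replicated_fraction=1.0, ceiling=0.70):
    """Return True if projected utilization stays at or below the ceiling."""
    projected = projected_utilization(
        upstream_nodes, upstream_util, downstream_nodes, replicated_fraction
    )
    return projected <= ceiling


def required_link_mbps(msgs_per_sec, avg_msg_bytes):
    """Payload bandwidth in megabits per second, ignoring protocol overhead."""
    return msgs_per_sec * avg_msg_bytes * 8 / 1_000_000


# Hypothetical figures: five upstream nodes at 50% utilization, three standby nodes.
full = projected_utilization(5, 0.50, 3)
partial = projected_utilization(5, 0.50, 3, replicated_fraction=0.6)
print(f"full replication:      {full:.0%} per standby node")
print(f"selective replication: {partial:.0%} per standby node")
print("link needed (Mbit/s):", required_link_mbps(20_000, 2_048))
```

With these sample figures, replicating the full workload projects to roughly 83% per standby node, above the assumed 70% ceiling, while replicating 60% of the workload projects to 50%. Real measurements from the production cluster should replace the sample inputs before drawing any conclusion.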
Best Practices
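- Monitoring: Use the Tanzu RabbitMQ Management UI to monitor the "Latest timestamp" and "Message count" values on the standby site. A timestamp that stops advancing, or a message count that falls behind the upstream, suggests that replication is lagging.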
- Retention Sizing: Configure standby.replication.retention.size_limit.messages on the upstream cluster to ensure that internal replication streams do not exhaust production disk space if the downstream cluster lags.
- Testing: Regularly perform "Promotion Tests" using a subset of production data to validate the performance of the smaller DR cluster under load.
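As a rough aid for the retention-sizing item above, the sketch below estimates a value for standby.replication.retention.size_limit.messages from the publish rate and the longest downstream lag you want to tolerate. The helper names, the 1.5 safety factor, the 25% disk-share cap, and the sample inputs are all assumptions made for illustration. It also assumes the setting is expressed in bytes, which should be confirmed against the documentation for your version.

```python
import math


def retention_size_bytes(msgs_per_sec, avg_msg_bytes, max_lag_seconds,
                         safety_factor=1.5):
    """Bytes needed to buffer replicated messages for the tolerated lag window."""
    if min(msgs_per_sec, avg_msg_bytes, max_lag_seconds, safety_factor) <= 0:
        raise ValueError("all inputs must be positive")
    return math.ceil(msgs_per_sec * avg_msg_bytes * max_lag_seconds * safety_factor)


def fits_on_disk(limit_bytes, disk_bytes, max_share=0.25):
    """Check that the retention limit uses at most max_share of the disk."""
    return limit_bytes <= disk_bytes * max_share


# Hypothetical inputs: 2,000 msg/s, 1 KiB average size, one hour of tolerated lag.
limit = retention_size_bytes(2_000, 1_024, 3_600)
print(f"standby.replication.retention.size_limit.messages = {limit}")
print("fits within 25% of a 100 GB disk:", fits_on_disk(limit, 100 * 10**9))
```

If the computed limit does not fit within the disk share you are willing to give up, either shorten the tolerated lag window or reduce the replicated workload rather than raising the limit.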