Supportability and Capacity Planning for Warm Standby Replication (WSR) with Unequal Cluster Sizes

Article ID: 442587

Updated On:

Products

VMware Tanzu RabbitMQ

Issue/Introduction

This article addresses whether Warm Standby Replication (WSR) is supported between clusters with different node counts (e.g., a 5-node Production cluster replicating to a 3-node Disaster Recovery cluster), and what capacity planning such a setup requires.

Environment

RabbitMQ DR

Cause

Tanzu RabbitMQ Warm Standby Replication is an asynchronous, metadata-aware replication mechanism. While it does not technically enforce a 1:1 node ratio between the upstream (primary) and downstream (standby) clusters, operational stability is highly dependent on the resource availability at the destination site during promotion.

Resolution

Technical Supportability

Yes, WSR is technically supported between clusters of unequal sizes. The replication protocol focuses on the stream of messages and schema definitions rather than physical node mapping.

Critical Capacity Considerations

While the replication will function, administrators must account for the following risks during a Disaster Recovery (DR) event:

  1. Promotion Overload: If the production workload requires the resources of five nodes to maintain performance, promoting a 3-node standby site will likely lead to immediate resource exhaustion (CPU/Memory/Disk I/O). This can result in a failure cascade, making the DR site unusable.
  2. Selective Replication: To mitigate the risk of overloading a smaller DR site, it is recommended to replicate only mission-critical workloads. Non-essential queues or streams should be excluded from the replication set to ensure the DR site can handle the promoted load.
  3. Manual Promotion Requirement: Promotion remains a manual, operator-driven action. This is intentional to ensure that a human assesses whether the standby site is ready to receive the workload before the role-flip occurs.
  4. Network Throughput: Ensure that the network link between the sites can sustain the aggregate replication throughput of the 5-node upstream cluster, as all replicated data converges on the smaller downstream cluster.
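
The promotion overload and selective replication points above can be made concrete with a rough estimate. The following is a minimal sketch, not a sizing tool: it assumes identically sized nodes, a load that spreads evenly across nodes, and a single utilization figure taken from your own monitoring. The names ClusterProfile, projected_standby_utilization, and replicated_fraction are hypothetical and are not part of Tanzu RabbitMQ.

```python
from dataclasses import dataclass


@dataclass
class ClusterProfile:
    """Hypothetical summary of a cluster, filled in from your own monitoring."""
    nodes: int
    peak_utilization: float  # average per-node utilization at peak, 0.0 to 1.0


def projected_standby_utilization(upstream: ClusterProfile,
                                  standby_nodes: int,
                                  replicated_fraction: float = 1.0) -> float:
    """Estimate per-node utilization on the standby cluster after promotion.

    Assumption: nodes are the same size on both sites and load spreads evenly.
    replicated_fraction is the share of the production load that is
    replicated and would move to the standby site (1.0 = everything).
    """
    if standby_nodes <= 0:
        raise ValueError("standby_nodes must be positive")
    if not 0.0 <= replicated_fraction <= 1.0:
        raise ValueError("replicated_fraction must be between 0.0 and 1.0")
    # Total load expressed in "fully busy node" equivalents.
    total_load = upstream.nodes * upstream.peak_utilization
    return total_load * replicated_fraction / standby_nodes


# Example figures (assumed): 5 production nodes running at 60% at peak.
production = ClusterProfile(nodes=5, peak_utilization=0.6)
full = projected_standby_utilization(production, standby_nodes=3)
critical_only = projected_standby_utilization(
    production, standby_nodes=3, replicated_fraction=0.5)

print(f"Full workload on 3 nodes:      {full:.0%} per node")
print(f"Critical workloads on 3 nodes: {critical_only:.0%} per node")
```

With these assumed figures, promoting the full workload leaves the 3-node site with no headroom, while replicating only half of the load keeps it well below saturation. Real workloads rarely spread evenly, so treat the result as a first check before running a promotion test.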

Best Practices

  • Monitoring: Use the Tanzu RabbitMQ Management UI to monitor Latest timestamp and Message count on the standby site.
  • Retention Sizing: Configure standby.replication.retention.size_limit.messages on the upstream cluster to ensure that internal replication streams do not exhaust production disk space if the downstream cluster lags.
  • Testing: Regularly perform "Promotion Tests" using a subset of production data to validate the performance of the smaller DR cluster under load.
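
For the retention sizing practice above, a starting value can be estimated from the publish rate and the amount of downstream lag you want to tolerate. This is a rough sketch under stated assumptions: the limit is taken to be a size in bytes (confirm against the product documentation for your version), and the function name, parameters, and safety factor are hypothetical illustrations rather than part of Tanzu RabbitMQ.

```python
import math


def retention_bytes_needed(publish_rate_msgs_per_s: float,
                           avg_message_bytes: float,
                           tolerated_lag_s: float,
                           safety_factor: float = 1.5) -> int:
    """Estimate how many bytes of replicated messages must be retained
    upstream so the standby can lag for tolerated_lag_s without data loss.

    Assumption: a steady publish rate and a representative average size.
    """
    if min(publish_rate_msgs_per_s, avg_message_bytes, tolerated_lag_s) < 0:
        raise ValueError("inputs must be non-negative")
    if safety_factor < 1.0:
        raise ValueError("safety_factor must be at least 1.0")
    raw = publish_rate_msgs_per_s * avg_message_bytes * tolerated_lag_s
    return math.ceil(raw * safety_factor)


# Example figures (assumed): 2,000 msg/s, 1 KiB messages, 1 hour of lag.
needed = retention_bytes_needed(
    publish_rate_msgs_per_s=2000,
    avg_message_bytes=1024,
    tolerated_lag_s=3600,
)
print(f"Suggested retention size: {needed} bytes ({needed / 1e9:.1f} GB)")
```

Compare the result with the free disk space on the upstream nodes: if the estimate does not fit, either shorten the tolerated lag or reduce the set of replicated queues and streams.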