VLSR - Calculating Bandwidth Requirements for vSphere Replication

Products

VMware Live Recovery
VMware vSphere ESXi

Issue/Introduction

Storage and network bandwidth requirements can increase when using vSphere Replication. Besides the network transfers of replicated data from the primary site to the vSphere Replication server, and the transfers from the vSphere Replication server to the ESXi host, the host writes the data to storage once and then, due to the use of redo log snapshots, reads the data back and rewrites it to storage.

The amount of network bandwidth that vSphere Replication requires to replicate virtual machines efficiently depends on several factors in your environment.

Network-based storage
Size of dataset
Data change rate
Recovery point objective (RPO)
Link speed

Network-Based Storage

Network bandwidth requirements increase if all storage is network-based, because data operations between the host and the storage also use the network. Each piece of replicated data travels over the network several times.

Between the host running the replicated virtual machine and the vSphere Replication server.
Between the vSphere Replication server and a host with access to the replication target datastore.
Between the host and storage.
When redo logs have to be collapsed, between storage and the host (twice).

You should be aware of these levels of traffic when planning your deployment, and if necessary acquire more networking hardware and resources to support your workload.
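As a rough illustration of the list above, the following sketch counts how many times a replicated amount of data crosses the network when every hop, including storage access, shares the same network. The traversal counts are taken directly from the list; the function name and the 4GB sample value are illustrative assumptions, and the result is an upper bound rather than a measurement.

```python
# Network traversals per piece of replicated data, as listed above.
# Assumes all storage is network-based and all hops share one network.
TRAVERSALS = {
    "source host -> vSphere Replication server": 1,
    "vSphere Replication server -> target host": 1,
    "target host -> storage": 1,
    "redo log collapse (storage <-> host)": 2,
}


def total_network_gb(replicated_gb, traversals=TRAVERSALS):
    """Return the total network traffic generated by one replication bundle."""
    return replicated_gb * sum(traversals.values())


# A hypothetical 4GB bundle generates up to 20GB of network traffic.
print(total_network_gb(4))
```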

Network-based storage is primarily a concern when you are replicating virtual machines within a single vCenter Server instance that shares the network for all the items listed above. When you have two sites with a vCenter Server instance on each site, the main bottleneck is the link speed between the two sites. The rest of the load that occurs in the second vCenter Server instance is not as important as it is with a single vCenter Server instance.

Size of Dataset

Usually, vSphere Replication does not protect every virtual machine in your environment. Similarly, vSphere Replication does not necessarily protect every VMDK file in the protected virtual machines. To evaluate the size of the dataset that vSphere Replication protects, you must look at the datastores and calculate the percentage of the total storage that you use for virtual machines that you protect with vSphere Replication. You must then calculate the number of VMDKs within that subset that you have configured for replication.

For example, suppose you have 2TB of virtual machines on the datastores and you use vSphere Replication to protect half of these virtual machines. You might only protect a subset of the VMDKs, but this example assumes that all the VMDKs are protected. So, the maximum amount of data for replication is 1TB.
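The same estimate can be written as a small calculation. This is a minimal sketch; the function and parameter names are illustrative and not part of any VMware tool.

```python
def protected_dataset_tb(total_tb, protected_vm_fraction, replicated_vmdk_fraction=1.0):
    """Estimate the maximum amount of data subject to replication.

    total_tb: total size of virtual machines on the datastores
    protected_vm_fraction: share of that storage used by protected VMs
    replicated_vmdk_fraction: share of the protected VMs' VMDK data that
        is actually configured for replication (1.0 = all VMDKs)
    """
    return total_tb * protected_vm_fraction * replicated_vmdk_fraction


# 2TB of virtual machines, half protected, all VMDKs replicated: 1TB
print(protected_dataset_tb(2.0, 0.5))
```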

Data Change Rate

The data change rate, or churn, is the key to all calculations with vSphere Replication. vSphere Replication does not replicate every block of the protected dataset. The data change rate is very tightly coupled with the recovery point objective (RPO). To estimate the size of the data transfer for each replication, you must evaluate how many blocks change in a given RPO for a virtual machine. This is not always easy to estimate, so try to estimate overall averages. For example, you can estimate that you have a daily data change rate of 10% of the dataset that vSphere Replication protects. If the protected dataset is 1TB and the data change rate is 10% per day, the set of blocks to transfer each day equals approximately 100GB.

However, not all of the 100GB necessarily needs to be transferred. vSphere Replication transfers blocks based on the RPO schedule. If you set an RPO of 1 hour, vSphere Replication transfers any block that has changed in that hour in order to meet that RPO. This does not mean that vSphere Replication transfers a block every time that it changes. If a given block changes 100 times within an hour, vSphere Replication does not transfer it 100 times. vSphere Replication only transfers the block once, in its current state at the moment that vSphere Replication creates the bundle of blocks for transfer. vSphere Replication only registers that the block has changed within the RPO period, not how many times it changed. The average daily data change rate therefore provides a high-water mark rather than a realistic estimate of how much data vSphere Replication transfers, or how often the transfers occur.

If you set an RPO of 1 hour, replication occurs 24 times per day. So, if you assume that you have a maximum of 100GB to transfer every day, 100GB of data divided by 24 replications means that the average size of each replication is a little over 4GB.
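The arithmetic above can be sketched as follows. The function name is illustrative, the sketch assumes that changes are spread evenly over the day, and it uses decimal units (1TB = 1000GB) as the example in this article does.

```python
def average_bundle_gb(dataset_gb, daily_change_fraction, rpo_hours):
    """Average size of one replication, using the daily change as an upper bound."""
    daily_change_gb = dataset_gb * daily_change_fraction
    replications_per_day = 24 / rpo_hours
    return daily_change_gb / replications_per_day


# 1TB protected, 10% daily change, 1 hour RPO: a little over 4GB per replication
print(round(average_bundle_gb(1000, 0.10, 1), 2))
```

With a 4 hour RPO the same inputs give 6 replications per day and a bundle of roughly 16.7GB, which is why the bundle size cannot be considered separately from the RPO.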

Recovery Point Objective

The RPO is another key to calculating the traffic patterns for replication. vSphere Replication detects how many blocks have changed during the RPO period and only transfers the blocks that have changed. The data change rate within an RPO period provides the total number of blocks that vSphere Replication transfers. This number might vary throughout the day, which alters the traffic that vSphere Replication generates at different times. If you have systems that are busy during business hours but are idle at night, the overall average figure of 100GB to transfer each day might be accurate, but the 4GB per replication might vary significantly over the course of the 24-hour period.
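To illustrate how far the per-replication size can drift from the daily average, the sketch below splits the 100GB daily change between busy and idle hours. The 10 busy hours and the 90% share of changes are hypothetical values chosen for illustration only.

```python
def hourly_bundles_gb(daily_change_gb, busy_hours, busy_share):
    """Return (busy_hour_bundle, idle_hour_bundle) in GB for a 1 hour RPO.

    busy_share is the fraction of the daily change that occurs during
    the busy hours; the remainder is spread over the idle hours.
    """
    busy = daily_change_gb * busy_share / busy_hours
    idle = daily_change_gb * (1 - busy_share) / (24 - busy_hours)
    return busy, idle


busy, idle = hourly_bundles_gb(100, busy_hours=10, busy_share=0.9)
# About 9GB per replication during busy hours and about 0.7GB at night,
# compared with the daily average of a little over 4GB.
print(round(busy, 2), round(idle, 2))
```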

Furthermore, vSphere Replication only looks at changed blocks, not how many times those blocks have changed, which can also lead to differently sized bundles from one replication to the next. If a virtual machine generates traffic in bursts, for example if it is very busy in one hour and then idle during the next, vSphere Replication might have to transfer a lot of blocks in one hour and none in the next. Moreover, if you use Volume Shadow Copy Service (VSS) to quiesce the virtual machine, the replication traffic cannot be spread out in small sets of bundles throughout the RPO period. Instead, vSphere Replication transfers all the changed blocks as one set, as determined when the virtual machine was idle. Without VSS, vSphere Replication can transfer smaller bundles of changed blocks on an ongoing basis as the blocks change, spreading the traffic throughout the RPO period. So, the traffic changes if you use VSS, and vSphere Replication handles the replication schedule differently, leading to varying traffic patterns.

Finally, if you change the RPO, vSphere Replication transfers more or less data per replication to meet the new RPO. This is why calculating the required replication bandwidth is dependent on both the data change rate and the RPO setting.

Link Speed

If you know that you have to transfer an average replication bundle of 4GB in a 1-hour period, you must examine the link speed to determine if the RPO can be met. A 10Mbps link, under ideal conditions on a completely dedicated link with little overhead, carries about 4.5GB per hour, so 4GB takes about an hour to transfer. In this case, meeting your RPO saturates a 10Mbps WAN connection. The connection is saturated even under ideal conditions, with no overhead or limiting factors such as retransmits, shared traffic, or excessive bursts of data change rates.

Realistically, you can assume that only 70% of a link will be available for replication traffic. This means that on a 10Mbps link you will obtain a throughput of about 3GB per hour, on a 100Mbps link you will obtain about 30GB per hour, and so on.
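The throughput figures above follow from a unit conversion, sketched below. The function name is illustrative, and the sketch assumes decimal units (1Mbps = 1,000,000 bits per second, 1GB = 1,000,000,000 bytes).

```python
def gb_per_hour(link_mbps, usable_fraction=0.7):
    """Convert a link speed in Mbps to usable throughput in GB per hour."""
    bits_per_hour = link_mbps * 1_000_000 * usable_fraction * 3600
    return bits_per_hour / 8 / 1_000_000_000


for mbps in (10, 100):
    # Roughly 3GB and 30GB per hour at 70% availability
    print(mbps, round(gb_per_hour(mbps), 2))

# Under ideal conditions (100% of the link) a 10Mbps link carries about 4.5GB per hour
print(round(gb_per_hour(10, usable_fraction=1.0), 2))
```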

Environment

VMware vSphere Replication 8.x | 9.x

Resolution

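To determine whether your network can support vSphere Replication, calculate the average amount of data that changes within an RPO period and divide it by the usable speed of your link. The result is the time that each replication takes, which must be shorter than the RPO period.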

Identify the average amount of data that changes within your RPO by measuring the average change rate over a longer period, such as a day, and dividing it by the number of RPO periods in that time.
Calculate how much traffic this data change rate generates in each RPO period.
Measure the traffic against your link speed.

For example, assuming that about 70% of the link is available for replication traffic, 100GB of changed data requires approximately 200 hours to replicate on a T1 network, 30 hours on a 10Mbps network, 3 hours on a 100Mbps network, and so on.
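The transfer times in this example can be reproduced with the sketch below. It assumes a T1 line rate of 1.544Mbps, 70% of the link available for replication traffic, and decimal units; the function name is illustrative.

```python
def transfer_hours(data_gb, link_mbps, usable_fraction=0.7):
    """Hours needed to transfer data_gb over a link of link_mbps."""
    bits = data_gb * 8 * 1_000_000_000
    seconds = bits / (link_mbps * 1_000_000 * usable_fraction)
    return seconds / 3600


links = {"T1": 1.544, "10Mbps": 10, "100Mbps": 100}
for name, mbps in links.items():
    # Roughly 206, 32, and 3 hours respectively for 100GB
    print(name, round(transfer_hours(100, mbps), 1))
```

Compare the result with the RPO: if a bundle takes longer to transfer than the RPO period allows, the RPO cannot be met on that link.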

If you have groups of virtual machines that have different RPO periods, for example if you have one group with an RPO of 15 minutes, one with an RPO of 1 hour, one with an RPO of 4 hours, and one with an RPO of 24 hours, you must calculate the replication time for each group of virtual machines. You must then examine how the RPO is met by the data change rate, traffic rates, and link speed, and then look at the aggregate of each group. You must factor in all the different RPOs in the environment, the subset of virtual machines in your environment that is protected, the change rate of the data within that subset, how much of that data changes within each configured RPO, and the link speeds in your network.
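One way to look at the aggregate is to compute, for each group, the fraction of its RPO period that the link spends transferring that group's bundle, and then add the fractions. The sketch below does this under simplifying assumptions: changes are spread evenly through the day, 70% of the link is usable, and the per-group daily change figures are hypothetical. A total above 1.0 means the link cannot keep up with all groups.

```python
from dataclasses import dataclass


@dataclass
class Group:
    name: str
    rpo_hours: float
    daily_change_gb: float  # hypothetical value, measure this in your environment


def usable_gb_per_hour(link_mbps, usable_fraction=0.7):
    return link_mbps * 1_000_000 * usable_fraction * 3600 / 8 / 1_000_000_000


def link_utilization(groups, link_mbps):
    """Return the fraction of link time each group consumes, keyed by name."""
    rate = usable_gb_per_hour(link_mbps)
    report = {}
    for g in groups:
        bundle_gb = g.daily_change_gb * g.rpo_hours / 24  # average bundle per RPO
        transfer_hours = bundle_gb / rate
        report[g.name] = transfer_hours / g.rpo_hours
    return report


groups = [
    Group("15 minutes", 0.25, 20),
    Group("1 hour", 1, 40),
    Group("4 hours", 4, 30),
    Group("24 hours", 24, 10),
]

for mbps in (10, 100):
    total = sum(link_utilization(groups, mbps).values())
    # About 1.32 on 10Mbps (RPOs cannot all be met), about 0.13 on 100Mbps
    print(mbps, round(total, 2))
```

Because the sketch uses daily averages, it does not capture bursts. A group with a short RPO and bursty changes can still miss its RPO even when the total is well below 1.0.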

Public Documentation is online at:

Bandwidth Requirements for vSphere Replication

Additional Information:

A vSphere Replication Capacity Planning Appliance is available for download from VMware Flings, which are community supported. For more information, see the Fling page of the vSphere Replication Capacity Planning Appliance.

For translated versions of this article, see:

日本語: vSphere Replication の帯域幅の要件の計算 (2096650)
简体中文: 计算 vSphere Replication 的带宽需求 (2102120)
