Troubleshooting vSphere Replication slowness and RPO violations

Article ID: 312689


Products

VMware Live Recovery

Issue/Introduction


This article explains the factors that affect vSphere Replication performance and helps you narrow down the cause of slow replication or RPO violations.


Symptoms:

1. Replications are slow 

2. Replications do not complete 

3. VMs show frequent RPO violation errors on the Replications tab of the SRM UI

Environment

VMware vSphere Replication 
VMware Live Site Recovery

Cause


1. vSphere Replication appliance configuration (CPU, Memory & networking)

2. Network configuration of vSphere, external switches, routers, firewall and WAN appliances 

3. Network performance is poor or inconsistent 

4. Poor replication bandwidth allocation between sites

5. Aggressive RPO settings

6. Storage latency 

7. Limited compute resources on hosts 

Resolution

 

Recovery time objective (RTO): The target time within which a business process must be restored after a disaster or disruption to avoid unacceptable consequences from a break in business continuity.

Recovery point objective (RPO): Maximum age of files recovered from backup storage for normal operations to resume if a system goes offline as a result of a hardware, program, or communications failure.

vSphere Replication is host-based replication. It is independent of the underlying storage and works with a variety of storage types, including vSAN, traditional SAN, NAS, and direct-attached storage (DAS).

After the initial full synchronization, changes to the protected virtual machine are tracked and replicated on a regular basis. The transmissions of these changes are referred to as “lightweight delta syncs.” Their frequency is determined by the RPO that was configured for the virtual machine. A lower RPO requires more-frequent replication.
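As a rough illustration of why a lower RPO needs more bandwidth, the sketch below estimates the average throughput required to ship the data changed during one RPO window within that window. It is a lower bound under simplifying assumptions (no protocol overhead, no compression, an evenly spread change rate). The function name and the input figures are hypothetical and are not taken from the vSphere Replication calculator.

```python
def required_mbps(changed_gb: float, rpo_minutes: float) -> float:
    """Average throughput (Mbit/s) needed to send changed_gb within one RPO window.

    Assumption: decimal units (1 GB = 10**9 bytes) and no overhead or compression.
    """
    if rpo_minutes <= 0:
        raise ValueError("rpo_minutes must be positive")
    changed_bits = changed_gb * 10**9 * 8      # gigabytes -> bits
    window_seconds = rpo_minutes * 60          # minutes -> seconds
    return changed_bits / window_seconds / 10**6


# Hypothetical example: the same 4.5 GB of changes under two RPO settings.
for rpo in (15, 60):
    print(f"RPO {rpo:>3} min -> at least {required_mbps(4.5, rpo):.1f} Mbit/s")
```

If the link between the sites cannot sustain the estimated rate for the replicated VMs combined, RPO violations are likely regardless of other tuning.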

Troubleshooting replication slowness or RPO violations is complex and requires a systematic approach. This article breaks the process into areas to check so that you can narrow down the cause. Gather as much information as possible about your environment, and involve the appropriate people to help diagnose and resolve the underlying issues.


Familiarize yourself with the following concepts.

Recovery Point Objective

Interpreting Replication Statistics for a Site

Bandwidth Requirements for vSphere Replication

Calculate Bandwidth For vSphere Replication

vSphere Replication Calculator

Best Practices for Using and Configuring vSphere Replication

Depending on the nature of the problem, you might need to upgrade the software listed below. Upgrading can improve replication performance or fix replication issues and other bugs described in the release notes. A major release is a version such as 8.8, whereas maintenance and patch releases are versions such as 8.8.0.1. When upgrading a VR or SRM appliance, upgrade to the major release first and then to the subsequent maintenance or patch release.

1. vCenter

2. ESXi hosts

3. vSphere Replication servers

4. Site Recovery Manager

NOTE: The version of the vSphere Replication appliance that you deploy or upgrade to must be compatible with the vSphere hypervisor (ESXi), so check the VMware Product Interoperability Matrix. Choose the VR version based on the lowest ESXi version in use at the production and recovery sites. Not conforming to the compatibility matrix can lead to problems with replication, recovery and reprotect.

Troubleshooting can be divided into four categories. Sometimes the errors in one category are enough to identify and address the issue; otherwise, gather information from all of them and evaluate it together to locate the problem.

1. ESXi host (Source & Target)

2. VRMS & VR Add-on server (Source & Target)

3. Storage

4. Networking 

 

ESXi host (Source & Target)

Before proceeding with the steps below, identify the following details -

1. The VMs that are displaying RPO violation errors

2. The host these VMs reside on

3. The replication server that is receiving the replication of this VM (Target VRMS or VR Add-on)

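Run the following commands on the source ESXi host to get this information. Run vim-cmd vmsvc/getallvms first to get the Vmid, then pass the Vmid to the hbrsvc commands -

vim-cmd vmsvc/getallvms
vim-cmd hbrsvc/vmreplica.getState <Vmid>
vim-cmd hbrsvc/vmreplica.getConfig <Vmid>
vim-cmd hbrsvc/vmreplica.queryReplicationState <Vmid>
vim-cmd hbrsvc/vmreplica.sync <Vmid>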

Example commands below -

vim-cmd hbrsvc/vmreplica.getState 31
Retrieve VM running replication state:
    The VM is configured for replication. Current replication state: Group: GID-ec46ed28-ce08-49e7-8fd8-8319e7db7d76 (generation=8793006158342481)
    Group State: inactive
        DiskID RDID-f0e41b61-62c8-480b-904e-84b04d5d1fe4 State: inactive

[root@ESXi67APR:~] vim-cmd hbrsvc/vmreplica.getConfig 31
Retrieve VM replication configuration:
    The VM is configured for replication with the following options:
        VM Replication ID = GID-ec46ed28-ce08-49e7-8fd8-8319e7db7d76
        Destination IP Address = 192.168.50.7
        Destination Port = 31031
        Recovery Point Objective = 1440
        Quiesce Guest OS = false
        Enable Opportunistic Updates = false
        Network Compression = false
        Network Encryption = false
        Paused for Replication = false
        Disk scsi0:0 is configured for replication:
            Device key = 2000
            Replication ID = RDID-f0e41b61-62c8-480b-904e-84b04d5d1fe4

[root@ESXi67APR:~] vim-cmd hbrsvc/vmreplica.queryReplicationState 31
Querying VM running replication state:
Current replication state:
    State: idle

NOTE: State changes from idle to syncing when data transfer begins.
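When you collect this output from several hosts, it can help to turn the getConfig text into key/value pairs for comparison. The following is a minimal sketch that parses saved output offline. The function name is hypothetical, and it assumes the "key = value" layout shown in the example above; with more than one replicated disk it keeps only the first value seen for repeated keys.

```python
def parse_replica_config(text: str) -> dict:
    """Collect 'key = value' lines from saved vmreplica.getConfig output."""
    config = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:  # skip headings and lines without ' = '
            # setdefault keeps the first occurrence of a repeated key
            config.setdefault(key.strip(), value.strip())
    return config


# Text copied from the example output in this article.
SAMPLE = """\
    The VM is configured for replication with the following options:
        VM Replication ID = GID-ec46ed28-ce08-49e7-8fd8-8319e7db7d76
        Destination IP Address = 192.168.50.7
        Destination Port = 31031
        Recovery Point Objective = 1440
        Network Compression = false
        Network Encryption = false
"""

cfg = parse_replica_config(SAMPLE)
print(cfg["Destination IP Address"], cfg["Destination Port"])
print("Compression enabled:", cfg["Network Compression"] == "true")
```

The destination address and port from this output are the values to use later when testing connectivity from the source host.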

Run the Sync Now operation from the Replications tab of the SRM UI, or run the command below, before you collect logs from the source host and the target replication appliance. This makes the source host attempt to send replication traffic to the target replication appliance, so that any resulting errors are logged on both sides and the problem can be analyzed accurately.

[root@ESXi67APR:~] vim-cmd hbrsvc/vmreplica.sync 31
Force a replica synchronzation for the VM:

GID > Group ID
RDID > Replication Disk ID (Each VMDK has a unique RDID)

NOTE: The GID and RDID values change every time you remove a VM from replication and re-add it. If you remove and re-add a VM during troubleshooting, collect the logs again afterwards so that they match the new GID and RDID values.

After collecting the information above, search the hostd, vmkernel and vmkwarning logs on the source host for the GID and RDID values, and then check the hms and hbrsrv logs on the target replication server.
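One way to narrow large logs down to the lines for a single replication is to filter on the GID and RDID values. The sketch below does this on any iterable of log lines, for example a file opened from an extracted log bundle. The function name is hypothetical, and the sample lines are placeholders made up for illustration, not real ESXi log output.

```python
from typing import Iterable, Iterator


def matching_lines(lines: Iterable[str], ids: Iterable[str]) -> Iterator[str]:
    """Yield the lines that mention any of the given GID/RDID values."""
    wanted = tuple(ids)
    for line in lines:
        if any(identifier in line for identifier in wanted):
            yield line.rstrip("\n")


GID = "GID-ec46ed28-ce08-49e7-8fd8-8319e7db7d76"
RDID = "RDID-f0e41b61-62c8-480b-904e-84b04d5d1fe4"

# Placeholder lines (not real log output) standing in for a log file.
SAMPLE_LINES = [
    f"placeholder entry mentioning {GID}\n",
    "placeholder entry about an unrelated VM\n",
    f"placeholder entry mentioning {RDID}\n",
]

for hit in matching_lines(SAMPLE_LINES, [GID, RDID]):
    print(hit)
```

To use it on a real file, pass an open file object instead of SAMPLE_LINES, for example the vmkernel log copied from the source host.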

Look at Solutions for Common vSphere Replication Problems to identify trending problems with the version of VR you are troubleshooting and apply those fixes. 



VRMS & VR Add-on server (Source & Target)

After investigating the source host logs, we need to check the VRMS or VR Add-on server logs to find out if the replication appliance is having any issues in managing the replication of the VM in question or processing replication traffic in general.

If a VM is being replicated by VR Add-on server, we need to collect both VRMS & VR Add-on server logs for analysis and check hms & hbrsrv logs.

NOTE: hms logs are only found in VRMS whereas VR Add-on server only has hbrsrv logs.

HMS  > HBR management system
hbrsrv > Host-based replication server

Right-sizing your vSphere Replication appliance: If the VR appliance is overloaded or under-resourced, replication performance can suffer. Consider increasing the CPU or memory allocated to the appliance. You can also deploy additional vSphere Replication servers to balance the load, subject to your other design and infrastructure requirements. Consult an SRM engineer if you are not sure whether you need one.

Large deployments require increasing vSphere Replication Server memory (312759)

Slow Replication Performance on vSphere Replication Virtual Machines with 4 vCPU (341171)



Storage

RPO violations can also be caused by the storage infrastructure involved in replication. Identify storage-related issues, such as high latency or I/O contention, that might be affecting replication performance. Check for these issues on both the source and target datastores where the replicated VMs reside, and on the arrays in general. When the ESXi hosts at the target site have difficulty committing data to the datastore because of high storage latency, RPO violations also result. Consider optimizing the storage configuration or upgrading to faster storage devices if necessary.

Every virtual machine in a datastore generates regular read and write operations. Configuring vSphere Replication on those virtual machines adds another read operation to the regular read and write operations, which increases the I/O load on storage. The performance of vSphere Replication depends on the I/O load of the virtual machines that you replicate and on the capabilities of the storage hardware. If the load generated by the virtual machines, combined with the extra I/O operations that vSphere Replication introduces, exceeds the capabilities of your storage hardware, you might experience slow response times.

Identify storage problems with the source and target datastores & fix them.

Using esxtop to identify storage performance issues for ESX / ESXi (multiple versions) (344099)

Troubleshooting ESX/ESXi virtual machine performance issues (304594)

If there are too many replicated VMs residing in a datastore, consider spreading them across a few different datastores to balance the load. This will help in reducing latency. When running vSphere Replication, if response times are greater than 30 ms, reduce the number of virtual machines that you replicate to the datastore. Alternatively, increase the capabilities of your hardware.

When using vSphere Replication, if you constantly see latency of 20-30 ms or higher (the benchmark latency can vary depending on the storage array in use), consider the following on the source and target datastores -

1. Decrease the number of VMs being replicated from the datastores.
2. Spread the replicated VM load across multiple datastores.
3. Have two or more hosts at the recovery site for resiliency, recovery, and handling replication traffic load via NFC.
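To separate an occasional spike from latency that is constantly high, you can check whether several consecutive samples exceed the threshold. The sketch below is one minimal way to do that on latency values you have already collected (for example, exported from esxtop). The 30 ms default comes from the guidance above; the function name, the "three consecutive samples" rule and the sample values are assumptions for illustration.

```python
def sustained_high_latency(samples_ms, threshold_ms=30.0, min_consecutive=3):
    """Return True if at least min_consecutive samples in a row exceed threshold_ms."""
    run = 0
    for sample in samples_ms:
        run = run + 1 if sample > threshold_ms else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical samples in milliseconds.
isolated_spikes = [5, 42, 8, 35, 6]
sustained = [12, 31, 45, 38, 9]

print(sustained_high_latency(isolated_spikes))
print(sustained_high_latency(sustained))
```

Adjust the threshold to the benchmark latency of your storage array before drawing conclusions.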




Networking

A 5-minute RPO can be applied to a maximum of 500 VMs on VMFS 6.0, VMFS 5.x, NFS 4.1, NFS 3, and vSAN 6.2 Update 3 and later storage. The maximum for a vVol datastore is 50 VMs. The maximum number of replications with a 5-minute RPO can vary, depending on the network bandwidth and the change rates per disk.

RPO violations can also follow networking changes or misconfiguration in the environment. Check for the following -

1. vSphere replication management IP address is changed

2. IP Address for Incoming Storage Traffic in VR VAMI is changed

3. Static routes are not implemented within the VR appliance. Refer to the KB - Isolating the Network Traffic of vSphere Replication (312753)

4. Layer 3 routing between the sites is wrong, or the shortest path is not chosen, so replication traffic takes a longer route through other regions.

5. Check for any bottlenecks in the environment such as slow network or storage devices that could be affecting replication performance.

6. Verify the RPO value set in the vSphere Replication policy. If the RPO value is too low (aggressive), consider increasing it to allow more time for replication to complete. Depending on the application running on the VM, you might have to tune the RPO value over time to suit the needs of the VM and its applications. One RPO value does not suit all VMs, because each VM has a different read/write IOPS profile, and the frequency of data changes affects whether the RPO can be met.

7. Consider WAN optimization techniques: If you are replicating across a wide area network (WAN), consider implementing WAN optimization techniques such as data compression, deduplication, or traffic shaping. These techniques can help reduce the amount of data transferred over the network and improve replication performance.

8. Verify whether traffic on ports 31031 and 32032 has been given a low priority; QoS could be throttling it.

9. Try restarting the services on the WAN appliances, or reboot them, and check whether replication speeds improve. WAN links often carry other traffic besides replication, and QoS may be configured to prioritize some traffic over other traffic. If this is the case, give replication traffic high priority for a day over a weekend and check whether replication speed improves.

10. Ensure that all ports required for replication traffic are open on the firewalls, including NSX (if in use)

11. Ensure that the network connection between the source and target sites is stable and has sufficient bandwidth. Check for any network congestion or packet loss that might be affecting replication performance.

12. If the replication bandwidth between sites is low, enabling network compression for VR data will help in saving network bandwidth.

13. Check whether IDS/IPS is enabled on the firewall and is filtering replication traffic packets, which could slow down replication or prevent it from completing.

14. Check whether you can physically bypass the firewall at the target site and connect the link directly to the core switch. If replication performance improves, this indicates where the traffic might be getting held up.

What is IDS and IPS?

Intrusion detection is the process of monitoring your network traffic and analyzing it for signs of possible intrusions, such as exploit attempts and incidents that may be imminent threats to your network. For its part, intrusion prevention is the process of performing intrusion detection and then stopping the detected incidents, typically done by dropping packets or terminating sessions. These security measures are available as intrusion detection systems (IDS) and intrusion prevention systems (IPS), which are part of network security measures taken to detect and stop potential incidents.

Test the port connectivity from the source ESXi host to the target VR appliance -

nc -zv <Target VR Appliance IP address> 31031

Example: nc -v -w 2 -z 10.193.19.213 31031

Run nc -help for more information

31031 - Replication traffic without network encryption.
32032 - Replication traffic with network encryption.
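If nc is not available on the machine you are testing from, the same kind of TCP reachability check can be sketched with Python's standard socket module. This is only an illustration of what the nc -z test does (attempt a TCP connection and report success or failure). The function name is hypothetical, and a successful connection shows only that the port is reachable, not that replication is healthy.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, or name resolution failed
        return False
```

For example, port_open("10.193.19.213", 31031) mirrors the nc example above; use 32032 instead when network encryption is enabled for the replication.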

Services, Ports, and External Interfaces That the vSphere Replication Virtual Appliance Uses

Identify the vmnics on the ESXi host that carry replication traffic, and run the command below to check the NIC statistics for each one -

esxcli network nic stats get -n <vmnic>

Example:

[root@ESXi67APR:~] esxcli network nic stats get -n vmnic0
NIC statistics for vmnic0
  Packets received: 143496373
  Packets sent: 120923042
  Bytes received: 54913464435
  Bytes sent: 14446916301
  Receive packets dropped: 0
  Transmit packets dropped: 0

  Multicast packets received: 290923
  Broadcast packets received: 457143
  Multicast packets sent: 36480
  Broadcast packets sent: 36938
  Total receive errors: 0
  Receive length errors: 0
  Receive over errors: 0
  Receive CRC errors: 0
  Receive frame errors: 0
  Receive FIFO errors: 0
  Receive missed errors: 0
  Total transmit errors: 0
  Transmit aborted errors: 0
  Transmit carrier errors: 0
  Transmit FIFO errors: 0

  Transmit heartbeat errors: 0
  Transmit window errors: 0

If the error or dropped-packet counters above are increasing, open a case with the vSphere networking team. A NIC driver or firmware upgrade may be required to resolve the issue. Afterwards, check whether replication performance or RPO compliance improves.
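When checking many vmnics, it can be quicker to filter saved output for counters that are not zero. The sketch below parses text in the "name: value" layout shown above and returns only the non-zero error and dropped-packet counters. The function name is hypothetical, and the sample values are invented for illustration (the example output in this article shows all zeros).

```python
def nonzero_problem_counters(text: str) -> dict:
    """Return error/dropped counters with a value above zero from saved nic stats output."""
    problems = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        name, value = name.strip(), value.strip()
        if not sep or not value.isdigit():
            continue  # heading or blank line
        is_problem_counter = "error" in name.lower() or "dropped" in name.lower()
        if is_problem_counter and int(value) > 0:
            problems[name] = int(value)
    return problems


# Hypothetical values, using counter names from the example output.
SAMPLE = """\
NIC statistics for vmnic0
  Packets received: 143496373
  Receive packets dropped: 12
  Transmit packets dropped: 0
  Total receive errors: 3
  Receive CRC errors: 3
  Total transmit errors: 0
"""

print(nonzero_problem_counters(SAMPLE))
```

A single snapshot shows totals since the counters were last reset, so compare two snapshots taken some time apart to see whether the values are still increasing.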

Look for NFC & vmnic errors in the hostd, VMkernel & vobd logs for further clues to troubleshoot networking issues.

Commands to perform packet captures:

pktcap-uw --vmk vmk2 --dir 2 --ip <vr-ip> -o ./InandOut.pcap
pktcap-uw --uplink vmnic0 --dir 2 --ip <vr-ip> -o ./InandOut.pcap

Using the pktcap-uw tool in ESXi 5.5 and later (341568)

NOTE: In vSphere 6.5 and earlier, specify the direction of traffic using --dir 0 for inbound and --dir 1 for outbound. You can’t specify traffic going both ways at the same time. However, in vSphere 6.7 and later, you can specify the direction of traffic using --dir 0 for inbound, --dir 1 for outbound, or --dir 2 for both.

To cancel the pcap press: Ctrl + c

To kill all instances of pktcap-uw: kill $(lsof | grep pktcap-uw |awk '{print $1}'| sort -u)  

To verify that all pktcap-uw traces are stopped: lsof | grep pktcap-uw |awk '{print $1}'| sort -u

Ensure that the MTU is configured uniformly across all networking devices that support it between the sites, including vSphere switches, ESXi hosts and the vSphere Replication appliance.

Testing VMkernel network connectivity with the vmkping command (344313)

Example: Testing using vmkping commands.

A. vmkping -I vmk2 -d -s 8972 Target-VR_IP (Use this command to test with 9000 MTU (Jumbo frames))
B. vmkping -I vmk2 -d -s 1472 Target-VR_IP (Use this command to test with 1500 MTU)
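The -s values in these commands are the MTU minus the packet headers: with -d (do not fragment) set, the ICMP payload plus the 8-byte ICMP header and the 20-byte IPv4 header must fit within the MTU. The short sketch below shows the arithmetic so that you can derive the size for other MTU values. The function name is hypothetical, and it assumes IPv4 without IP options.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options (assumption)
ICMP_HEADER_BYTES = 8   # ICMP echo header


def vmkping_payload(mtu: int) -> int:
    """Largest ICMP payload (the vmkping -s value) that fits in one unfragmented packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


for mtu in (1500, 9000):
    print(f"MTU {mtu}: vmkping -s {vmkping_payload(mtu)}")
```

If a test at a given size fails with -d set, some device on the path has a smaller MTU than the one you tested.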

NOTE:

vSphere Replication uses an MTU (maximum transmission unit) of 1500 by default. An MTU of 1500 may not be achievable on a WAN that uses VPN tunnels, IPsec encryption, overlay protocols or firewalls configured with an MTU that does not match the MTU set within the datacenter. Therefore, the vmkping test may pass or fail, but its result should not be treated as a direct indicator of the problem until you have explored all other possibilities. Try different MTU sizes between 1500 and 9000 and check whether you can communicate with the target VR.

Jumbo frames are Ethernet frames that carry a payload larger than the typical 1,500-byte Ethernet MTU. Jumbo frames need to be configured on the ingress and egress interface of each device along the end-to-end transmission path. Furthermore, all devices in the topology must agree on the maximum jumbo frame size. If devices along the transmission path have varying frame sizes, you can end up with fragmentation problems. Also, if a device along the path does not support jumbo frames and receives one, it will drop it.

Problems with MTU size reduction due to tunnels, IPsec encryption, and overlay protocols can degrade network performance. If you are using encapsulation technologies, then you should consider increasing the MTU size, particularly in the core of the network or WAN to avoid fragmentation and Path Maximum Transmission Unit Discovery (PMTUD) issues.

Jumbo frames can improve your network's performance. However, it is important to confirm whether and how your network devices support jumbo frames before you turn this feature on. Some of the biggest gains from jumbo frames are realized within and between data centers, but be aware of the fragmentation that may occur if those large frames cross a link that has a smaller MTU size.

Additional Information

VMware Site Recovery Manager - https://core.vmware.com/vmware-site-recovery-manager

vSphere Replication RPO Violations

Anti-Virus Agent in Firewall Stops Virtual Machine Replication

vSphere Replication Operations Run Slowly as the Number of Replications Increases

How to use iPerf to test bandwidth between source host and target vSphere Replication appliance (312678)


Impact/Risks:

vSphere Replication RPO (Recovery Point Objective) violations occur when the time interval between each replicated copy of virtual machines exceeds the predefined RPO limit. The RPO is the maximum acceptable period of data loss after a disaster occurs, and it is essential to ensure business continuity and avoid data loss. When the vSphere Replication RPO limit is exceeded, it may result in data loss, inconsistent or inaccurate records, and failure to restore the virtual machine to its previous state.

vSphere Replication RPO violations can lead to significant data loss and business disruption if not addressed promptly. Therefore, careful planning, monitoring, and adjustment of replication settings are essential to ensure RPO compliance and maintain business continuity.