Virtual machine replication is slow using vSphere Replication

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

Virtual machine replication is taking a long time to finish with no error displayed.

All VMs configured for replication are slow, regardless of whether legacy replication or enhanced replication is used.
If a recovery plan is executed while a virtual machine is still synchronizing, the recovery plan may fail because the replication has not completed.

Validation:

The /var/log/vmware/hbrsrv.log file on the target vSphere Replication (VR) appliance indicates a network-related issue:

2025-01-28T12:16:30.108+05:30 verbose hbrsrv[2748298] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: serverStats, HbrServer. Applied change to temp map.
2025-01-28T12:16:32.365+05:30 info hbrsrv[2748277] [Originator@6876 sub=Delta] ClientConnection (ClientCnx '[Y.Y.Y.Y]:55936' id=3 <shut>) is stopping ...
2025-01-28T12:16:32.365+05:30 info hbrsrv[2748277] [Originator@6876 sub=Asio] Closing LWD ASIO -> (plain text)
2025-01-28T12:16:32.365+05:30 info hbrsrv[2748277] [Originator@6876 sub=Delta] HbrSrv cleaning out ClientConnection ([Y.Y.Y.Y]:55936)
2025-01-28T12:16:32.365+05:30 error hbrsrv[2748277] [Originator@6876 sub=Main] HbrError stack:
2025-01-28T12:16:32.365+05:30 error hbrsrv[2748277] [Originator@6876 sub=Main] [0] ClientConnection (client=[Y.Y.Y.Y]:55936) request callback failed: Connection reset by peer: The connection is terminated by the remote end with a reset packet. Usually, this is a sign of a network problem, timeout, or service overload.
2025-01-28T12:16:32.365+05:30 error hbrsrv[2748277] [Originator@6876 sub=Main] [1] Dropping error encountered from network

(Note: In the logs above, Y.Y.Y.Y is the IP address of the source ESXi VMkernel adapter on which replication traffic is enabled.)
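
To gauge how often these resets occur and which source hosts are affected, the log can be summarized with standard shell tools. The following is a minimal sketch, not a supported VMware utility: the function name count_resets is hypothetical, and the pattern assumes the error lines have the same "client=[address]:port" format as the excerpt above.

```shell
# Count "Connection reset by peer" errors per client address in an hbrsrv log.
# count_resets is a hypothetical helper name, not part of vSphere Replication.
# Usage: count_resets [logfile]   (defaults to the log path named in this article)
count_resets() {
    log="${1:-/var/log/vmware/hbrsrv.log}"
    # Keep only the reset errors, extract the address between "client=[" and "]",
    # then count occurrences per address, most frequent first.
    grep 'Connection reset by peer' "$log" \
        | sed -n 's/.*client=\[\([^]]*\)].*/\1/p' \
        | sort | uniq -c | sort -rn
}
```

Running count_resets /var/log/vmware/hbrsrv.log on the target VR appliance prints one line per source address, prefixed by the number of resets logged for it. A count concentrated on a single address suggests a problem on the path from that host, while similar counts across all addresses point toward the inter-site link or the target side.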

Environment

vSphere Replication 8.x
vSphere Replication 9.x

Cause

This can be caused by any of the following:

Insufficient bandwidth allocated to replication between sites.
Misconfiguration of external switches, routers, firewalls, or WAN appliances.
Poor or inconsistent network performance.

Resolution

Engage the physical switch vendor to identify network bottlenecks affecting replication performance. Ensure stable connectivity with sufficient bandwidth between the source and target sites, and check for congestion or packet loss.

To perform packet captures, follow this article: Using the pktcap-uw tool in ESXi 5.5 and later (341568)

The following commands can be used to capture packets:

To capture packets on the uplink vmnic of the source ESXi host where the VM is running (replace vmnic with the actual uplink name, for example vmnic0):

# pktcap-uw --uplink vmnic --dir 2 -o /vmfs/volumes/Datastore_name/vmnic.pcap

To capture packets on the VMkernel interface used for replication traffic (replace vmk0 with the VMkernel adapter that carries replication traffic in your environment):

# pktcap-uw --vmk vmk0 --dir 2 -o /vmfs/volumes/Datastore_name/vmk0.pcap

To capture packets on the network adapter of the DR site replication appliance:

# tcpdump -i eth0 -w /tmp/eth0.pcap

The physical switch vendor can analyze these packet captures for potential network issues such as out-of-order packets or TCP reset events.