Incremental Sync Takes Excessive Time to Complete When VM is Configured with Enhanced Replication

Article ID: 398107


Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

  • Incremental synchronization takes significantly longer than expected to complete
  • VMs configured with Enhanced Replication show increased lag during synchronization

Validation Steps:

Validate the status of the VM's replication using the commands below on the source ESXi host where the VM resides; a combined sketch of these checks follows the example output.

  1. Get the VM ID:

     vim-cmd vmsvc/getallvms  (make a note of the VMID from this command's output)
     
  2. Check the replication state:

    vim-cmd hbrsvc/vmreplica.getState VMID

  3. Check the replication progress:

           [root@auh--esxi-:/vmfs/volumes/64abe83a-####-c40d-#####/log] vim-cmd hbrsvc/vmreplica.queryReplicationState 31
           Querying VM running replication state:
           Current replication state:
           State: active
           Instance ID: replica-####-ec05-46c6-###-#####
           Progress: 0% (transfer: 5195923456/585975693312)
           [root@auh-vxr-esxi-05:/vmfs/volumes/64abe83a-####-###-5c6f69dcf2f0/log]
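
The individual steps above can be combined into a short shell sketch run on the source ESXi host. The VMID value of 31 is only the example taken from the output above; replace it with the ID reported by vim-cmd vmsvc/getallvms for the affected VM.

           # List all registered VMs and note the VMID (first column) of the affected VM
           vim-cmd vmsvc/getallvms

           # Replace 31 with the VMID noted above
           VMID=31

           # Replication configuration and state for the VM
           vim-cmd hbrsvc/vmreplica.getState $VMID

           # Live replication progress (bytes transferred / total bytes)
           vim-cmd hbrsvc/vmreplica.queryReplicationState $VMID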

Review hbr-agent.log on the source ESXi host and check for entries similar to the following; a quick filter command follows the sample entries.

   Log path on the source ESXi host: /var/run/log/hbr-agent.log (view with: less /var/run/log/hbr-agent.log)


         2025-04-25T08:17:30.534Z In(166) hbr-agent-bin[2103093]: [0x000000e7c7d5c700] info: [Proxy [Group: GID-###-f27b-###-a033-####5457c6] -> [172.##.251.##:32032]] TCP Connect latency was 4482µs
         2025-04-25T08:22:41.571Z In(166) hbr-agent-bin[2103093]: [0x000000e7c7bd9700] error: [Proxy [Group: GID-####aa-f27b-###-a033-####457c6] -> [##.18.##.#15:32032]] Failed to read from server: End of file
         2025-04-25T08:24:25.039Z In(166) hbr-agent-bin[2103093]: [0x000000e7c7cdb700] info: [Proxy [Group: GID-####aa-f27b-####-a033-######c6] -> [1##.18.2##.54:32032]] Setting up secure tunnel to brokered server 1##.##.##1.1##:32032 (1 of 1)
         2025-04-25T08:24:25.039Z In(166) hbr-agent-bin[2103093]: [0x000000e7c7cdb700] info: [Proxy [Group: GID-####a-f27b-4e8e-a###-####7c6] -> [172.##.##.##:32032]] Bound to vmk: vmk2 for connection to 1##.18.2##.1##:32032
         2025-04-25T08:24:25.042Z In(166) hbr-agent-bin[2103093]: [0x000000e7c7bd9700] info: [Proxy [Group: GID-ab####a-f27b-###-a033-######57c6] -> [172.##.####.###:32032]] TCP Connect latency was 3160µs
         2025-04-25T08:24:41.407Z In(166) hbr-agent-bin[2103093]: [0x000000e7c7bd9700] error: [Proxy [Group: GID-###a-f27b-4e8e-###-####57c6] -> [172.##.###.##:32032]] Failed to read from client: Connection reset by peer
         2025-04-25T08:24:41.407Z In(166) hbr-agent-bin[2103093]: [0x000000e7c7c5a700] error: [Proxy [Group: GID-a####aa-f27b###-a033###57c6] -> [172.##.2##.1##:32032]] Failed to read from server: Operation canceled
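
As a quick check, the relevant proxy errors can be filtered out of the log on the source ESXi host; the grep patterns below simply match the error messages shown above.

         # Show only the proxy read failures logged by the hbr agent
         grep -E "Failed to read from (server|client)" /var/run/log/hbr-agent.log

         # Count the failures to gauge how often the replication connection drops
         grep -c "Failed to read from" /var/run/log/hbr-agent.log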

Environment

VMware Live Recovery 9.0.2 

 

Cause

  • From the source ESXi host, pings over vmk2 with an MTU of 1500 (payload size 1472) show 100% packet loss.

  • However, the same test with a reduced payload size of 1072 completes successfully with no packet loss. This indicates a problem on the network path that causes the slowness when replication data is sent with MTU 1500; the payload arithmetic and test commands are sketched below.
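
For reference, the payload sizes above follow directly from the ICMP and IP header overhead. The sketch below uses vmk2 and the values from this case; <destination_ip> is a placeholder for the destination replication address.

    # MTU 1500 = 20-byte IP header + 8-byte ICMP header + 1472-byte payload
    # MTU 1100 = 20-byte IP header + 8-byte ICMP header + 1072-byte payload
    # -d sets the "don't fragment" bit, so any hop with a smaller MTU drops the frame
    vmkping -I vmk2 -d -s 1472 <destination_ip>   # full-size frame: dropped on this path
    vmkping -I vmk2 -d -s 1072 <destination_ip>   # reduced payload: succeeds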

Cause Validation

Network communication between the source and destination ESXi hosts over port 32032 (used by vSphere Replication) is broken when using MTU 1500.

 
[root@auh-vxr-esxi:/vmfs/volumes/64abe83a-####-###-5c6f69d####/log] vmkping -I vmk2 1##.##.##.### -d -s 1472
      PING 1#2.##.##.##5 (1##.##.2##.##5): 1472 data bytes
      --- 172.18.251.115 ping statistics ---
      3 packets transmitted, 0 packets received, 100% packet loss

  • Successful connectivity is observed only when the payload size is reduced, which suggests fragmentation or intermediate network device issues (e.g., a firewall, switch, or load balancer).

     
    The same ping works fine with a payload size of 1072:

    [root@auh-esxi:/vmfs/volumes/64abe83a-####-c40d-######/log] vmkping -I vmk2 ###.##.##.## -d -s 1072
    PING 1##.18.##1.##5 (1##.##.##1.##5): 1072 data bytes
    1080 bytes from ##2.18.##1.##5: icmp_seq=0 ttl=60 time=3.626 ms
    1080 bytes from ##2.18.##1.##5: icmp_seq=1 ttl=60 time=5.265 ms
    1080 bytes from ##2.18.##1.##5: icmp_seq=2 ttl=60 time=4.380 ms

    --- 172.18.251.115 ping statistics ---
    3 packets transmitted, 3 packets received, 0% packet loss
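
In addition to the MTU tests above, basic TCP reachability of the replication port can be confirmed from the source host. This is a sketch using the nc utility available in the ESXi shell; <destination_ip> is a placeholder for the target host.

    # Verify that TCP port 32032 (vSphere Replication traffic) is reachable
    nc -z <destination_ip> 32032 && echo "port 32032 reachable" || echo "port 32032 blocked"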

Resolution

  • Coordinate with the network team to investigate and resolve the MTU mismatch or path MTU discovery issue between the source and destination ESXi hosts.

  • Specifically, determine why packets using MTU 1500 (payload size 1472) are being dropped or not routed correctly.

  • Check for potential misconfigurations, MTU limitations, or faulty intermediate devices (e.g., switches, routers, or firewalls) affecting the replication traffic path.

  • Ensure consistent MTU settings end-to-end and enable jumbo frame support if required for optimal replication performance (example verification commands are sketched below).
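
A sketch of commands that can help verify the configured MTU values on each ESXi host; the vmk2 interface and payload size 1472 are taken from this case and may differ in other environments.

    # MTU configured on each VMkernel interface (vmk2 carries the replication traffic here)
    esxcli network ip interface list

    # MTU configured on standard vSwitches
    esxcli network vswitch standard list

    # MTU configured on distributed switches, if used
    esxcli network vswitch dvs vmware list

    # Re-test the path with a full-size, non-fragmented frame after any change
    vmkping -I vmk2 -d -s 1472 <destination_ip>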