Incremental Sync Takes Excessive Time to Complete When VM is Configured with Enhanced Replication

Article ID: 398107


Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

  • Incremental synchronization takes significantly longer than expected to complete
  • VMs configured with Enhanced Replication show increased lag during synchronization

Validation Steps:

Validate the status of the VM's replication using the commands below on the source ESXi host where the VM resides; a combined sketch of these checks follows the example output.

  1. Get the VM ID:

     vim-cmd vmsvc/getallvms  (make a note of the VMID from this command's output)
     
  2. Check the replication state:

    vim-cmd hbrsvc/vmreplica.getState VMID

  3. Check the replication progress:

           [root@auh--esxi-:/vmfs/volumes/64abe83a-####-c40d-#####/log] vim-cmd hbrsvc/vmreplica.queryReplicationState 31
           Querying VM running replication state:
           Current replication state:
           State: active
           Instance ID: replica-####-ec05-46c6-###-#####
           Progress: 0% (transfer: 5195923456/585975693312)
           [root@auh-vxr-esxi-05:/vmfs/volumes/64abe83a-####-###-5c6f69dcf2f0/log]
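
The individual steps above can be combined into a short shell sketch run on the source ESXi host. The VMID value of 31 is only the example taken from the output above; replace it with the ID reported by vim-cmd vmsvc/getallvms for the affected VM.

           # List all registered VMs and note the VMID (first column) of the affected VM
           vim-cmd vmsvc/getallvms

           # Replace 31 with the VMID noted above
           VMID=31

           # Replication configuration and state for the VM
           vim-cmd hbrsvc/vmreplica.getState $VMID

           # Live replication progress (bytes transferred / total bytes)
           vim-cmd hbrsvc/vmreplica.queryReplicationState $VMID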

Review hbr-agent.log on the source ESXi host and check for entries similar to the following; a quick filter command follows the sample entries.

   Log path on the source ESXi host: /var/run/log/hbr-agent.log (view with: less /var/run/log/hbr-agent.log)


         2025-04-25T08:17:30.534Z In(166) hbr-agent-bin[2103093]: [0x000000e7c7d5c700] info: [Proxy [Group: GID-###-f27b-###-a033-####5457c6] -> [172.##.251.##:32032]] TCP Connect latency was 4482µs
         2025-04-25T08:22:41.571Z In(166) hbr-agent-bin[2103093]: [0x000000e7c7bd9700] error: [Proxy [Group: GID-####aa-f27b-###-a033-####457c6] -> [##.18.##.#15:32032]] Failed to read from server: End of file
         2025-04-25T08:24:25.039Z In(166) hbr-agent-bin[2103093]: [0x000000e7c7cdb700] info: [Proxy [Group: GID-####aa-f27b-####-a033-######c6] -> [1##.18.2##.54:32032]] Setting up secure tunnel to brokered server 1##.##.##1.1##:32032 (1 of 1)
         2025-04-25T08:24:25.039Z In(166) hbr-agent-bin[2103093]: [0x000000e7c7cdb700] info: [Proxy [Group: GID-####a-f27b-4e8e-a###-####7c6] -> [172.##.##.##:32032]] Bound to vmk: vmk2 for connection to 1##.18.2##.1##:32032
         2025-04-25T08:24:25.042Z In(166) hbr-agent-bin[2103093]: [0x000000e7c7bd9700] info: [Proxy [Group: GID-ab####a-f27b-###-a033-######57c6] -> [172.##.####.###:32032]] TCP Connect latency was 3160µs
         2025-04-25T08:24:41.407Z In(166) hbr-agent-bin[2103093]: [0x000000e7c7bd9700] error: [Proxy [Group: GID-###a-f27b-4e8e-###-####57c6] -> [172.##.###.##:32032]] Failed to read from client: Connection reset by peer
         2025-04-25T08:24:41.407Z In(166) hbr-agent-bin[2103093]: [0x000000e7c7c5a700] error: [Proxy [Group: GID-a####aa-f27b###-a033###57c6] -> [172.##.2##.1##:32032]] Failed to read from server: Operation canceled
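
As a quick check, the relevant proxy errors can be filtered out of the log on the source ESXi host; the grep patterns below simply match the error messages shown above.

         # Show only the proxy read failures logged by the hbr agent
         grep -E "Failed to read from (server|client)" /var/run/log/hbr-agent.log

         # Count the failures to gauge how often the replication connection drops
         grep -c "Failed to read from" /var/run/log/hbr-agent.log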

Environment

VMware Live Recovery 9.0.2 

 

Cause

  • From the source ESXi host, pings over vmk2 with an MTU of 1500 (payload size 1472) show 100% packet loss.

  • However, the same test with a reduced payload size of 1072 completes successfully with no packet loss. This indicates a problem on the network path that causes the slowness when replication data is sent with MTU 1500; the payload arithmetic and test commands are sketched below.
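
For reference, the payload sizes above follow directly from the ICMP and IP header overhead. The sketch below uses vmk2 and the values from this case; <destination_ip> is a placeholder for the destination replication address.

    # MTU 1500 = 20-byte IP header + 8-byte ICMP header + 1472-byte payload
    # MTU 1100 = 20-byte IP header + 8-byte ICMP header + 1072-byte payload
    # -d sets the "don't fragment" bit, so any hop with a smaller MTU drops the frame
    vmkping -I vmk2 -d -s 1472 <destination_ip>   # full-size frame: dropped on this path
    vmkping -I vmk2 -d -s 1072 <destination_ip>   # reduced payload: succeeds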

Cause Validation

Network communication between the source and destination ESXi hosts over port 32032 (used by vSphere Replication) is broken when using MTU 1500.

 
[root@auh-vxr-esxi:/vmfs/volumes/64abe83a-####-###-5c6f69d####/log] vmkping -I vmk2 1##.##.##.### -d -s 1472
      PING 1#2.##.##.##5 (1##.##.2##.##5): 1472 data bytes
      --- 172.18.251.115 ping statistics ---
      3 packets transmitted, 0 packets received, 100% packet loss

  • Successful connectivity is observed only when the payload size is reduced, which suggests fragmentation or intermediate network device issues (e.g., a firewall, switch, or load balancer).

     
    The same ping works fine with a payload size of 1072:

    [root@auh-esxi:/vmfs/volumes/64abe83a-####-c40d-######/log] vmkping -I vmk2 ###.##.##.## -d -s 1072
    PING 1##.18.##1.##5 (1##.##.##1.##5): 1072 data bytes
    1080 bytes from ##2.18.##1.##5: icmp_seq=0 ttl=60 time=3.626 ms
    1080 bytes from ##2.18.##1.##5: icmp_seq=1 ttl=60 time=5.265 ms
    1080 bytes from ##2.18.##1.##5: icmp_seq=2 ttl=60 time=4.380 ms

    --- 172.18.251.115 ping statistics ---
    3 packets transmitted, 3 packets received, 0% packet loss
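
In addition to the MTU tests above, basic TCP reachability of the replication port can be confirmed from the source host. This is a sketch using the nc utility available in the ESXi shell; <destination_ip> is a placeholder for the target host.

    # Verify that TCP port 32032 (vSphere Replication traffic) is reachable
    nc -z <destination_ip> 32032 && echo "port 32032 reachable" || echo "port 32032 blocked"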

Resolution

  • Coordinate with the network team to investigate and resolve the MTU mismatch or path MTU discovery issue between the source and destination ESXi hosts.

  • Specifically, determine why packets using MTU 1500 (payload size 1472) are being dropped or not routed correctly.

  • Check for potential misconfigurations, MTU limitations, or faulty intermediate devices (e.g., switches, routers, or firewalls) affecting the replication traffic path.

  • Ensure consistent MTU settings end-to-end and enable jumbo frame support if required for optimal replication performance (example verification commands are sketched below).
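
A sketch of commands that can help verify the configured MTU values on each ESXi host; the vmk2 interface and payload size 1472 are taken from this case and may differ in other environments.

    # MTU configured on each VMkernel interface (vmk2 carries the replication traffic here)
    esxcli network ip interface list

    # MTU configured on standard vSwitches
    esxcli network vswitch standard list

    # MTU configured on distributed switches, if used
    esxcli network vswitch dvs vmware list

    # Re-test the path with a full-size, non-fragmented frame after any change
    vmkping -I vmk2 -d -s 1472 <destination_ip>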