The test for Enhanced Replication mapping is stuck, and the replication remains in a "Not Active" state.

Article ID: 388697

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

  • When adding a replication mapping, the Test process keeps running and never completes.

  • Configuring enhanced replication for virtual machines fails with the error:
    "A replication error occurred at the vSphere Replication Server for replication 'vm_name'. Details: 'No connection to VR Server for virtual machine test on host source-esxihost in cluster management in cloud: Unknown

 

Validation:

  • Review /var/run/log/hbrsrv.log on the ESXi host for "Dropping error encountered from network" messages, which indicate a network issue.

The log shows multiple client connection failures and dropped connections:
YYYY-MM-DDTHH:MM:SS.SSSSZ Er(163) hbrsrv[6530583]: [Originator@6876 sub=Main] HbrError stack:
YYYY-MM-DDTHH:MM:SS.SSSSZ Er(163) hbrsrv[6530583]: [Originator@6876 sub=Main]    [0] ClientConnection (client=[target_esxi_ip]:52928) request callback failed: Failed to read: End of file
YYYY-MM-DDTHH:MM:SS.SSSSZ Er(163) hbrsrv[6530583]: [Originator@6876 sub=Main]    [1] Dropping error encountered from network
YYYY-MM-DDTHH:MM:SS.SSSSZ In(166) hbrsrv[6530577]: [Originator@6876 sub=Delta] HbrSrv cleaning out ClientConnection ([target_esxi_ip]:52928)
YYYY-MM-DDTHH:MM:SS.SSSSZ In(166) hbrsrv[6530583]: [Originator@6876 sub=StatsLog] HbrEvent: {"clientAddress":"[target_esxi_ip]:52928","eventID":"lwdConnectionReset","groupID":"","serverID":"00000010-0000-0000-0400-000000000000","vimHostName":"vrep_FQDN","hbrEvent":1}
YYYY-MM-DDTHH:MM:SS.SSSSZ In(166) hbrsrv[6530583]: [Originator@6876 sub=Delta] Destroying client connection (ClientCnx '[target_esxi_ip]:52928' id=0 <shut> <clsd> <uninit>)
YYYY-MM-DDTHH:MM:SS.SSSSZ In(166) hbrsrv[6530582]: [Originator@6876 sub=Delta] ClientConnection (ClientCnx '[target_esxi_ip]:49152' id=0 <shut> <uninit>) is stopping ...

  • Check /var/run/log/hbr-agent.log for "Broken pipe" and "Connection reset by peer" messages:

YYYY-MM-DDTHH:MM:SS.SSSSZ In(166) hbr-agent-bin[6531120]: [0x000000bb7ed16700] error: [Proxy [Group: PING-GID-6a0e71e9-01de-450c-9a40-fdc078e34e48] -> [target_esxi_ip:32032]] [b8eeb1b3-6ad8-494b-b9d9-43ec06465c50-HMS-1355] SSL handshake failed: Connection reset by peer
YYYY-MM-DDTHH:MM:SS.SSSSZ In(166) hbr-agent-bin[6531120]: [0x000000bb7ed16700] error: [Proxy [Group: PING-GID-6a0e71e9-01de-450c-9a40-fdc078e34e48] -> [target_esxi_ip:32032]] [b8eeb1b3-6ad8-494b-b9d9-43ec06465c50-HMS-1355] Failed to connect to server target_esxi_ip:32032 using broker info: Connection reset by peer
YYYY-MM-DDTHH:MM:SS.SSSSZ In(166) hbr-agent-bin[6531120]: [0x000000bb7ec95700] error: [Proxy [Group: PING-GID-6a0e71e9-01de-450c-9a40-fdc078e34e48] -> [target_esxi_ip:32032]] [b8eeb1b3-6ad8-494b-b9d9-43ec06465c50-HMS-1355] Exhausted all server endpoints reported by broker.
YYYY-MM-DDTHH:MM:SS.SSSSZ In(166) hbr-agent-bin[6531120]: [0x000000bb7ec95700] info: [RESTRequest] [AppPing] [vrep_ipaddress:51152] [b8eeb1b3-6ad8-494b-b9d9-43ec06465c50-HMS-1355] Completing with OK
YYYY-MM-DDTHH:MM:SS.SSSSZ In(166) hbr-agent-bin[6531120]: [0x000000bb7ec95700] error: [RESTConnection] Error writing response: Broken pipe

  • Review /opt/vmware/hms/logs/hms.log on the vSphere Replication appliance for repeated communication failures over port 32032:

YYYY-MM-DDTHH:MM:SS.SSSS ERROR com.vmware.hms.net.HbrAgentHealthMonitorService [hms-main-thread-25] (..hms.net.HbrAgentHealthMonitorService) [] | Error occurred while executing ping test call for group 'PING-GID-4bcc4b64-ace7-4434-9761-732d228a8b5b', broker 'vrep_ipaddress', broker port '32032' from host 'target_esxi_ip'.
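
The log checks above can be scripted once the files have been copied off the hosts. The following is a minimal Python sketch, not a VMware tool: the signature strings are taken from the excerpts in this article, while the function name count_signatures and the SIGNATURES table are hypothetical names chosen for illustration.

```python
from collections import Counter

# Signature substrings taken from the log excerpts in this article.
# The grouping by file name is an assumption made for this sketch.
SIGNATURES = {
    "hbrsrv.log": [
        "Dropping error encountered from network",
        "lwdConnectionReset",
    ],
    "hbr-agent.log": [
        "Connection reset by peer",
        "Exhausted all server endpoints reported by broker",
        "Broken pipe",
    ],
    "hms.log": [
        "Error occurred while executing ping test call",
    ],
}


def count_signatures(log_name, lines):
    """Count how often each known signature for log_name appears in lines."""
    counts = Counter()
    for line in lines:
        for signature in SIGNATURES.get(log_name, []):
            if signature in line:
                counts[signature] += 1
    return counts


# Two shortened sample lines based on the excerpts above.
sample = [
    "[Originator@6876 sub=Main]    [1] Dropping error encountered from network",
    "error: [RESTConnection] Error writing response: Broken pipe",
]
print(count_signatures("hbrsrv.log", sample))
print(count_signatures("hbr-agent.log", sample))
```

To scan a real file, pass an open file object as lines, for example count_signatures("hbrsrv.log", open("hbrsrv.log")). A non-zero count only shows that the signature is present; it does not by itself confirm the MTU cause.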

 

Environment

VMware ESXi 8.x
vSphere Replication 9.x

Cause

  • The MTU settings are not consistent along the network path between the source and target ESXi hosts.
  • The MTU 9000 ping test fails between the source and target ESXi hosts, while the MTU 1500 ping test succeeds.
  • In Enhanced Replication, data traffic flows directly between the source and target ESXi hosts over the WAN. With both hosts configured for MTU 9000, they negotiate a TCP Maximum Segment Size (MSS) sized for jumbo frames. The resulting packets are larger than the WAN path can carry, so replication data packets are lost.
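
The sizes involved in the MTU ping test can be worked out with simple arithmetic. The Python sketch below assumes IPv4 with no IP or TCP options (20-byte headers each) and an 8-byte ICMP header. The helper names are hypothetical, and the VMkernel adapter name vmk0 is only a placeholder for whichever adapter carries replication traffic in your environment. The command it prints uses the vmkping options -I (interface), -d (do not fragment), and -s (payload size).

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
TCP_HEADER = 20   # bytes, TCP header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def tcp_mss(mtu):
    """Largest TCP payload that fits in one packet of the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


def ping_payload(mtu):
    """Largest ICMP payload that fits in one unfragmented packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def vmkping_command(mtu, target, vmk="vmk0"):
    """Build a do-not-fragment vmkping command that fills the given MTU."""
    return f"vmkping -I {vmk} -d -s {ping_payload(mtu)} {target}"


for mtu in (1500, 9000):
    print(mtu, tcp_mss(mtu), vmkping_command(mtu, "target_esxi_ip"))
```

Under these assumptions, MTU 9000 gives an MSS of 8960 bytes and a ping payload of 8972 bytes, while MTU 1500 gives 1460 and 1472. If the 8972-byte test fails and the 1472-byte test succeeds, some hop between the hosts does not pass jumbo frames.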

Resolution

To resolve the issue:

  • Change the MTU to 1500 on the source and target ESXi hosts, or work with the network team to make the MTU consistent end to end.
  • Use an isolated network for vSphere Replication traffic, with the MTU set to 1500 or 9000 as that network supports.
  • Isolating replication traffic helps reduce network congestion and improves replication performance.
    Reference Link:
    Isolating the Network Traffic of vSphere Replication
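
When reviewing the path with the network team, it can help to list the MTU of every hop and pick out the ones smaller than the host setting. The Python sketch below illustrates the idea only. The hop names and values are hypothetical and must be replaced with figures collected from your own environment, and find_mtu_bottlenecks is not part of any VMware tooling.

```python
def find_mtu_bottlenecks(hop_mtus, host_mtu):
    """Return (path_mtu, hops) where hops lists each hop below host_mtu."""
    path_mtu = min(hop_mtus.values())
    too_small = [name for name, mtu in hop_mtus.items() if mtu < host_mtu]
    return path_mtu, too_small


# Hypothetical inventory; replace with values from your environment.
hops = {
    "source vmkernel adapter": 9000,
    "source virtual switch": 9000,
    "WAN link": 1500,
    "target virtual switch": 9000,
    "target vmkernel adapter": 9000,
}

path_mtu, too_small = find_mtu_bottlenecks(hops, host_mtu=9000)
print(path_mtu, too_small)
```

With this sample inventory the effective path MTU is 1500 and the WAN link is the only hop reported, which corresponds to the first option above: either lower the host MTU to 1500 or have the undersized hop raised.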