Enhanced Replication configuration fails with the error: "Fault occurred while performing health check. Details: 'Connect: Input/output error'"
Article ID: 401002

Updated On:

Products

VMware Live Recovery

Issue/Introduction

Symptoms:

  • When running an Enhanced Replication Mappings Test, the following error is thrown:
    • Fault occurred while performing health check. Details: 'Connect: Input/output error'



  • Each time an Enhanced Replication Mappings Test is executed, only one ESXi host on the Target reports a 'Good' connection status
    • The ESXi host on the Target side showing a 'Good' connection status switches to a different host each time the Test is run

 

Validation:

  • To investigate further, review /var/run/log/hbrsrv.log on the ESXi host for the network issue indicated by "Dropping error encountered from network" messages.

    Multiple errors related to client connection failures and dropped connections are observed:
    Er(163) hbrsrv[6530583]: [Originator@6876 sub=Main] HbrError stack:
    Er(163) hbrsrv[6530583]: [Originator@6876 sub=Main]    [0] ClientConnection (client=[target_esxi_ip]:52928) request callback failed: Failed to read: End of file
    Er(163) hbrsrv[6530583]: [Originator@6876 sub=Main]    [1] Dropping error encountered from network
    In(166) hbrsrv[6530577]: [Originator@6876 sub=Delta] HbrSrv cleaning out ClientConnection ([target_esxi_ip]:52928)
    In(166) hbrsrv[6530583]: [Originator@6876 sub=StatsLog] HbrEvent: {"clientAddress":"[target_esxi_ip]:52928","eventID":"lwdConnectionReset","groupID":"","serverID":"00000010-0000-0000-0400-000000000000","vimHostName":"vrep_FQDN","hbrEvent":1}
    In(166) hbrsrv[6530583]: [Originator@6876 sub=Delta] Destroying client connection (ClientCnx '[target_esxi_ip]:52928' id=0 <shut> <clsd> <uninit>)
    In(166) hbrsrv[6530582]: [Originator@6876 sub=Delta] ClientConnection (ClientCnx '[target_esxi_ip]:49152' id=0 <shut> <uninit>) is stopping ...

  • Check /var/run/log/hbr-agent.log for the "Broken pipe" errors and "Connection reset" messages:

    In(166) hbr-agent-bin[6531120]: [0x000000bb7ed16700] error: [Proxy [Group: PING-GID-6a0e71e9-01de-450c-9a40-fdc078e34e48] -> [target_esxi_ip:32032]] [b8eeb1b3-6ad8-494b-b9d9-43ec06465c50-HMS-1355] SSL handshake failed: Connection reset by peer
    In(166) hbr-agent-bin[6531120]: [0x000000bb7ed16700] error: [Proxy [Group: PING-GID-6a0e71e9-01de-450c-9a40-fdc078e34e48] -> [target_esxi_ip:32032]] [b8eeb1b3-6ad8-494b-b9d9-43ec06465c50-HMS-1355] Failed to connect to server target_esxi_ip:32032 using broker info: Connection reset by peer
    In(166) hbr-agent-bin[6531120]: [0x000000bb7ec95700] error: [Proxy [Group: PING-GID-6a0e71e9-01de-450c-9a40-fdc078e34e48] -> [target_esxi_ip:32032]] [b8eeb1b3-6ad8-494b-b9d9-43ec06465c50-HMS-1355] Exhausted all server endpoints reported by broker.
    In(166) hbr-agent-bin[6531120]: [0x000000bb7ec95700] info: [RESTRequest] [AppPing] [vrep_ipaddress:51152] [b8eeb1b3-6ad8-494b-b9d9-43ec06465c50-HMS-1355] Completing with OK
    In(166) hbr-agent-bin[6531120]: [0x000000bb7ec95700] error: [RESTConnection] Error writing response: Broken pipe


  • Review /opt/vmware/hms/logs/hms.log for repeated communication failures over port 32032:

    ERROR com.vmware.hms.net.HbrAgentHealthMonitorService [hms-main-thread-25] (..hms.net.HbrAgentHealthMonitorService) [] | Error occurred while executing ping test call for group 'PING-GID-4bcc4b64-ace7-4434-9761-732d228a8b5b', broker 'vrep_ipaddress', broker port '32032' from host 'target_esxi_ip'.
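  • The messages above can be located quickly with grep; this is a minimal sketch using the log paths and strings cited in the excerpts, run on the host that owns each log:

    grep "Dropping error encountered from network" /var/run/log/hbrsrv.log
    grep -E "Broken pipe|Connection reset" /var/run/log/hbr-agent.log
    grep "ping test" /opt/vmware/hms/logs/hms.log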

Environment

VMware ESXi 8.x
vSphere Replication 9.x

Cause

  • The MTU settings across the environment are not consistent.
  • The MTU 9000 ping test fails between the source and target ESXi hosts, while the MTU 1500 ping test succeeds.
  • In Enhanced Replication, data traffic flows directly between the source and target ESXi hosts over the WAN. With both hosts configured for MTU 9000, the Maximum Segment Size (MSS) becomes too large for the WAN path, resulting in data packet loss.
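  • The MTU mismatch can be confirmed from either ESXi host with vmkping, forcing the do-not-fragment bit. "vmk1" and "192.0.2.50" below are placeholders for the local replication vmkernel interface and the remote host's replication address; substitute the values from your environment:

    # ICMP payload size = MTU - 28 (20-byte IP header + 8-byte ICMP header)
    vmkping -I vmk1 -d -s 8972 192.0.2.50    # MTU 9000 test: fails when the WAN path cannot carry jumbo frames
    vmkping -I vmk1 -d -s 1472 192.0.2.50    # MTU 1500 test: succeeds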

Resolution

To resolve the issue, please follow the steps below:

  • Change the MTU to 1500 on the source and target ESXi hosts or work with the network team to resolve MTU-related issues.
  • Use an isolated network for vSphere Replication traffic, setting MTU to 1500 or 9000 as required.
  • Isolating Replication traffic prevents network congestion and ensures optimal performance.
    Reference Link:
    Isolating the Network Traffic of vSphere Replication
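  • If lowering the MTU to 1500 is the chosen fix, it can be applied per vmkernel interface with esxcli. "vmk1" below is a placeholder for the replication interface; the change must be made on both the source and target hosts, and any switch-level MTU must be at least as large:

    # List vmkernel interfaces and their current MTU values
    esxcli network ip interface list
    # Set the replication vmkernel interface to MTU 1500
    esxcli network ip interface set --mtu 1500 --interface-name=vmk1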

Additional Information