Enhanced Replication configuration fails with the error: "Fault occurred while performing health check. Details: 'Connect: Input/output error'"
Article ID: 401002

Updated On:

Products

VMware Live Recovery

Issue/Introduction

Symptoms:

  • When running an Enhanced Replication Mappings Test, the following error is thrown:
    • Fault occurred while performing health check. Details: 'Connect: Input/output error'



  • Each time an Enhanced Replication Mappings Test is executed, only one ESXi host on the Target reports a 'Good' connection status
    • The ESXi host on the Target side showing a 'Good' connection status switches to a different host each time the Test is run

 

Validation:

  • To investigate further, review /var/run/log/hbrsrv.log on the ESXi host for the network issue indicated by "Dropping error encountered from network" messages.

    Multiple errors related to client connection failures and dropped connections are observed:
    Er(163) hbrsrv[6530583]: [Originator@6876 sub=Main] HbrError stack:
    Er(163) hbrsrv[6530583]: [Originator@6876 sub=Main]    [0] ClientConnection (client=[target_esxi_ip]:52928) request callback failed: Failed to read: End of file
    Er(163) hbrsrv[6530583]: [Originator@6876 sub=Main]    [1] Dropping error encountered from network
    In(166) hbrsrv[6530577]: [Originator@6876 sub=Delta] HbrSrv cleaning out ClientConnection ([target_esxi_ip]:52928)
    In(166) hbrsrv[6530583]: [Originator@6876 sub=StatsLog] HbrEvent: {"clientAddress":"[target_esxi_ip]:52928","eventID":"lwdConnectionReset","groupID":"","serverID":"00000010-0000-0000-0400-000000000000","vimHostName":"vrep_FQDN","hbrEvent":1}
    In(166) hbrsrv[6530583]: [Originator@6876 sub=Delta] Destroying client connection (ClientCnx '[target_esxi_ip]:52928' id=0 <shut> <clsd> <uninit>)
    In(166) hbrsrv[6530582]: [Originator@6876 sub=Delta] ClientConnection (ClientCnx '[target_esxi_ip]:49152' id=0 <shut> <uninit>) is stopping ...

  • Check /var/run/log/hbr-agent.log for the "Broken pipe" errors and "Connection reset" messages:

    In(166) hbr-agent-bin[6531120]: [0x000000bb7ed16700] error: [Proxy [Group: PING-GID-6a0e71e9-01de-450c-9a40-fdc078e34e48] -> [target_esxi_ip:32032]] [b8eeb1b3-6ad8-494b-b9d9-43ec06465c50-HMS-1355] SSL handshake failed: Connection reset by peer
    In(166) hbr-agent-bin[6531120]: [0x000000bb7ed16700] error: [Proxy [Group: PING-GID-6a0e71e9-01de-450c-9a40-fdc078e34e48] -> [target_esxi_ip:32032]] [b8eeb1b3-6ad8-494b-b9d9-43ec06465c50-HMS-1355] Failed to connect to server target_esxi_ip:32032 using broker info: Connection reset by peer
    In(166) hbr-agent-bin[6531120]: [0x000000bb7ec95700] error: [Proxy [Group: PING-GID-6a0e71e9-01de-450c-9a40-fdc078e34e48] -> [target_esxi_ip:32032]] [b8eeb1b3-6ad8-494b-b9d9-43ec06465c50-HMS-1355] Exhausted all server endpoints reported by broker.
    In(166) hbr-agent-bin[6531120]: [0x000000bb7ec95700] info: [RESTRequest] [AppPing] [vrep_ipaddress:51152] [b8eeb1b3-6ad8-494b-b9d9-43ec06465c50-HMS-1355] Completing with OK
    In(166) hbr-agent-bin[6531120]: [0x000000bb7ec95700] error: [RESTConnection] Error writing response: Broken pipe


  • Review /opt/vmware/hms/logs/hms.log for repeated communication failures over port 32032:

    ERROR com.vmware.hms.net.HbrAgentHealthMonitorService [hms-main-thread-25] (..hms.net.HbrAgentHealthMonitorService) [] | Error occurred while executing ping test call for group 'PING-GID-4bcc4b64-ace7-4434-9761-732d228a8b5b', broker 'vrep_ipaddress', broker port '32032' from host 'target_esxi_ip'.
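  • The messages above can be located quickly with grep; this is a minimal sketch using the log paths and strings cited in the excerpts, run on the host that owns each log:

    grep "Dropping error encountered from network" /var/run/log/hbrsrv.log
    grep -E "Broken pipe|Connection reset" /var/run/log/hbr-agent.log
    grep "ping test" /opt/vmware/hms/logs/hms.log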

Environment

VMware ESXi 8.x
vSphere Replication 9.x

Cause

  • The MTU settings across the environment are not consistent.
  • The MTU 9000 ping test fails between the source and target ESXi hosts, while the MTU 1500 ping test succeeds.
  • In Enhanced Replication, data traffic flows directly between the source and target ESXi hosts over the WAN. With both hosts configured for MTU 9000, the Maximum Segment Size (MSS) becomes too large for the WAN path, resulting in data packet loss.
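  • The MTU mismatch can be confirmed from either ESXi host with vmkping, forcing the do-not-fragment bit. "vmk1" and "192.0.2.50" below are placeholders for the local replication vmkernel interface and the remote host's replication address; substitute the values from your environment:

    # ICMP payload size = MTU - 28 (20-byte IP header + 8-byte ICMP header)
    vmkping -I vmk1 -d -s 8972 192.0.2.50    # MTU 9000 test: fails when the WAN path cannot carry jumbo frames
    vmkping -I vmk1 -d -s 1472 192.0.2.50    # MTU 1500 test: succeeds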

Resolution

To resolve the issue, please follow the steps below:

  • Change the MTU to 1500 on the source and target ESXi hosts or work with the network team to resolve MTU-related issues.
  • Use an isolated network for vSphere Replication traffic, setting MTU to 1500 or 9000 as required.
  • Isolating Replication traffic prevents network congestion and ensures optimal performance.
    Reference Link:
    Isolating the Network Traffic of vSphere Replication
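  • If lowering the MTU to 1500 is the chosen fix, it can be applied per vmkernel interface with esxcli. "vmk1" below is a placeholder for the replication interface; the change must be made on both the source and target hosts, and any switch-level MTU must be at least as large:

    # List vmkernel interfaces and their current MTU values
    esxcli network ip interface list
    # Set the replication vmkernel interface to MTU 1500
    esxcli network ip interface set --mtu 1500 --interface-name=vmk1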

Additional Information