Unable to connect to vSphere Replication Management Server: Connection Refused

Article ID: 437468

Products

VMware Live Recovery

Issue/Introduction

The vSphere Replication Management service restarts frequently.
Replications in the Live Site Recovery interface are intermittently unavailable.
vSphere Replication may show as disconnected with the following error:
Unable to retrieve pairs from extension server at https://##.##.##.##:8043. Unable to connect to vSphere Replication Management Server at https://##.#.##.##:8043. Reason: https://##.##.##.##:8043 invocation failed with "java.net.ConnectException: Connection refused"
ESXi hosts may become unresponsive with the error: "remote https connections exceed max allowed: 128"

Environment

vSphere Replication 9.x with a large number of ESXi hosts using Enhanced Replication
Live Site Recovery 9.0.3, 9.0.4

Cause

A known issue exists where the health checks used for Enhanced Replication mappings are not properly closed, leading to an accumulation of "Established" connections from ESXi hosts until the 20,000 limit is reached. This issue is seen most often in environments that have a large number of ESXi hosts on both the source and destination sites.

To verify the number of open sockets per remote address:

netstat -n | grep tcp | awk '{sub(/:[0-9]+$/, "", $5); print $5}' | sort | uniq -c
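
As a variation on the command above, the following sketch counts only connections in the ESTABLISHED state and lists the busiest remote addresses first. The function name count_established is illustrative and not part of the product. It assumes the Linux net-tools netstat column layout, where the foreign address is field 5 and the state is field 6.

```shell
#!/bin/bash
# Hypothetical helper (not shipped with the appliance): reads `netstat -n`
# output on stdin and prints "<count> <remote address>" for ESTABLISHED
# TCP connections, highest count first.
count_established() {
    awk '$1 ~ /^tcp/ && $6 == "ESTABLISHED" {
             sub(/:[0-9]+$/, "", $5)   # strip the remote port
             count[$5]++
         }
         END { for (addr in count) print count[addr], addr }' | sort -rn
}

# Usage:
#   netstat -n | count_established | head
```

A steadily growing count for the ESXi host addresses between runs would be consistent with the leak described in this article.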

/opt/vmware/hms/logs/hms-stderr.log shows entries similar to the following:

 SEVERE: Socket accept failed
    java.io.IOException: Too many open files
    at java.base/sun.nio.ch.Net.accept(Native Method)
    at java.base/sun.nio.ch.ServerSocketChannelImpl.implAccept(Unknown Source)
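
To gauge how often the error above has been logged, a small sketch like the following can count the matching lines. The function name count_fd_errors is illustrative, and the match string is taken from the sample log entry shown above.

```shell
#!/bin/bash
# Hypothetical helper: prints the number of lines in the given log file
# that contain "Too many open files".
count_fd_errors() {
    grep -c 'Too many open files' "$1"
}

# Usage:
#   count_fd_errors /opt/vmware/hms/logs/hms-stderr.log
```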

Resolution

This issue has been resolved in vSphere Replication 9.0.2.3
This issue has been resolved in Live Site Recovery 9.0.5

Additional Information

The following workaround is available if an upgrade cannot be performed:

Create a snapshot of the vSphere Replication Appliance.
Establish an SSH session to the appliance and log in as root.
Stop the HMS service: systemctl stop hms
Open the configuration file for editing: vi /opt/vmware/hms/conf/hms-configuration.xml
Locate the schedule-health-checks element and set its value to false: <schedule-health-checks>false</schedule-health-checks>
Save and close the file.
Wait approximately 2 minutes to allow the operating system to clear the leaked socket state.
Start the HMS service: systemctl start hms

Note: Disabling health checks will cause enhanced replication mappings to display an error status in the UI, but replication functionality will remain stable.
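
For administrators who prefer not to edit the file by hand, the workaround steps can be sketched as a short script. This is a minimal sketch, not a supported procedure: the function name disable_health_checks is illustrative, and it assumes the schedule-health-checks element sits on a single line in the form shown above. Take the appliance snapshot first, as in the manual steps.

```shell
#!/bin/bash
# Hypothetical helper: sets <schedule-health-checks> to false in the given
# file and keeps a copy of the original next to it with a .bak suffix.
# Assumes the opening and closing tags are on the same line.
disable_health_checks() {
    sed -i.bak -E \
        's#(<schedule-health-checks>)[^<]*(</schedule-health-checks>)#\1false\2#' "$1"
}

# Sketch of the full sequence, run as root on the appliance:
#   systemctl stop hms
#   disable_health_checks /opt/vmware/hms/conf/hms-configuration.xml
#   grep schedule-health-checks /opt/vmware/hms/conf/hms-configuration.xml
#   sleep 120     # roughly the 2-minute wait from the steps above
#   systemctl start hms
```

The grep line is there only to confirm the value before the service is started again. If the element is missing or spans several lines, the sed expression changes nothing and the file should be edited manually.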
