vSphere Replication 9.x with a large number of ESXi hosts using Enhanced Replication
Live Site Recovery 9.0.3, 9.0.4
A known issue exists where the health checks used for Enhanced Replication mappings are not properly closed, leading to an accumulation of "Established" connections from ESXi hosts until the 20,000 limit is reached. The issue is most likely to appear in environments that have a large number of ESXi hosts on both the source and destination sites.
To verify the number of open sockets, run the following command on the appliance:
netstat -n | grep tcp | awk '{sub(/:[0-9]+$/, "", $5); print $5}' | sort | uniq -c
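The command above counts TCP sockets in every state. As a minimal sketch, assuming the standard Linux netstat column layout (remote address in column 5, state in column 6), the helper below counts only ESTABLISHED connections per remote address and lists the busiest peers first. The function name count_established is hypothetical and is not part of the product.

```shell
# Count ESTABLISHED TCP connections per remote address.
# Reads `netstat -n` style output on stdin, so it also works on saved output.
# Assumption: column 5 is the remote address and column 6 is the state.
count_established() {
    awk '$1 ~ /^tcp/ && $6 == "ESTABLISHED" {
             sub(/:[0-9]+$/, "", $5)   # strip the remote port
             count[$5]++
         }
         END { for (ip in count) print count[ip], ip }' | sort -rn
}

# Usage on the appliance:
#   netstat -n | count_established | head
```

A count for a single ESXi host that keeps growing between runs is consistent with the leak described above.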
/opt/vmware/hms/logs/hms-stderr.log shows entries similar to the following:
SEVERE: Socket accept failed
java.io.IOException: Too many open files
at java.base/sun.nio.ch.Net.accept(Native Method)
at java.base/sun.nio.ch.ServerSocketChannelImpl.implAccept(Unknown Source)
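To check quickly whether the log contains this signature, a sketch such as the following can be used. It relies only on grep and the log path given above. The function name count_fd_errors is hypothetical.

```shell
# Print how many lines in the HMS stderr log contain the error signature.
# $1: optional path to a log file; defaults to the path named in this article.
count_fd_errors() {
    grep -c 'Too many open files' "${1:-/opt/vmware/hms/logs/hms-stderr.log}"
}

# Usage on the appliance:
#   count_fd_errors
```

A non-zero count indicates that the exception shown above has been logged.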
This issue has been resolved in vSphere Replication 9.0.2.3
This issue has been resolved in Live Site Recovery 9.0.5
The following workaround is available if an upgrade cannot be performed:
Create a snapshot of the vSphere Replication Appliance.
Establish an SSH session to the appliance and log in as root.
Stop the HMS service: systemctl stop hms
Open the configuration file for editing: vi /opt/vmware/hms/conf/hms-configuration.xml
Locate the schedule-health-checks element and set its value to false: <schedule-health-checks>false</schedule-health-checks>
Save and close the file.
Wait approximately 2 minutes to allow the operating system to clear the leaked socket state.
Start the HMS service: systemctl start hms
Note: Disabling health checks will cause enhanced replication mappings to display an error status in the UI, but replication functionality will remain stable.
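For administrators who prefer not to edit the file by hand, the configuration change can be sketched with sed. This is an illustration, not a vendor-supplied script. It assumes that the schedule-health-checks element sits on a single line with its opening and closing tags. The function name disable_health_checks is hypothetical. The snapshot, service stop, wait, and service start steps above still apply around it.

```shell
# Set <schedule-health-checks> to false in the given configuration file.
# A copy of the original file is kept next to it with a .bak suffix.
# Assumption: the element is written on one line, as in the step above.
disable_health_checks() {
    conf_file="$1"
    sed -i.bak \
        's|<schedule-health-checks>[^<]*</schedule-health-checks>|<schedule-health-checks>false</schedule-health-checks>|' \
        "$conf_file"
}

# Usage on the appliance, after stopping the HMS service:
#   disable_health_checks /opt/vmware/hms/conf/hms-configuration.xml
```

Comparing the edited file with the .bak copy before starting the service confirms that only this one element changed.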