Site pairing or reconnect fails with SocketTimeoutException in Protection & Recovery UI

Article ID: 441855

Updated On:

Products

VMware Live Recovery

Issue/Introduction

  • Site pairing and RECONNECT operations fail with a SocketTimeoutException in the Protection & Recovery or Disaster Recovery interface.
  • This problem typically occurs in configurations where multiple Primary appliances are linked to a single Remote site appliance containing several SRM-only appliances.

Environment

  • Protection & Recovery 9.x
  • VMware Live Recovery 9.0.x
  • vSphere Replication

Cause

  • The failure is rooted in the pairing workflow's use of the HmsRemoteSiteManager.getHmsInfo() method, which checks configured HMS sites one after another.
  • In complex environments with many site pairs (e.g., 7 or 8), the total time required for these sequential health checks often surpasses the default 30-second socket timeout.
  • The HMS logs on the protected site contain errors similar to the following:


com.vmware.srm.client.topology.client.view.availability.ExtensionServer$GetPairFailedException: Unable to retrieve pairs from extension server at https://Remote_Site_HMS:443/vrms. Unable to connect to HBR Management Server at https://Remote_Site_HMS:443/vrms. Reason: https://Remote_Site_HMS:443/vrms invocation failed with "java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-3822 [ACTIVE]"
        at com.vmware.srm.client.topology.impl.view.availability.ExtensionServerImpl.complete(ExtensionServerImpl.java:66)
        at com.vmware.srm.client.topology.impl.core.mxn.nodes.HmsNode.lambda$discoverNeighbours$1(HmsNode.java:79)
        at com.vmware.dr.ui.tools.reactive.impl.PromiseImpl$ErrorCompletion.complete(PromiseImpl.java:172)
        at com.vmware.dr.ui.tools.reactive.impl.PromiseImpl$Result.complete(PromiseImpl.java:43)
        at com.vmware.dr.ui.tools.reactive.impl.PromiseImpl$Completion.lambda$setResult$0(PromiseImpl.java:63)
        at com.vmware.dr.ui.tools.utilities.ThreadContext.lambda$wrap$1(ThreadContext.java:55)
        at com.vmware.dr.ui.tools.utilities.ThreadContext.execute(ThreadContext.java:209)
        at com.vmware.dr.ui.tools.utilities.ThreadContext.execute(ThreadContext.java:185)
        at com.vmware.dr.ui.tools.utilities.ThreadContext.setupContext(ThreadContext.java:76)
        at com.vmware.dr.ui.tools.utilities.ThreadContext.setupContext(ThreadContext.java:105)
        at com.vmware.dr.ui.tools.reactive.impl.PromiseImpl$Completion.lambda$setResult$1(PromiseImpl.java:63)
        at com.vmware.dr.ui.tools.utilities.AsyncConsumer$Worker.run(AsyncConsumer.java:38)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
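
To check whether a given log shows this signature, a small helper along these lines may be useful. It is only a sketch: the function name count_socket_timeouts is hypothetical, and the log file path is passed as an argument because the exact log location is not stated in this article.

```shell
# Sketch: count log lines that mention the socket timeout from this article.
# count_socket_timeouts is a hypothetical helper name; the log path is an
# argument because the article does not state where the log file lives.
count_socket_timeouts() {
    log_file="$1"
    # -c prints the number of matching lines; -F treats the pattern as a fixed string
    grep -cF 'java.net.SocketTimeoutException' "$log_file"
}

# Example (replace the path with the actual log file on the appliance):
# count_socket_timeouts /path/to/logfile.log
```

A non-zero count alone does not confirm this issue; compare the surrounding stack trace with the one above.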

Resolution

  • A permanent fix is expected in a future release.
  • To work around the issue, increase the socket timeout on all Primary site appliances and the main Remote site appliance by following these steps:
    1. Log in to each appliance via SSH as the root user.
    2. Go to the configuration directory: cd /opt/vmware/etc/dr-client/
    3. Create a backup of the properties file: cp h5dr.properties h5dr.properties.bak
    4. Open h5dr.properties and find the socketTimeout entry.
    5. Change the value from 30000 to 180000 (extending the timeout to 3 minutes).
    6. Save changes and restart the client service: systemctl restart dr-client
    7. In the vCenter plugin, select VIEW DETAILS for each pair and provide the necessary remote site credentials.
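
Steps 2 through 6 can also be scripted, for example when several appliances need the same change. The following is a minimal sketch, not a supported tool: the function name set_socket_timeout is hypothetical, and it assumes the entry in h5dr.properties is written as a key ending in socketTimeout followed by = and a number, which this article does not confirm. It also assumes GNU sed (for the -i option).

```shell
# Sketch: back up a properties file and raise its socketTimeout value.
# set_socket_timeout is a hypothetical helper. Assumption: the entry looks like
# "<optional prefix>socketTimeout=<number>"; check the file before relying on this.
set_socket_timeout() {
    props_file="$1"
    new_value="$2"

    # Accept only a non-empty, all-digit value (milliseconds)
    case "$new_value" in
        ''|*[!0-9]*) echo "timeout must be a number of milliseconds" >&2; return 1 ;;
    esac

    # Step 3: keep a backup next to the original file
    cp "$props_file" "$props_file.bak" || return 1

    # Steps 4-5: replace only the numeric value, leaving the key untouched
    sed -i "s/\(socketTimeout[[:space:]]*=[[:space:]]*\)[0-9][0-9]*/\1$new_value/" "$props_file" || return 1

    # Show the resulting entry so the change can be verified by eye
    grep -n 'socketTimeout' "$props_file"
}

# Example on an appliance (run as root), followed by the restart from step 6:
# set_socket_timeout /opt/vmware/etc/dr-client/h5dr.properties 180000 \
#     && systemctl restart dr-client
```

If the printed entry does not show the new value, restore h5dr.properties.bak and edit the file manually as described in the steps above.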