Re-protect of recovery plan hangs and fails with error "Failed to rescan HBAs on host <host_fqdn/ip> – The connection to the remote server is down. Operation timed out: 300 seconds"
Article ID: 421587

Products

VMware Live Recovery
VMware vSphere ESX 8.x
VMware vSphere ESXi

Issue/Introduction

Symptoms:

  • During the Reprotect phase following a successful failover, the recovery plan hangs while the VMware Live Recovery (VLR) appliance attempts to initiate an HBA rescan on the new target site hosts, with the error:

    "Failed to rescan HBAs on host <target_host> – The connection to the remote server is down. Operation timed out: 300 seconds"

  • If Test Recovery or Planned Migration is attempted, the recovery stalls at the same task.

  • Re-protect or recovery to a different cluster works with no issues.

  • If the affected hosts are rebooted and the re-protect is run again, it completes successfully.

Environment

VMware Live Recovery 9.x with Array Based Replication

VMware ESXi 8.x with Fibre Channel Storage

Cause

  • During the re-protect phase, the ESXi host HBA rescan initiated by VMware Live Recovery (VLR) times out.

  • This is caused by the host's StorageFPIN module running out of memory.

  • From the VLR appliance's /var/log/vmware/srm/vmware-dr.log, it can be seen that the ESX host's HBA rescan operation times out:

    -->    (dr.recovery.RecoveryResult) {
    -->       runKey = #####,
    -->       operation = "reprotect",
    -->       options = (dr.recovery.RecoveryOptions) {
    -->       plan = 'dr.recovery.RecoveryPlan:deb17d65-####-####-8b68-############:814559b1-####-####-8b45-############',
    -->       planName = "##############",
    -->       planDescription = "",
    -->       user = "####.####\Administrator",
    -->       startTime = "2025-11-18T12:35:19.628697Z",
    -->       stopTime = "2025-11-18T12:41:08.139554Z",
    -->       executionTimeInSeconds = 348,
    .
    .
    .
    -->                   msg = "Operation timed out: 300 seconds"
    -->                },
    -->                faultMessage = <unset>
    -->                msg = "The connection to the remote server is down. Operation timed out: 300 seconds"
    -->             },
    -->             faultMessage = <unset>,
    -->             hostName = "##.##.##.##",
    -->             hostSystem = 'vim.HostSystem:814559b1-####-####-8b45-############:host-##'
    -->             msg = "Failed to rescan HBAs on host '##.##.##.##'. The connection to the remote server is down. Operation timed out: 300 seconds"

  • In the same log file on the VLR appliance, it can be seen that the ESXi host reports that an error occurred while communicating with the remote storage server:
    2025-11-18T13:07:06.298Z warning vmware-dr[02917] [SRM@6876 sub=AbrRecoveryEngine opID=814559b1-####-####-8b45-############-test:tm:ge:g3:ad] RescanVmfsDone: Failed to rescanVmfs after resignature on host 'host-##': (dr.storage.fault.HostRescanFailed) {
    -->    faultCause = (vmodl.fault.HostCommunication) {
    -->       faultCause = (vmodl.MethodFault) null,
    -->       faultMessage = <unset>
    -->       msg = "Received SOAP response fault from [<SSL(<io_obj t:N7Vmacore6System19TCPSocketObjectAsioE, h:25, <TCP '##.##.##.## : 51480'>, <TCP '##.##.##.## : 443'>>), /sdk>]: rescanVmfs
    --> An error occurred while communicating with the remote host."
    -->    },
    -->    faultMessage = <unset>,
    -->    hostName = "##.##.##.##",
    -->    hostSystem = 'vim.HostSystem:########-####-####-####-############:host-##'
    -->    msg = ""
  • From the impacted ESXi host's /var/run/log/vmkernel.log, it can be seen that the StorageFPIN module has run out of memory:

    2025-11-18T12:44:27.960Z Wa(180) vmkwarning: cpu18:2097477)WARNING: StorageFPIN: 521: Failed to allocate memory.
  • If this module runs out of memory, further rescans, storage path updates, and LUN information updates will all fail.

  • StorageFPIN (Fabric Performance Impact Notification) was added in ESXi 8.0 U2 to better handle fabric-related issues. Due to a bug in the StorageFPIN code, when FPIN is unable to allocate enough memory, the HBA rescan fails. A temporary/transient storage path loss on the host could also result in paths not coming back.
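To confirm whether a host is in this state, the vmkernel log can be checked for the StorageFPIN allocation failure shown above. A minimal sketch, using a sample log file built from the warning line quoted in this article (on a live host, search /var/run/log/vmkernel.log instead):

```shell
# Sample vmkernel.log content (the warning line quoted above), written to a
# temporary file purely to illustrate the pattern to search for.
cat > /tmp/vmkernel_sample.log <<'EOF'
2025-11-18T12:44:27.960Z Wa(180) vmkwarning: cpu18:2097477)WARNING: StorageFPIN: 521: Failed to allocate memory.
EOF

# Count StorageFPIN out-of-memory warnings; a non-zero count indicates the
# host's StorageFPIN module has exhausted its memory heap.
grep -c 'StorageFPIN: .*Failed to allocate memory' /tmp/vmkernel_sample.log
```

A non-zero count on the live vmkernel.log indicates the host should be rebooted (or FPIN disabled, per the Resolution below) before re-attempting the re-protect.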

Resolution

  • If a host is impacted by this issue, reboot the ESXi host and then re-attempt the re-protect.
  • The StorageFPIN memory issue has been resolved in ESXi 8.0 U3e (build 24674464).

  • To prevent the rescan task timeout during re-protect, upgrade the ESXi hosts to this version.

  • If an upgrade is not immediately possible, it is recommended to disable FPIN on the ESXi hosts:

    • ESXi 8.0 U3 and versions below ESXi 8.0 U3e (24674464):

      NOTE: This setting change does not require a reboot on its own. However, if an ESXi host is already in a memory heap exhaustion state for storageFPINHeap, then rebooting the host is required after making this setting change.

      • Use the following command: esxcli storage fpin info set -e false

      • To confirm the setting: esxcli storage fpin info get

    • ESXi 8.0 U2 and prior:

      • Use the following command: vsish -e set /storage/fpin/info 0

        NOTE: This vsish command is NOT persistent across reboots. We therefore recommend upgrading to ESXi 8.0 U3 and then disabling FPIN, or rebooting the host first and then running the command to disable FPIN.
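Because the disable command differs by release, the version check above can be sketched as a small helper. This is an illustrative sketch, not an official tool: the `fpin_disable_cmd` function name is hypothetical, it assumes the caller passes the output of `vmware -v` (e.g. "VMware ESXi 8.0.3 build-..."), and it relies on 8.0 U3 reporting as 8.0.3 and 8.0 U2 as 8.0.2.

```shell
# Hypothetical helper: echo the FPIN-disable command appropriate for the
# ESXi release string passed in (the output of `vmware -v`).
fpin_disable_cmd() {
    version="$1"
    case "$version" in
        # 8.0 U3 line: esxcli setting, persistent across reboots
        *"ESXi 8.0.3"*) echo "esxcli storage fpin info set -e false" ;;
        # 8.0 U2 and prior 8.0 releases: vsish, NOT persistent across reboots
        *"ESXi 8.0."*)  echo "vsish -e set /storage/fpin/info 0" ;;
        *)              echo "unsupported" ;;
    esac
}

fpin_disable_cmd "VMware ESXi 8.0.3 build-24674464"
fpin_disable_cmd "VMware ESXi 8.0.2 build-23305546"
```

The helper only selects the command; running it (and confirming with `esxcli storage fpin info get` on U3 hosts) is still done manually in the ESXi shell.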

Additional Information

Temporary/transient storage path loss on the host could result in paths not coming back when using Cisco UCS and NFNIC.