Reprotect Failure – Unable to Delete Stale LUN State on ESXi Host

Products

VMware Live Recovery

Issue/Introduction

During a reprotect operation, the following error was observed:

Cannot delete the state for SCSI LUN on host <ESXi_Host>.
The connection to the remote server is down. Operation timed out: 900 seconds

The recovery plan was stuck in the Reprotecting state.
The affected VM on the source site was powered on and running normally.

Example from the Recovery Plan report:

Commands run on the ESXi host at the recovery site responded slowly because the host was unresponsive.
The SRM logs confirmed that the recovery plan execution failed after a 900-second timeout while attempting to delete the LUN state.

From /opt/vmware/support/logs/srm/vmware-dr.log

--> planName = "RP",
--> planDescription = "",
--> user = "VSPHERE.LOCAL\dr",
--> startTime = "2025-10-20T03:58:55.183991Z",
--> stopTime = "2025-10-20T04:16:19.969144Z",
--> executionTimeInSeconds = 1045,
--> totalPausedTimeInSeconds = 0,
--> resultState = "errors",
--> warningCount = 0,
--> errorCount = 1,
--> poweredOnVms = 0,
--> errorStateVms = 0,
--> successfullyRecoveredVms = 1,
--> ipCustomizedVms = 0,
--> errorIpCustomizedVms = 0,
--> poweredOffVms = 0,
--> warnings = <unset>,
--> errors = (vmodl.MethodFault) [
--> (dr.storage.fault.DeleteScsiLunStateFault) {
--> faultCause = (dr.fault.ConnectionDownFault) {
--> faultCause = (dr.fault.Timedout) {
--> faultCause = (vmodl.MethodFault) null,
--> faultMessage = <unset>,
--> timeout = 900
--> msg = "Operation timed out: 900 seconds"
--> },
--> faultMessage = <unset>
--> msg = "The connection to the remote server is down. Operation timed out: 900 seconds"
--> },
--> faultMessage = <unset>,
--> hostName = "ESXI host",
--> host = 'vim.HostSystem:73###c80-####-####-####-f79######527:host',
--> lunCanonicalName = "naa.6000##########37"
--> msg = "Cannot delete the state for SCSI LUN 'naa.6000##########37' on host 'ESXI host'. The connection to the remote server is down. Operation timed out: 900 seconds"
--> }
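
To check whether other recovery plan runs hit the same fault, the log can be scanned for DeleteScsiLunStateFault entries. The following is a minimal Python sketch, not an SRM tool: the function name summarize_lun_state_faults is hypothetical, and it assumes the fault block is laid out as in the excerpt above (one "key = value" field per line).

```python
import re

# Hypothetical helper (not part of SRM): summarize DeleteScsiLunStateFault
# entries found in the text of vmware-dr.log.
FAULT_MARKER = "dr.storage.fault.DeleteScsiLunStateFault"


def summarize_lun_state_faults(log_text):
    """Return one dict per DeleteScsiLunStateFault block in log_text."""
    results = []
    # Everything after each marker, up to the next marker, is treated as one fault.
    for chunk in log_text.split(FAULT_MARKER)[1:]:
        lun = re.search(r'lunCanonicalName = "([^"]+)"', chunk)
        host = re.search(r'hostName = "([^"]+)"', chunk)
        timeout = re.search(r"timeout = (\d+)", chunk)
        results.append({
            "lun": lun.group(1) if lun else None,
            "host": host.group(1) if host else None,
            "timeout_seconds": int(timeout.group(1)) if timeout else None,
        })
    return results


# Example usage on the SRM appliance (path taken from this article):
# with open("/opt/vmware/support/logs/srm/vmware-dr.log", errors="replace") as f:
#     for fault in summarize_lun_state_faults(f.read()):
#         print(fault)
```

Each returned entry names the LUN and host to investigate, along with the timeout that was reached.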

Force Cleanup also failed, with the operation timing out after 1247 seconds.

Environment

VMware Live Site Recovery 9.0

Array-based replication

Cause

The ESXi host experienced severe hardware-level errors on the attached datastore, resulting in:

Sluggish performance
Host unresponsiveness
Timeouts during the SRM reprotect operation

The host was unable to process commands or clean up stale LUN state due to repeated I/O failures.

Cause Justification

/var/log/hostd.log recorded extremely long I/O access times.

2025-10-20T03:58:55.643Z warning hostd[2280913] [Originator@6876 sub=IoTracker] In thread 2105284, access("/vmfs/volumes/68###3be-########-4492-b49#####05a4/catalog") took over 59247 sec.
2025-10-20T03:58:55.643Z warning hostd[2280913] [Originator@6876 sub=IoTracker] In thread 2105292, access("/vmfs/volumes/68###3be-########-4492-b49#####05a4/catalog") took over 76201 sec.
2025-10-20T03:58:55.643Z warning hostd[2280913] [Originator@6876 sub=IoTracker] In thread 2104362, access("/vmfs/volumes/68###3be-########-4492-b49#####05a4/catalog") took over 90742 sec.
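
The IoTracker warnings above can be extracted from hostd.log with a short script. This is a minimal Python sketch that assumes the message format shown in the excerpt; the function name slow_io_calls and the 60-second default threshold are illustrative assumptions, not VMware-defined values.

```python
import re

# Matches hostd IoTracker warnings in the form shown in the excerpt above.
IOTRACKER_RE = re.compile(
    r'sub=IoTracker\] In thread (\d+), (\w+)\("([^"]+)"\) took over (\d+) sec'
)


def slow_io_calls(lines, threshold_sec=60):
    """Return (thread, call, path, seconds) for IoTracker warnings at or above the threshold.

    threshold_sec is an arbitrary illustrative cutoff, not a VMware default.
    """
    hits = []
    for line in lines:
        match = IOTRACKER_RE.search(line)
        if match and int(match.group(4)) >= threshold_sec:
            hits.append((
                int(match.group(1)),   # hostd thread ID
                match.group(2),        # blocked call, e.g. access
                match.group(3),        # datastore path that was slow
                int(match.group(4)),   # reported duration in seconds
            ))
    return hits


# Example usage against a copy of the host log:
# with open("/var/log/hostd.log", errors="replace") as f:
#     for hit in slow_io_calls(f):
#         print(hit)
```

Many hits against the same /vmfs/volumes path point to a single datastore as the source of the stall.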

/var/log/vmkernel.log showed repeated SCSI device errors with valid sense data indicating hardware faults.

2025-10-21T16:44:43.583Z cpu34:2098192)ScsiDeviceIO: 4167: Cmd(0x45######a108) 0x89, CmdSN 0x9b12 from world 59###03 to dev "naa.61c73##############012" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x0 0x0
2025-10-21T16:44:43.584Z cpu42:5901803 opID=c8a50649)HBX: 1016: 'Datastore': HB at offset 45###4 - Setting pulse failed: I/O error:
2025-10-21T16:44:43.584Z cpu42:5901803 opID=c8a50649) [HB state abcdef02 offset 4521984 gen 3 stampUS 2353828222629 uuid 68###df8-########-d0ef-a03#####afba jrnl <FB 0> drv 24.82 lockImpl 4 ip]
2025-10-21T16:44:43.584Z cpu34:2098192)ScsiDeviceIO: 4167: Cmd(0x45######a108) 0x89, CmdSN 0x9b14 from world 59###03 to dev "naa.61c73##############012" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x0 0x0
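
In the ScsiDeviceIO lines above, D:0x2 is the SCSI CHECK CONDITION status and the first sense byte, 0x4, is the HARDWARE ERROR sense key. The following Python sketch decodes those fields. It assumes the line format shown in the excerpt, the function name decode_scsi_failure is hypothetical, and the lookup tables cover only a subset of the standard SCSI codes.

```python
import re

# Subset of standard SCSI device status codes.
DEVICE_STATUS = {
    0x00: "GOOD",
    0x02: "CHECK CONDITION",
    0x08: "BUSY",
    0x18: "RESERVATION CONFLICT",
    0x28: "TASK SET FULL",
}

# Subset of standard SCSI sense keys.
SENSE_KEY = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0xB: "ABORTED COMMAND",
}

HEX = r"(0x[0-9a-fA-F]+)"
SCSI_FAIL_RE = re.compile(
    rf"failed H:{HEX} D:{HEX} P:{HEX} Valid sense data: {HEX} {HEX} {HEX}"
)


def decode_scsi_failure(line):
    """Decode the status and sense fields of a vmkernel ScsiDeviceIO failure line."""
    match = SCSI_FAIL_RE.search(line)
    if not match:
        return None
    host, device, plugin, key, asc, ascq = (int(v, 16) for v in match.groups())
    return {
        "host_status": host,
        "device_status": DEVICE_STATUS.get(device, hex(device)),
        "plugin_status": plugin,
        "sense_key": SENSE_KEY.get(key, hex(key)),
        "asc": asc,    # additional sense code
        "ascq": ascq,  # additional sense code qualifier
    }
```

For the lines in this article the result is CHECK CONDITION with a HARDWARE ERROR sense key, which is consistent with a fault in the storage hardware rather than in SRM.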

These issues directly impacted SRM’s ability to complete LUN state cleanup within the timeout window.

Resolution

Reboot the ESXi host to restore responsiveness.
Once the host is stable, run Force Cleanup on the Recovery Plan to clear the stale LUN state.

If the issue persists after these steps, contact Broadcom Support for further assistance.