Performing LWD-based snapshot sync fails

Article ID: 391604


Products

VMware vSphere ESXi 7.0
VMware vSphere ESXi 8.0

Issue/Introduction

1. Performing LWD-based snapshot sync fails

2. Unable to Storage vMotion (svMotion) some VMs; the virtual machine freezes, becomes unresponsive, and the task fails at 40%

3. Unable to back up VMs using Dell PowerProtect Data Manager (PPDM)

Task Name  : Perform LWD-based snapshot sync
Status     : Cannot complete the operation. See the event log for details. Failed to transport snapshot data
Initiator  : com.vmware.dp
Target     : VMware-VM  
Server     : vCenter.broadcom.local  
Error stack: Failed to transport snapshot data


vmware.log:

2025-02-27T13:54:07.627Z In(05) worker-2653543 1381770e LWD: Preparing live migration type=SvMotion on VMName_VM
2025-02-27T13:54:07.638Z In(05) vmx - SVMotion: Enter Phase 1
2025-02-27T13:54:07.729Z In(05) worker-2653550 - SVMotion: Enter Phase 2
2025-02-27T13:54:07.730Z In(05) worker-2653550 - SVMotionDiskGetCreateExtParams: not using a storage policy to create disk '/vmfs/volumes/5e6136e8-########-####-a0369f19c094/Datastore/VMName_VM.vmdk'
2025-02-27T13:54:09.356Z In(05) worker-2653550 7283fbbf SVMotionDiskGetCreateExtParams: not using a storage policy to create disk '/vmfs/volumes/5e6136e8-########-####-a0369f19c094/Datastore/VMName_VM.vmdk'
2025-02-27T13:54:10.994Z In(05) worker-2653550 7283fbbf SVMotion: Enter Phase 3
2025-02-27T13:54:11.169Z In(05) worker-2653550 7283fbbf SVMotionLocalDiskQueryInfo: Got block size 1048576 for filesystem VMFS.
2025-02-27T13:54:11.348Z In(05) worker-2653550 7283fbbf SVMotionLocalDiskQueryInfo: Got block size 1048576 for filesystem VMFS.
2025-02-27T13:54:11.348Z In(05) worker-2653550 7283fbbf SVMotion: Enter Phase 4
2025-02-27T13:54:11.450Z In(05) worker-2653550 7283fbbf SVMotion: Enter Phase 5
2025-02-27T13:54:11.450Z In(05) worker-2653550 7283fbbf SVMotion: Enter Phase 6
2025-02-27T13:54:11.463Z In(05) worker-2653543 7283fbbf SVMotion: Enter Phase 7
2025-02-27T13:54:11.468Z In(05) worker-2653547 7283fbbf SVMotion: Enter Phase 8
2025-02-27T13:59:40.292Z In(05) vmx - SVMotion: scsi0:1: Disk copy completed for total 245760 MB at 765326 kB/s.
2025-02-27T14:02:40.375Z Wa(03) vmx - SVMotion: scsi0:0: Disk transfer rate slow: 0 kB/s over the last 10.01 seconds, copied total 62080 MB at 353005 kB/s.
2025-02-27T14:03:03.036Z In(05) vmx 627d3af4 SVMotion: Enter Phase 12
2025-02-27T14:03:03.036Z In(05) vmx 627d3af4 SVMotion_Cleanup: Scheduling cleanup thread.
2025-02-27T14:03:03.036Z Wa(03) worker-2653547 7283fbbf SVMotionMirroredModeThreadDiskCopy: Found internal error when woken up on diskCopySemaphore. Aborting storage vmotion.
2025-02-27T14:03:03.036Z In(05) worker-2653543 627d3af4 SVMotionCleanupThread: Waiting for SVMotion Bitmap thread to complete.
2025-02-27T14:03:03.036Z In(05) worker-2653543 627d3af4 SVMotionCleanupThread: Waiting for SVMotion thread to complete.
2025-02-27T14:03:03.036Z Wa(03) worker-2653547 7283fbbf SVMotionCopyThread: disk copy failed. Canceling Storage vMotion.
2025-02-27T14:03:03.036Z In(05) worker-2653547 7283fbbf SVMotionCopyThread: Waiting for SVMotion Bitmap thread to complete before issuing a stun during migration failure cleanup.
2025-02-27T14:03:03.037Z In(05) worker-2653547 7283fbbf SVMotion: FailureCleanup thread completes.
2025-02-27T14:03:03.037Z In(05) worker-2653543 7283fbbf SVMotion: Worker thread performing SVMotionCopyThreadDone exited.
2025-02-27T14:03:03.037Z In(05) worker-2653543 - SVMotionCleanupThread: Waiting for the cleanup semaphore to be signaled so that it is safe for the cleanup thread to proceed.
2025-02-27T14:03:05.043Z In(05) vmx 7283fbbf [msg.svmotion.fail.internal] A fatal internal error occurred. See the virtual machine's log for more details.
2025-02-27T14:03:05.043Z In(05) vmx 7283fbbf [msg.svmotion.disk.copyphase.failed] Failed to copy one or more disks.

Read I/Os are aborted by the VMkernel:

2025-03-07T03:03:26.811Z Er(02) Upcall-1cd224cc 4f2e9509 LWD: Failed to read extent range [248372,248372], offset range [65109229568, 262144], from disk 4095126750 (capacity 214748364800), readTxn 142. Error: 15: IO was aborted
2025-03-07T03:03:30.369Z In(05) worker-3581520 35be92ef-92f0 LWD: Handling FinishFullSync message for disk '87a63c76-####-####-####-0867148aeab2', sync 61181b23-aa75-405e-5c39-141fe3ed543e success 'false'
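
To locate these signatures quickly in a large vmware.log, a scan along the following lines can be used. This is a minimal Python sketch, assuming a local copy of the log; the default path and the pattern list are illustrative assumptions, not an official VMware tool.

#!/usr/bin/env python3
# Sketch: scan a copy of vmware.log for the failure signatures shown above.
# The default path and the pattern list are illustrative assumptions only.
import re
import sys

PATTERNS = [
    r"LWD: Failed to read extent range",         # read aborted during LWD sync
    r"FinishFullSync message.*success 'false'",  # sync reported as failed
    r"SVMotionCopyThread: disk copy failed",     # Storage vMotion copy failure
    r"msg\.svmotion\.",                          # final SVMotion error messages
]

def scan(path):
    regexes = [re.compile(p) for p in PATTERNS]
    with open(path, errors="replace") as log:
        for line in log:
            if any(r.search(line) for r in regexes):
                print(line.rstrip())

if __name__ == "__main__":
    scan(sys.argv[1] if len(sys.argv) > 1 else "vmware.log")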


vmkernel.log:

Read commands (opcode 0x28, READ(10)) fail with a NOT READY check condition from the target:

2025-03-07T03:03:15.541Z In(182) vmkernel: cpu13:2098156)NMP: nmp_ThrottleLogForDevice:3893: Cmd 0x28 (0x45b9bfc576c0, 3552458) to dev "naa.################################" on path "vmhba2:C0:T2:L114" Failed:
2025-03-07T03:03:15.541Z In(182) vmkernel: cpu13:2098156)NMP: nmp_ThrottleLogForDevice:3898: H:0x0 D:0x2 P:0x0 Valid sense data: 0x2 0x4 0x3. Act:FAILOVER. cmdId.initiator=0x431cbe16c930 CmdSN 0x431cd6c042d0
2025-03-07T03:03:15.541Z Wa(180) vmkwarning: cpu13:2098156)WARNING: NMP: nmp_DeviceRetryCommand:130: Device "naa.################################": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
2025-03-07T03:03:16.528Z Wa(180) vmkwarning: cpu1:2097944)WARNING: NMP: nmpDeviceAttemptFailover:644: Retry world failover device "naa.################################" - issuing command 0x45b9bfc576c0
Sense Key [0x2] NOT READY
Additional Sense Data 04/03 LOGICAL UNIT NOT READY, MANUAL INTERVENTION REQUIRED
OP Code 0x28 READ(10)
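
The sense data triplet (sense key, ASC, ASCQ) in these NMP messages can also be decoded programmatically. The sketch below maps only the values seen in this article; it is not a complete SCSI sense table.

#!/usr/bin/env python3
# Sketch: decode the "Valid sense data: ..." triplet from an NMP log line.
# Only the codes seen in this article are mapped; extend the tables as needed.
import re

SENSE_KEYS = {0x2: "NOT READY"}
ASC_ASCQ = {(0x04, 0x03): "LOGICAL UNIT NOT READY, MANUAL INTERVENTION REQUIRED"}

def decode_sense(line):
    m = re.search(r"Valid sense data: 0x(\w+) 0x(\w+) 0x(\w+)", line)
    if not m:
        return None
    key, asc, ascq = (int(g, 16) for g in m.groups())
    return (SENSE_KEYS.get(key, f"sense key 0x{key:x}"),
            ASC_ASCQ.get((asc, ascq), f"ASC/ASCQ {asc:02x}h/{ascq:02x}h"))

print(decode_sense("H:0x0 D:0x2 P:0x0 Valid sense data: 0x2 0x4 0x3."))
# -> ('NOT READY', 'LOGICAL UNIT NOT READY, MANUAL INTERVENTION REQUIRED')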


The command is retried multiple times, but the target keeps returning the same NOT READY status:

2025-03-07T03:03:20.080Z Wa(180) vmkwarning: cpu18:2098156)WARNING: NMP: nmpCompleteRetryForPath:356: Retry cmd 0x28 (0x45b9bfc576c0) to dev "naa.################################" failed on path "vmhba2:C0:T2:L114" H:0x0 D:0x2 P:0x0 Valid sense data: 0x2 0x4 0x3.
2025-03-07T03:03:20.080Z Wa(180) vmkwarning: cpu18:2098156)WARNING: NMP: nmpCompleteRetryForPath:391: Logical device "naa.################################": awaiting fast path state update before retrying failed command again...

Eventually the command is aborted as part of a virtual (virt) reset:

2025-03-07T03:03:26.811Z Wa(180) vmkwarning: cpu0:2098156)WARNING: NMP: nmpCompleteRetryForPath:356: Retry cmd 0x28 (0x45b9bfc576c0) to dev "naa.################################" failed on path "vmhba2:C0:T2:L114" H:0x8 D:0x0 P:0x0 .
2025-03-07T03:03:26.811Z Wa(180) vmkwarning: cpu0:2098156)WARNING: NMP: nmpCompleteRetryForPath:443: Retry world restored device "naa.################################" - no more commands to retry
2025-03-07T03:03:26.811Z Wa(180) vmkwarning: cpu0:2098156)WARNING: NMP: nmpCompleteRetryForPath:457: NMP device "naa.################################": requested fast path state update...
2025-03-07T03:03:26.811Z In(182) vmkernel: cpu0:2098156)ScsiDeviceIO: 4591: Cmd(0x45b9bfc576c0) 0x28, cmdId.initiator=0x431cbe16c930 CmdSN 0x431cd6c042d0 from world 3552458 to dev "naa.################################" failed H:0x8 D:0x0 P:0x0 Cancelled from driver layer
Host Status [0x8] RESET: This status is returned when the HBA driver has aborted the I/O. It can also occur if the HBA does a reset of the target.
2025-03-01T02:44:00.779Z In(182) vmkernel: cpu0:2097235)ScsiDeviceIO: 4656: Cmd(0x45d9aa86ba40) 0x89, cmdId.initiator=0x4309476f3a80 CmdSN 0x8b4fb from world 2097224 to dev "naa.################################" failed H:0x5 D:0x0 P:0x0 Cancelled from device layer.
2025-03-01T02:44:00.779Z In(182) vmkernel: cpu0:2097235)Cmd count Active:1 Queued:24
2025-03-01T02:47:21.784Z In(182) vmkernel: cpu1:2097237)ScsiDeviceIO: 4656: Cmd(0x45d9f5d5a900) 0x89, cmdId.initiator=0x4309476f3a80 CmdSN 0x8b701 from world 2097224 to dev "naa.################################" failed H:0x5 D:0x0 P:0x0 Cancelled from device layer.
2025-03-01T02:47:21.784Z In(182) vmkernel: cpu1:2097237)Cmd count Active:1 Queued:20
2025-03-03T02:53:27.141Z In(182) vmkernel: cpu0:2097236)ScsiDeviceIO: 4656: Cmd(0x45d9d32c9180) 0x89, cmdId.initiator=0x4309476f3a80 CmdSN 0xc7593 from world 2097224 to dev "naa.################################" failed H:0x5 D:0x0 P:0x0 Cancelled from device layer.
2025-03-03T02:53:27.141Z In(182) vmkernel: cpu0:2097236)Cmd count Active:1 Queued:20
Host Status [0x5] ABORT: This status is returned if the driver has to abort commands in-flight to the target. This can occur due to a command timeout or a parity error in the frame.
OP Code 0x89 COMPARE AND WRITE  
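
The H:/D:/P: status fields in these messages can be interpreted the same way. The sketch below covers only the host status values discussed in this article (0x0 OK, 0x5 ABORT, 0x8 RESET); other values are left unmapped.

#!/usr/bin/env python3
# Sketch: interpret the H:/D:/P: status triplet in ScsiDeviceIO/NMP messages.
# Only the host status values discussed in this article are mapped.
import re

HOST_STATUS = {
    0x0: "OK",
    0x5: "ABORT - driver aborted in-flight commands (timeout or parity error)",
    0x8: "RESET - HBA driver aborted the I/O or reset the target",
}

def host_status(line):
    m = re.search(r"H:0x(\w+) D:0x(\w+) P:0x(\w+)", line)
    if not m:
        return None
    h = int(m.group(1), 16)
    return HOST_STATUS.get(h, f"unmapped host status 0x{h:x}")

print(host_status("failed H:0x8 D:0x0 P:0x0 Cancelled from driver layer"))
# -> 'RESET - HBA driver aborted the I/O or reset the target'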


Environment

VMware vSphere 7.x
VMware vSphere 8.x

Cause

The errors recorded in the logs indicate that the storage array is having difficulty processing ATS (COMPARE AND WRITE, opcode 0x89) and I/O commands, which results in the issues described above.

Resolution

Please reach out to your storage vendor and open a case for further investigation and analysis.