Windows SQL Cluster Loses Access to Passthrough RDM Disks After Storage APD

Products

VMware vSphere ESXi

Issue/Introduction

A two-node Windows SQL cluster loses access to its passthrough Raw Device Mapping (RDM) disks within the Guest OS after the backend storage array experiences an All Paths Down (APD) event. The secondary node remains online but no longer sees the disks. The Guest OS storage stack pauses disk operations and does not automatically remount the volumes after the storage paths recover.

To verify which physical devices back the affected Guest OS disks, map each RDM pointer VMDK to its vml ID, and then map the vml ID to the naa ID of the physical LUN:

[root@Hostname:/vmfs/volumes/########-########-####-############/VMName] grep vmdk VMName.vmx
scsi0:0.fileName = "VMName.vmdk"
scsi1:0.fileName = "VMName_2.vmdk"

[root@Hostname:/vmfs/volumes/########-########-####-############/VMName] vmkfstools -q "VMName_2.vmdk"
Disk VMName_2.vmdk is a Passthrough Raw Device Mapping
Maps to: vml.######################################################

[root@Hostname:/vmfs/volumes/########-########-####-############/VMName] esxcli storage core device list -d vml.######################################################
..
..
Display Name: ################## (naa.################################)
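
The three lookups above can also be done by parsing saved command output. The following is a minimal Python sketch, assuming the output has the same layout as the transcript above. The function names and every ID in the sample strings are hypothetical placeholders, not values from a real host.

```python
import re

# Hypothetical sample text modeled on the transcript above; all IDs are made up.
SAMPLE_VMX = '''scsi0:0.fileName = "VMName.vmdk"
scsi1:0.fileName = "VMName_2.vmdk"
'''
SAMPLE_VMKFSTOOLS = '''Disk VMName_2.vmdk is a Passthrough Raw Device Mapping
Maps to: vml.0200000000600000000000000000000000000001
'''
SAMPLE_DEVICE_LIST = '''naa.60000000000000000000000000000001
   Display Name: Example Fibre Channel Disk (naa.60000000000000000000000000000001)
'''


def vmdk_entries(vmx_text):
    """Return {virtual device node: VMDK file name} from .vmx file content."""
    pattern = re.compile(
        r'^\s*(\w+:\d+)\.fileName\s*=\s*"([^"]+\.vmdk)"', re.MULTILINE
    )
    return dict(pattern.findall(vmx_text))


def rdm_target(vmkfstools_output):
    """Return the vml ID if the output describes a passthrough RDM, else None."""
    if "Passthrough Raw Device Mapping" not in vmkfstools_output:
        return None
    match = re.search(r"^Maps to:\s*(vml\.\S+)", vmkfstools_output, re.MULTILINE)
    return match.group(1) if match else None


def naa_id(device_list_output):
    """Return the naa ID shown in parentheses on the Display Name line, else None."""
    match = re.search(
        r"Display Name:.*\((naa\.[0-9a-fA-F]+)\)", device_list_output
    )
    return match.group(1) if match else None
```

Running the three functions in sequence for each VMDK listed in the .vmx file gives a virtual device node to naa ID mapping, which can then be compared against the device identifiers that appear in vmkernel.log.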

The host log (vmkernel.log) shows that the devices backing the RDMs entered the APD state at the same time and that the host could not drop the SCSI-2 reservations during the path failover attempt. The devices later exited the APD state and the host attempted a LUN reset, but the SCSI-2 reservations were not cleared and the Guest OS remained disconnected from the disks:

####-##-##T##:##:##.###Z cpu##:#######)WARNING: NMP: nmp_DeviceUpdatePathStates:####: Could not drop reservation on failover for NMP device "naa.################################".
####-##-##T##:##:##.###Z cpu#:#######)StorageApdHandlerEv: ###: Device or filesystem with identifier [naa.################################] has entered the All Paths Down state.
####-##-##T##:##:##.###Z cpu#:#######)StorageApdHandlerEv: ###: Device or filesystem with identifier [naa.################################] has exited the All Paths Down state.
####-##-##T##:##:##.###Z cpu#:#######)WARNING: NMP: nmpDeviceTaskMgmt:####: Attempt to issue lun reset on device naa.################################. This will clear any SCSI-2 reservations on the device.
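
When many devices are involved, the same correlation can be scripted. The sketch below is one possible approach in Python, assuming log lines worded like the excerpts above. The function name, the summary fields, and the sample log (timestamps, CPU and world IDs, line numbers, naa ID) are all synthetic and for illustration only.

```python
import re
from collections import defaultdict

# Patterns derived from the wording of the log excerpts above.
APD_EVENT = re.compile(
    r"Device or filesystem with identifier \[(naa\.[0-9a-fA-F]+)\] "
    r"has (entered|exited) the All Paths Down state"
)
RESERVATION_WARNING = re.compile(
    r'Could not drop reservation on failover for NMP device "(naa\.[0-9a-fA-F]+)"'
)
LUN_RESET = re.compile(r"Attempt to issue lun reset on device (naa\.[0-9a-fA-F]+)")

# Synthetic sample log; every value in it is made up.
SAMPLE_LOG = '''2024-01-01T00:00:01.000Z cpu10:1000001)WARNING: NMP: nmp_DeviceUpdatePathStates:100: Could not drop reservation on failover for NMP device "naa.60000000000000000000000000000001".
2024-01-01T00:00:02.000Z cpu1:1000002)StorageApdHandlerEv: 100: Device or filesystem with identifier [naa.60000000000000000000000000000001] has entered the All Paths Down state.
2024-01-01T00:02:30.000Z cpu1:1000002)StorageApdHandlerEv: 100: Device or filesystem with identifier [naa.60000000000000000000000000000001] has exited the All Paths Down state.
2024-01-01T00:02:31.000Z cpu1:1000003)WARNING: NMP: nmpDeviceTaskMgmt:100: Attempt to issue lun reset on device naa.60000000000000000000000000000001. This will clear any SCSI-2 reservations on the device.
'''


def summarize_apd(lines):
    """Count APD, reservation, and LUN reset messages per naa device."""
    summary = defaultdict(
        lambda: {"in_apd": False, "apd_events": 0,
                 "reservation_warnings": 0, "lun_resets": 0}
    )
    for line in lines:
        match = APD_EVENT.search(line)
        if match:
            device, transition = match.groups()
            # Track the most recent state so unrecovered devices stand out.
            summary[device]["in_apd"] = transition == "entered"
            if transition == "entered":
                summary[device]["apd_events"] += 1
            continue
        match = RESERVATION_WARNING.search(line)
        if match:
            summary[match.group(1)]["reservation_warnings"] += 1
            continue
        match = LUN_RESET.search(line)
        if match:
            summary[match.group(1)]["lun_resets"] += 1
    return dict(summary)
```

On a host, the input could be the lines of /var/log/vmkernel.log instead of the sample string. A device that shows reservation warnings with `in_apd` set to False matches the pattern described here: the paths recovered, but the reservation problem was logged during the event.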

Environment

ESXi

Raw Device Mapping (RDM)

Cause

An All Paths Down (APD) condition on the physical storage array severed access to the RDM devices. Although the devices eventually exited the APD state, the prolonged I/O timeouts during the event caused the Windows Guest OS to fail the outstanding SCSI commands and drop the disks. This prevented automatic recovery and left stale SCSI-2 reservations on the devices.

Resolution

Reboot the affected Windows Guest OS cluster node after obtaining the appropriate maintenance approvals. The reboot reinitializes the stalled SCSI miniport driver in Windows and forces a full hardware rescan, after which the node can rediscover the RDM disks and re-establish its SCSI-2 reservations.
If the disks still do not appear in Windows Disk Management after the reboot, continue troubleshooting at the Guest OS layer.
Engage your storage administration team to isolate and address the root cause of the SAN fabric APD to prevent future occurrences.