vMotion fails at 20% for MSCS/Microsoft WSFC VMs with RDM disks

Products

VMware vSphere ESXi

Issue/Introduction

vMotion migration of a WSFC virtual machine with RDM disks stalls at 20% and eventually fails.

The following entries appear in the logs on the destination host:

hostd.log - /var/log/vmware/hostd.log

YYYY-MM-DDT08:16:10.011Z warning hostd[1234567] [Originator@1234 sub=Vmsvc.vm:/vmfs/volumes/UUID/TEST/TEST.vmx opID=abcd123-123456-auto-1new23-h5:xxxxxxxx-xx-xx-xx-xxxx user=vpxuser:admin\admin] PopulateCache failed: _diskAccess : false, _storageAccessible : true
YYYY-MM-DDT08:16:10.012Z warning hostd[1234567] [Originator@1234 sub=Vmsvc.vm:/vmfs/volumes/UUID/TEST/TEST.vmx opID=abcd123-123456-auto-1new23-h5:xxxxxxxx-xx-xx-xx-xxxx user=vpxuser:admin\admin] FetchUpdatedLayout: No cached layout files available. Doing a full fetch
YYYY-MM-DDT08:16:10.012Z warning hostd[1234567] [Originator@1234 sub=Vmsvc.vm:/vmfs/volumes/UUID/TEST/TEST.vmx opID=abcd123-123456-auto-1new23-h5:xxxxxxxx-xx-xx-xx-xxxx user=vpxuser:admin\admin] CannotRetrieveCorefiles: VM disk access is turned off

vmkernel.log - /var/log/vmware/vmkernel.log

YYYY-MM-DDT08:19:10.012Z cpu58:2097339)ScsiDeviceIO: 3484: Cmd(0x45cadfc718c0) 0x1a, CmdSN 0x1943355 from world 0 to dev "naa.6589cfc00000056ef3af090272007105" failed H:0x5 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0.
YYYY-MM-DDT08:19:10.012Z cpu56:2098449)WARNING: ScsiCore: 1851: Invalid sense buffer: error=0x0, valid=0x0
YYYY-MM-DDT08:19:10.012Z cpu56:2098449)NMP: nmp_ResetDeviceLogThrottling:3580: Error status H:0x0 D:0x18 P:0x0 Sense Data: 0x0 0x0 0x0 from dev "naa.###############################" occurred 2344 times(of 2344 commands)
YYYY-MM-DDT08:19:10.012Z cpu56:2098449)WARNING: ScsiCore: 1851: Invalid sense buffer: error=0x0, valid=0x0
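
In the vmkernel.log excerpt, SCSI device status D:0x18 is RESERVATION CONFLICT. As a rough sketch, the devices reporting that status could be pulled out of a saved log with a small shell function. The function name reservation_conflict_devices is hypothetical and not part of ESXi; it only assumes log lines shaped like the ones above, with the device ID in double quotes after the word dev.

```shell
# Hypothetical helper (not an ESXi tool): print each device that logged
# SCSI status D:0x18 (RESERVATION CONFLICT) in vmkernel log text on stdin.
reservation_conflict_devices() {
    grep 'D:0x18' \
        | sed -n 's/.*dev "\([^"]*\)".*/\1/p' \
        | sort -u
}

# Example, with vmkernel.log standing in for a saved copy of the log:
#   reservation_conflict_devices < vmkernel.log
```

Each device printed this way is a candidate to compare against the RDM LUNs of the WSFC virtual machines.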

Environment

VMware vSphere 8

Cause

The perennially reserved flag is not enabled on the LUNs that are attached to the WSFC VMs as physical-mode RDMs.

WSFC cluster nodes that are spread over several ESXi hosts require physical RDMs. The RDMs are shared among all hosts where cluster nodes run. The host with the active node holds persistent SCSI-3 reservations on all shared RDM devices.
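While the active node is running and the devices are reserved, no other host can write to them, so a host that tries to access the LUNs, such as the vMotion destination, receives a reservation conflict (SCSI status D:0x18 in the vmkernel.log excerpt above). The same condition might also affect storage rescan operations.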
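
To see which devices on a host do not have the flag set, the output of esxcli storage core device list can be filtered on its "Is Perennially Reserved" field. The sketch below is a minimal, unofficial helper (the name not_perennially_reserved is made up for illustration). It assumes the usual esxcli layout: each device ID on an unindented line, followed by indented attribute lines.

```shell
# Hypothetical helper: read `esxcli storage core device list` output on stdin
# and print every device whose "Is Perennially Reserved" field is false.
# Assumption: device IDs start in column 0 and attribute lines are indented.
not_perennially_reserved() {
    awk '
        # An unindented, non-empty line starts a new device record.
        /^[^ ]/ { dev = $1 }
        # An indented flag line with value false means the flag is not set.
        /^ +Is Perennially Reserved: false/ { print dev }
    '
}

# Example on an ESXi host:
#   esxcli storage core device list | not_perennially_reserved
```

The result covers every device on the host, including local disks and LUNs that back VMFS datastores, so match it against the RDM LUNs of the WSFC VMs before changing anything.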

Resolution

Set the perennially reserved flag to true on every disk that is configured as a physical-mode RDM on the Windows clustered VMs.

Command: esxcli storage core device setconfig -d naa.id --perennially-reserved=true

Refer: Change Perennial Reservation Settings

Note: The flag is set per host. Ensure that it is set to true for the RDM disks on every ESXi host in the cluster.
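
When several RDM LUNs are involved, it can help to generate the commands first and review them before running anything. The following is only a sketch: print_perennial_commands is a hypothetical function name, and the device IDs passed to it are placeholders. It prints the setconfig command shown above once per device and does not execute it.

```shell
# Hypothetical helper: print (do not run) the setconfig command for each
# device ID given as an argument, so the list can be reviewed first.
print_perennial_commands() {
    for dev in "$@"; do
        printf 'esxcli storage core device setconfig -d %s --perennially-reserved=true\n' "$dev"
    done
}

# Example with placeholder device IDs:
#   print_perennial_commands naa.id1 naa.id2
```

After review, the printed commands would need to be run on each ESXi host that can see the RDM LUNs, since the setting is not shared between hosts.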

Additional Information

Validate the configuration of the WSFC Virtual Machines and the disks attached.

Pre-requisites for vMotion support:

vMotion is supported only for a cluster of virtual machines across physical hosts (CAB).
For VMs with shared cluster resources, do not migrate more than 8 WSFC virtual machines at the same time. Doing so may cause cluster roles to fail over to other VMs.
The vMotion network must use a 10 Gbps Ethernet link. A 1 Gbps Ethernet link is not supported for vMotion of WSFC virtual machines.
vMotion is supported for Windows Server 2012 and later releases. Windows Server 2008 SP2 and earlier are not supported.
The WSFC cluster heartbeat time-out must be raised to at least the values listed below:
(get-cluster -name <cluster-name>).SameSubnetThreshold = 10
(get-cluster -name <cluster-name>).CrossSubnetThreshold = 20
(get-cluster -name <cluster-name>).RouteHistoryLength = 40
The virtual hardware version of the WSFC virtual machine must be version 11 or later.

Caution: Do not set the perennially reserved flag to true for disks that back VMFS volumes.

Reference: Device naa. with a VMFS partition is marked perennially reserved in the vmkernel log