This issue affects SRM recoveries (Test/Planned Migration) using NVMe over Fabrics (NVMe-oF), including Virtual Volumes (vVols).
1. SRM cannot get the host NQN through the vCenter API.
2. When SRM attempts to populate the ESXi host information, it successfully captures the FC WWPNs associated with each host but cannot retrieve the NVMe NQNs associated with the NVMe-backed datastores.
3. Because the NQNs are not populated from the ESXi hosts, the SRM server cannot associate the datastores with the ESXi hosts, and the devices are skipped for failover testing, which ultimately causes the errors the customer sees in the GUI.
4. HostQualifiedName values are shown as null in the vCenter MOB and when fetched via the vCenter APIs.
The recovery fails with errors similar to the following:
Failed to create snapshots of replica devices.
Failed to create snapshot of replica device peer-of-adedc11a-3c3d-4fcf-9a17-78befe5bf39b:VM-00.
Skipping failover operation for device 'peer-of-adedc11a-3c3d-4fcf-9a17-78befe5bf39b:VM-00' as initiators were missing for one or more hosts.
Failed to create snapshot of replica device peer-of-adedc11a-3c3d-4fcf-9a17-78befe5bf39b:VM-01.
Skipping failover operation for device 'peer-of-adedc11a-3c3d-4fcf-9a17-78befe5bf39b:VM-01' as initiators were missing for one or more hosts.
Failed to create snapshot of replica device peer-of-adedc11a-3c3d-4fcf-9a17-78befe5bf39b:VM-02.
Skipping failover operation for device 'peer-of-adedc11a-3c3d-4fcf-9a17-78befe5bf39b:VM-02' as initiators were missing for one or more hosts.
SRM finds only the FC HBAs, not the NVMe NQNs, on the ESXi hosts:
2024-07-10T19:33:16.169-05:00 warning vmware-dr[01090] [SRM@6876 sub=Storage opID=ca529e1a-56a0-471b-8649-d9ad8d881575-test:8863:2fdc:f3a1:0a5c] Cannot create NVMe initiator for host host-76783, because hosts NQN is not available.
2024-07-10T19:33:16.169-05:00 warning vmware-dr[01090] [SRM@6876 sub=Storage opID=ca529e1a-56a0-471b-8649-d9ad8d881575-test:8863:2fdc:f3a1:0a5c] Cannot create NVMe initiator for host host-76837, because hosts NQN is not available.
2024-07-10T19:33:16.169-05:00 verbose vmware-dr[01090] [SRM@6876 sub=Storage opID=ca529e1a-56a0-471b-8649-d9ad8d881575-test:8863:2fdc:f3a1:0a5c] Added initiators to access group 'domain-c18':
This results in the following errors being thrown for all ESXi hosts:
2024-07-10T19:33:16.169-05:00 warning vmware-dr[01090] [SRM@6876 sub=Storage opID=ca529e1a-56a0-471b-8649-d9ad8d881575-test:8863:2fdc:f3a1:0a5c] Failed to obtain initiators for NVMe access group 'domain-c18-nvme'
2024-07-10T19:33:16.169-05:00 warning vmware-dr[01090] [SRM@6876 sub=Storage opID=ca529e1a-56a0-471b-8649-d9ad8d881575-test:8863:2fdc:f3a1:0a5c] Access group 'domain-c18-nvme' doesn't contain any initiators
2024-07-10T19:33:16.169-05:00 warning vmware-dr[01090] [SRM@6876 sub=Storage opID=ca529e1a-56a0-471b-8649-d9ad8d881575-test:8863:2fdc:f3a1:0a5c] Skipping empty access group 'domain-c18-nvme' for device 'peer-of-adbdc11a-3c8d-4fcf-9a67-78befe5bf39b:VM-00' in testFailoverStart
2024-07-10T19:33:16.169-05:00 error vmware-dr[01090] [SRM@6876 sub=Storage opID=ca529e1a-56a0-471b-8649-d9ad8d881575-test:8863:2fdc:f3a1:0a5c] No initiators were found for device 'peer-of-adbdc11a-3c8d-4fcf-9a67-78befe5bf39b:VM-00'. Excluding device from testFailoverStart.
vmware-dr.log:
2024-07-24T10:54:13.446-05:00 verbose vmware-dr[07036] [SRM@6876 sub=Replication opID=2b2a931a-23e1-4609-b21e-d8a684803627-test:e5f9:098c:9542:664a] EntityFailed: Received a failure update for protected VM Id=[dr.replication.ProtectedVm:ce3b1418-b993-40b5-92de-3f8ed4eb2177:protected-vm-1838359], error=
--> (dr.storageProvider.fault.StorageTestFailoverStartFailed) {
--> faultCause = (dr.fault.MultipleFault) {
--> faultCause = (vmodl.MethodFault) null,
--> faultMessage = <unset>,
--> faults = (vmodl.MethodFault) [
--> (dr.storageProvider.fault.StorageDeviceTestFailoverStartFailed) {
--> faultCause = (dr.storage.fault.HostsInitiatorsNotFound) {
--> faultCause = (vmodl.MethodFault) null,
--> faultMessage = <unset>,
--> id = "peer-of-adbdc11a-3c8d-4fcf-9a67-78befe5bf39b:VM-00",
--> deviceType = "device",
--> host = (vim.HostSystem) [
--> 'vim.HostSystem:F16023DF-62A1-44E9-B66C-EA3330BC06BF:host-71783',
--> 'vim.HostSystem:F16023DF-62A1-44E9-B66C-EA3330BC06BF:host-71795',
--> 'vim.HostSystem:F16023DF-62A1-44E9-B66C-EA3330BC06BF:host-71798',
--> 'vim.HostSystem:F16023DF-62A1-44E9-B66C-EA3330BC06BF:host-71801',
--> 'vim.HostSystem:F16023DF-62A1-44E9-B66C-EA3330BC06BF:host-71804',
--> ]
--> msg = ""
--> },
--> faultMessage = <unset>,
--> device = "peer-of-adbdc11a-3c8d-4fcf-9a67-78befe5bf39b:VM-00"
--> msg = ""
--> },
--> (dr.storageProvider.fault.StorageDeviceTestFailoverStartFailed) {
--> faultCause = (dr.storage.fault.HostsInitiatorsNotFound) {
--> faultCause = (vmodl.MethodFault) null,
--> faultMessage = <unset>,
--> id = "peer-of-adbdc11a-3c8d-4fcf-9a67-78befe5bf39b:VM-01",
--> deviceType = "device",
--> host = (vim.HostSystem) [
--> 'vim.HostSystem:F16023DF-62A1-44E9-B66C-EA3330BC06BF:host-76713',
--> 'vim.HostSystem:F16023DF-62A1-44E9-B66C-EA3330BC06BF:host-76715',
--> 'vim.HostSystem:F16023DF-62A1-44E9-B66C-EA3330BC06BF:host-76718',
--> 'vim.HostSystem:F16023DF-62A1-44E9-B66C-EA3330BC06BF:host-76811',
--> 'vim.HostSystem:F16023DF-62A1-44E9-B66C-EA3330BC06BF:host-76814',
--> ]
--> msg = ""
--> },
--> faultMessage = <unset>,
--> device = "peer-of-adbdc11a-3c8d-4fcf-9a67-78befe5bf39b:VM-01"
--> msg = ""
--> },
SRA logs:
[07/24/2024 15:53:48,TestFailoverStop.cs:GetTestFailoverVolume,V] Entering
[07/24/2024 15:53:48,TestFailoverStop.cs:GetTestFailoverVolume,V] Querying array for volume VM-00-puresra-testFailover.
[07/24/2024 15:53:48,TestFailoverStop.cs:GetTestFailoverVolume,V] Volume VM-00-puresra-testFailover does not exist on array 1202b08a-1593-4006-ba36-60b12c43910d
[07/24/2024 15:53:48,TestFailoverStop.cs:GetTestFailoverVolume,V] Querying array for volume VM-00.
[07/24/2024 15:53:48,TestFailoverStop.cs:GetTestFailoverVolume,V] Volume VM-00 does not exist on array 1202b08a-1593-4006-ba36-60b12c43910d
[07/24/2024 15:53:48,TestFailoverStop.cs:GetTestFailoverVolume,V] Test failover volume not found.
[07/24/2024 15:53:48,TestFailoverStop.cs:GetTestFailoverVolume,V] Exiting
[07/24/2024 15:53:48,TestFailoverStop.cs:RemoveFailedoverVolumes,V] Test failover volume does not exist for target volume VM-00 on array 1202b08a-1593-4006-ba36-60b12c43910d - no need to disconnect and eradicate
[07/24/2024 15:53:48,TestFailoverStop.cs:RemoveFailedoverVolumes,V] Successfully eradicated test failover volume for target VM-00 from array 1202b08a-1593-4006-ba36-60b12c43910d
[07/24/2024 15:53:48,TestFailoverStop.cs:GetTestFailoverVolume,V] Entering
[07/24/2024 15:53:48,TestFailoverStop.cs:GetTestFailoverVolume,V] Querying array for volume VM-01-puresra-testFailover.
[07/24/2024 15:53:48,TestFailoverStop.cs:GetTestFailoverVolume,V] Volume VM-01-puresra-testFailover does not exist on array 1202b08a-1593-4006-ba36-60b12c43910d
[07/24/2024 15:53:48,TestFailoverStop.cs:GetTestFailoverVolume,V] Querying array for volume VM-01.
[07/24/2024 15:53:48,TestFailoverStop.cs:GetTestFailoverVolume,V] Volume VM-01 does not exist on array 1202b08a-1593-4006-ba36-60b12c43910d
[07/24/2024 15:53:48,TestFailoverStop.cs:GetTestFailoverVolume,V] Test failover volume not found.
[07/24/2024 15:53:48,TestFailoverStop.cs:GetTestFailoverVolume,V] Exiting
[07/24/2024 15:53:48,TestFailoverStop.cs:RemoveFailedoverVolumes,V] Test failover volume does not exist for target volume VM-01 on array 1202b08a-1593-4006-ba36-60b12c43910d - no need to disconnect and eradicate
[07/24/2024 15:53:48,TestFailoverStop.cs:RemoveFailedoverVolumes,V] Successfully eradicated test failover volume for target VM-01 from array 1202b08a-1593-4006-ba36-60b12c43910d
The esxcli nvme info get command output on the ESXi host shows the host NQN correctly set and working.
An example from the vimdump data on the host:
FabricsInfo:
Host NQN: nqn.2014-08.com.vmware:nvme:ESXi01-dr
qualifiedName = (vim.host.QualifiedName) [
(vim.host.QualifiedName) {
dynamicType = <unset>,
dynamicProperty = (vmodl.DynamicProperty) [],
value = 'nqn.2014-08.com.xx:nvme:ESXi01-dr',
type = 'nvmeQualifiedName'
}
The vCenter MOB must be populated with the same data shown in the host's vimdump above for SRM to use this information when performing failover activities. In the affected environment, however, the value is unset in the vCenter MOB.
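The value can also be checked programmatically. Below is a minimal pyVmomi sketch (not part of SRM or the SRA; the connection details are placeholders) that reads hardware.systemInfo.qualifiedName for every ESXi host and reports hosts for which vCenter has no nvmeQualifiedName entry; these are the hosts SRM skips. It assumes pyVmomi bindings recent enough (vSphere 7.0 U3 / 8.0) to expose the qualifiedName property:

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Placeholder connection details - replace with your vCenter FQDN and credentials.
VCENTER = "vCenterFQDN"
USERNAME = "administrator@vsphere.local"
PASSWORD = "********"

# Lab-only: skip certificate verification; use proper certificates in production.
ctx = ssl._create_unverified_context()
si = SmartConnect(host=VCENTER, user=USERNAME, pwd=PASSWORD, sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        if host.hardware is None:  # disconnected / not responding host
            print(f"{host.name}: hardware info not available")
            continue
        # hardware.systemInfo.qualifiedName is the property shown in the MOB path
        # above; on affected hosts it is unset.
        qnames = getattr(host.hardware.systemInfo, "qualifiedName", None) or []
        nqns = [q.value for q in qnames if q.type == "nvmeQualifiedName"]
        if nqns:
            print(f"{host.name}: NVMe host NQN known to vCenter: {', '.join(nqns)}")
        else:
            print(f"{host.name}: no NVMe host NQN in vCenter (affected by this issue)")
    view.DestroyView()
finally:
    Disconnect(si)

Hosts reported without an NQN here are the ones that trigger the "hosts NQN is not available" warnings above; after applying the resolution or workaround below, re-running the check should show the NQN for every host.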
Affected products:
VMware Live Site Recovery
VMware vCenter Server
VMware vSphere ESXi
NVMe NQNs are not persisted in the vCenter database (VCDB) because VPXD does not record the host hardware info in VCDB.
vCenter records the host hardware info in the following cases:
1. A host is rebooted
2. A host is removed and re-added to the vCenter inventory
3. The hostd or vpxa service is restarted
With the fix in vCenter Server 8.0.1, VPXD records the host hardware info and VCDB persists it; a host sync then updates the value so that it is recorded in VCDB.
Upgrade to vCenter Server 8.0.1 or a later release. Engineering is working on backporting the fix to vCenter Server 7.0.3.
Then, perform one of the following so that vCenter records the host NQNs:
1. Restart hostd (/etc/init.d/hostd restart) on the affected hosts
OR
2. Remove the hosts from the vCenter inventory and re-add them (the hosts must be removed, not just disconnected and reconnected).
The vCenter MOB should now show the qualifiedName property populated with an nvmeQualifiedName entry.
To verify, open the MOB using the URL https://vCenterFQDN/mob?moid=[host moid]&doPath=hardware.systemInfo
Example: https://ESXi01.xx/mob?moid=host-76145&doPath=hardware.systemInfo
SRM recoveries should now complete successfully.
What is an NVMe Qualified Name (NQN)?
An NVMe Qualified Name (NQN) identifies an NVMe host or an NVMe storage target (subsystem). The NQN for a storage array is always assigned by the subsystem and cannot be modified, and there is only one NQN for the entire array. An NQN is limited to 223 bytes in length and is comparable to an iSCSI Qualified Name (IQN).
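For illustration only, here is a small sketch (a hypothetical helper, not part of any VMware or array tooling) that applies the 223-byte length limit and the nqn.yyyy-mm.<reverse-domain>:<identifier> naming convention to the example host NQN shown earlier in this article:

import re

# Maximum NQN length per the NVMe specification (223 bytes, UTF-8 encoded).
MAX_NQN_BYTES = 223

# NQNs take the form nqn.yyyy-mm.<reverse-domain>:<identifier>, e.g. the host
# NQN that ESXi generates: nqn.2014-08.com.vmware:nvme:<hostname>.
NQN_PATTERN = re.compile(r"^nqn\.\d{4}-\d{2}\.[A-Za-z0-9.\-]+(:.+)?$")

def looks_like_nqn(nqn: str) -> bool:
    """Rough sanity check of an NQN string (format and length only)."""
    return len(nqn.encode("utf-8")) <= MAX_NQN_BYTES and bool(NQN_PATTERN.match(nqn))

# Example host NQN taken from the vimdump output in this article.
print(looks_like_nqn("nqn.2014-08.com.vmware:nvme:ESXi01-dr"))  # True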