Failed to recover datastore. VMFS volume residing on recovered devices cannot be found.

Article ID: 383166


Products

VMware Live Recovery
VMware vSphere ESXi

Issue/Introduction


This error typically indicates that the VMFS datastore (or its underlying storage devices) cannot be accessed or mounted after a recovery plan (PLANNED MIGRATION/DISASTER RECOVERY) is run using SRM. It can also indicate a problem with the VMFS file system itself or with the LUN devices backing it.


Failed to recover datastore 'VMBroadcom'. VMFS volume residing on recovered devices '"################################"' cannot be found. Recovered device '################################' not found after HBA rescan. Some virtual machines in the protection group 'Broadcom' could not be recovered


Error - Some virtual machines in the protection group 'Broadcom' could not be recovered. Failed to recover datastore 'VMBroadcom'. VMFS volume residing on recovered devices '"1047"' cannot be found.


Error - Some virtual machines in the protection group 'Broadcom' could not be recovered 
Failed to recover datastore 'VMBroadcom'. 
VMFS volume residing on recovered devices '"//Broadcom-1/Broadcom-2/Broadcom-3"' cannot be found. 
Recovered device '//Broadcom-1/Broadcom-2/Broadcom-3' not found after HBA rescan. 

/var/log/vmware/srm/vmware-dr.log:

-->             msg = "Failed to recover datastore 'VMBroadcom'. VMFS volume residing on recovered devices '"31710"' cannot be found."
-->                   msg = "Failed to recover datastore 'VMBroadcom'. VMFS volume residing on recovered devices '"31710"' cannot be found."
-->             msg = "Failed to recover datastore 'VMBroadcom'. VMFS volume residing on recovered devices '"31710"' cannot be found. Some virtual machines in the protection group 'Broadcom' could not be recovered"
-->                   msg = "Failed to recover datastore 'VMBroadcom'. VMFS volume residing on recovered devices '"31710"' cannot be found."
-->             msg = "Failed to recover datastore 'VMBroadcom'. VMFS volume residing on recovered devices '"31710"' cannot be found. Some virtual machines in the protection group 'Broadcom' could not be recovered"
-->             msg = "Failed to recover datastore 'VMBroadcom'. VMFS volume residing on recovered devices '"31710"' cannot be found."

-->       "                      <msg>Failed to recover datastore &apos;VMBroadcom&apos;. VMFS volume residing on recovered devices &apos;&quot;31710&quot;&apos; cannot be found.</msg>",
-->       "                      <protectedName>VMBroadcom</protectedName>",
-->       "                      <protectedUrl>ds:///vmfs/volumes/########-c3125ce7-bea5-############/</protectedUrl>",
-->       "    </e>",
-->       "  </faults>",
-->       "              <msg>Failed to recover datastore &apos;VMBroadcom&apos;. VMFS volume residing on recovered devices &apos;&quot;31710&quot;&apos; cannot be found.</msg>",
-->       "</fault>",
-->       "          <Children>",
-->       "            <Step elapsedTime="00:00:00" endTime="2024-10-17T13:55:07Z" objectId="protected-vm-5560" startTime="2024-10-17T13:55:07Z" status="error">",
-->       "              <Key>RecoveryStepConfigStorageOp.name</Key>",
-->       "              <Name>Configure storage</Name>",
-->       "              <fault>",
-->       "                  <_type>dr.storageProvider.fault.DatastoreRecoveryFailed</_type>",
-->       "                  <faultCause>",
-->       "                      <_type>dr.storageProvider.fault.RecoveryVmfsVolumeNotFound</_type>",
-->       "                      <device>",
-->       "                          <_length>1</_length>",
-->       "                          <_type>string[]</_type>",
-->       "                          <e id="0">31710</e>",
-->       "    </device>",
-->       "                      <msg/>",
-->       "  </faultCause>",
-->       "                  <msg>Failed to recover datastore &apos;VMBroadcom&apos;. VMFS volume residing on recovered devices &apos;&quot;31710&quot;&apos; cannot be found.</msg>",
-->       "                  <protectedName>VMBroadcom</protectedName>",
-->       "                  <protectedUrl>ds:///vmfs/volumes/########-c3125ce7-bea5-############/</protectedUrl>",

Environment


VMware Site Recovery Manager
VMware Live Site Recovery
VMware vSphere ESXi 

Cause


1. LUN mapping - The device mappings might be incorrect or missing, causing the recovery process to fail.

2. Resource mapping - The resource mappings in SRM might be incorrect.

3. VMFS datastore - The VMFS datastore might be damaged or have partition table errors, preventing the recovery process from finding the VMFS header needed to mount the datastore.

Resolution


Follow this checklist of possible causes and troubleshoot in the order mentioned below. 


Check the Storage Replication Adapters (SRA) section in the SRM UI to see the current status of the failed-over recovery plan. Sort by Status to see a list of devices/datastores that show one of the following:

1. Failover in progress - This status indicates that the storage array is still busy processing the failover of LUNs from the source site to the target site. Note that this status may also be wrong or stuck. The best way to determine the actual status of a LUN failover is to check the storage array in real time through its CLI or GUI. If the storage array shows that the LUN replication is still in progress, wait. If you believe you have waited long enough, or you are seeing error messages, log a case with your storage vendor's support for further assistance.

If you deleted the Replication Group/Consistency Group/Protection Group (array vendors use different terms for LUN replication) because it had a problem or had to be recreated on purpose, but you still see Failover in progress in SRM, the likely cause is that SRM is still caching the old information in its database. Log a case with SRM support for further assistance.


2. Failover complete - This status indicates that the storage array has completed the failover of LUNs. Continue reading this KB to troubleshoot the issue.

After running a PLANNED MIGRATION/DISASTER RECOVERY, the replica or snapshot LUN presented by the target array must be mapped to all the hosts in the cluster where the placeholder VMs are located. Ask your internal storage administrator or a support engineer from the storage vendor for the LUN ID (naa or eui ID), then work through the list below.

1. Is this LUN mapped to all the hosts?

Go to the ESXi host > Configure > Storage Devices, click the small filter icon, paste the LUN ID, and press Enter. The LUN must be mapped to all the hosts in the cluster with the same LUN number. SRM expects the datastore to be visible to all the hosts in the cluster because the placeholder VMs are created on any host in the cluster, which makes it imperative for every host to have access to the same datastore. Mapping hosts with different LUN (Logical Unit Number) numbers can cause several problems, but this KB's scope is limited to the issue discussed here. VMware by Broadcom support recommends using a uniform LUN number when mapping to hosts.
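As an alternative to the UI filter, a minimal sketch of the same check from each host's CLI is shown below (assuming the standard esxcli namespaces; the naa ID is a masked placeholder, substitute the device ID from your storage team):

[root@ESXi:~] # Confirm this host sees the device (replace the masked ID):
[root@ESXi:~] esxcli storage core device list -d naa.################################
[root@ESXi:~] # Check the LUN number the device is presented with on each path;
[root@ESXi:~] # it should be the same number on every host in the cluster:
[root@ESXi:~] esxcli storage core path list -d naa.################################ | grep "LUN:"

Run the same two commands on every host in the recovery cluster and compare the LUN numbers.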



2. You might need to increase the Disk.MaxLUN parameter on the ESXi hosts if your environment uses LUN IDs greater than 1023 (this limit varies with the ESXi version). See:

Change the Number of Scanned Storage Devices
Changing the Disk.MaxLUN parameter on ESXi Hosts
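A minimal CLI sketch of checking and raising Disk.MaxLUN (the value 4096 is only an example; pick a value above your highest LUN ID):

[root@ESXi:~] # Show the current Disk.MaxLUN value:
[root@ESXi:~] esxcli system settings advanced list -o /Disk/MaxLUN
[root@ESXi:~] # Raise it so higher LUN IDs are scanned (example value, adjust as needed):
[root@ESXi:~] esxcli system settings advanced set -o /Disk/MaxLUN --int-value 4096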



3. Rescan the Storage Adapters at the cluster object level and check whether the storage devices (LUNs) become visible.
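If you prefer the CLI, a per-host equivalent of the cluster-level rescan looks like this:

[root@ESXi:~] # Rescan all storage adapters on this host:
[root@ESXi:~] esxcli storage core adapter rescan --all
[root@ESXi:~] # Then probe for VMFS volumes on the newly visible devices:
[root@ESXi:~] vmkfstools -V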

4. If the LUNs become visible following a rescan, follow the instructions below. 

Check these settings at both SRM sites (typically under Site Pair > Configure > Advanced Settings > Storage Provider in the Site Recovery UI) and share the information with support. Record the values from both sites under the corresponding SRM VM name for comparison before changing them to the values mentioned below.

storageProvider.hostRescanDelaySec = 120
storageProvider.hostRescanRepeatCnt = 4
storageProvider.hostRescanTimeoutSec = 300 (the default is 300 seconds; you can increase this if necessary, up to 500 seconds)

CAUTION: These settings help absorb any delays the storage array introduces when mapping the LUNs to the hosts, but they can also significantly increase recovery time (RTO), depending on the values set, because SRM will take that much longer to complete a recovery plan.

5. LUN is mapped to the wrong cluster. If the LUN is mapped to the wrong cluster, you will have to remap the LUNs to the correct cluster, but you still will not be able to continue running the same recovery plan to complete the recovery process, because SRM will not recognize the new mapping. Review the resource mappings in SRM and determine whether the mistake was in the resource mapping or the LUN mapping.

A. Recover the VMs manually at the PRODUCTION site.
B. Delete the Protection Group and Recovery Plan (manual IP customization settings may be deleted; note them down first).
C. Recreate the Protection Group and Recovery Plan.
D. Delete and recreate the resource mappings in SRM, if they are wrong.

NOTE: These are only high-level steps; the full process is complex. Please raise a case with the SRM support team for assistance.

6. LUN is visible to the host but not mounting as a datastore. 

Check the VMkernel logs on the host to find out why the host is unable to mount it as a datastore:

[root@ESXi:~] cd /var/log
[root@ESXi:/var/log] grep -i naa.################################ vmkernel.log

The errors will show whether the VMFS datastore is corrupt, missing a partition table, and so on.
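As a first pass before engaging support, the following checks can narrow the problem down (the naa ID is a masked placeholder). Note that a recovered replica LUN is often detected by ESXi as a snapshot/unresolved VMFS volume, which the second command will reveal:

[root@ESXi:~] # Inspect the partition table; a healthy device should show a vmfs partition:
[root@ESXi:~] partedUtil getptbl /vmfs/devices/disks/naa.################################
[root@ESXi:~] # List VMFS volumes detected as snapshots (replica LUNs often appear here):
[root@ESXi:~] esxcli storage vmfs snapshot list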

Based on the errors found in the logs, an SRM engineer will collaborate with a vSphere Storage engineer and work in tandem to triage the issue and pave the way for an effective resolution.

Additional Information


Preliminary Checks

1. Replication direction?

What is the current replication flow? Is the replication flowing from Source > Target or from Target > Source? The replication direction can drastically change the troubleshooting approach. For example, if you wanted to recover the VMs at the target site but the LUN failover was incomplete or broken, SRM support would have to take a different approach to help you bring the workload back to the source site and power on all the VMs.

2. What was the last thing run in SRM?

Did you run a PLANNED MIGRATION, REPROTECT, or DISASTER RECOVERY? The workflows of these options are distinct. Check the SRM history to find out what was run and in what sequence. It also helps if the SRM administrator gives a detailed explanation of every change made from the SRM and storage array perspectives, so that the support engineer you are working with is brought up to speed on the current state of affairs and does not waste time that would further compromise your RTO.

3. Where do you want the VMs to be recovered now?

This depends entirely on the replication direction and can be changed with the help of a storage administrator who knows how to reverse a LUN replication and promote the LUN. If your storage administrators are not confident doing this, it is better to involve a support engineer from the storage vendor to assist with this task.