VMware Site Recovery Manager 8.x
VMware Live Site Recovery 9.x
VMware Live Recovery 9.x
1. LUN presentation to hosts is incorrect or incomplete.
2. Placeholder datastore mapping is wrong
3. Resource mapping is wrong
4. VMs are replicated to the wrong target vSAN datastore
5. vCenter license key has expired
1. LUN Presentation
2. Resource Pools
3. Replicating to the wrong vSAN Cluster (Datastore)
4. Virtual Volumes (vVols)
5. Expired vCenter License
Fixing TEST recovery
If you have run a TEST recovery, fixing this is easy. During a TEST recovery, SRM instructs the storage array or vSphere Replication to take snapshots, which can be cleaned up by running a CLEANUP task from SRM. Perform a CLEANUP and then do the following.
Fixing PLANNED MIGRATION
When a PLANNED MIGRATION is run, you cannot revert the operation until the workflow is completed. Whether you use VR or ABR, SRM attempts to shut down the protected virtual machines gracefully if VMware Tools is installed; if not, they are powered off. This is a critical situation for the customer, who is in the midst of moving workloads from site A > site B within a specified Recovery Time Objective (RTO). When a planned migration halts with this error, the only way forward is a manual failback, because the workflow cannot complete until all issues are fixed at the recovery site.
When recovery fails using VR, the replication of the virtual machines reverts to its state before the attempted recovery, which makes a manual failback much easier than with ABR.
vSphere Replication manual recovery steps
CAUTION: Intervening in an SRM recovery plan to make manual changes during workflow orchestration, or performing actions or tasks on the array while SRM is managing it, will break the SRM protection group (PG) and recovery plan (RP).
Storage based replication manual recovery steps (ABR)
If you are using array-based replication, the LUNs are failed over to the other site and mounted on the hosts/clusters to which they are presented, so all the VMs in the recovery plan require a manual failback.
Promoting and demoting apply to any LUN that is set up for replication. When we initiate a PLANNED MIGRATION (failover) from a site, the replicating LUN is demoted at the source site and promoted at the target site, which enables read/write access to that LUN and allows it to be mounted as a datastore.
LUN Promoted - When a LUN failover is triggered, the target LUN is marked as read/write by the array, whereas the source LUN becomes read-only.
LUN Demoted - When a LUN failover is triggered, the source LUN is marked as read-only and does not respond to reads or writes from the hosts it is connected to.
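The promote/demote behavior described above can be pictured as a small state model. The sketch below is purely illustrative: `ReplicatedLun`, `Role`, `failover`, and `failback` are hypothetical names and do not correspond to any SRM, SRA, or array API. It also assumes that a failback is the mirror image of a failover, which may not match how a particular array behaves.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    PROMOTED = "read/write"  # copy can be mounted as a datastore
    DEMOTED = "read-only"    # copy is not writable by hosts


@dataclass
class ReplicatedLun:
    """Hypothetical model of one replicated LUN (source copy and target copy)."""
    name: str
    source_role: Role = Role.PROMOTED
    target_role: Role = Role.DEMOTED

    def failover(self) -> None:
        """Planned migration: demote the source copy, promote the target copy."""
        self.source_role = Role.DEMOTED
        self.target_role = Role.PROMOTED

    def failback(self) -> None:
        """Assumed reverse of failover: the source copy becomes writable again."""
        self.source_role = Role.PROMOTED
        self.target_role = Role.DEMOTED

    def writable_site(self) -> str:
        """Name the side that currently holds the read/write copy."""
        return "source" if self.source_role is Role.PROMOTED else "target"


lun = ReplicatedLun("lun01")  # "lun01" is a made-up name
lun.failover()
print(lun.writable_site())
```

The point of the model is that exactly one side is writable at any time, which is why the datastore can only be mounted read/write at the site holding the promoted copy.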
This process is completely reversible, either through SRM or externally through the array. Because SRM requires a recovery workflow to complete successfully, it won't allow us to perform a REPROTECT until the recovery task completes. Since we can't fix this error, we must choose one of the two options below to fail back.
A. Reverse the replication of LUNs from target to source
B. Promote the source LUN
A. Reverse the replication of LUNs from target to source
If data consistency is not in question, you can choose this option. Check with the customer before proceeding; otherwise you might end up replicating corrupted blocks or bad data to the source LUN. This operation must be performed by a storage administrator or by the storage vendor. We DO NOT perform this step. This option is not required for this error, because the VMs are never recovered on the datastore and powered on, which is the very problem being addressed. It is useful when VMs have been powered on from the datastore and are in use, leading to data changes. It is mentioned here only because it falls within the context of this discussion.
RECOVERY SITE
PREPARING FOR FAILBACK
Promoting the source LUN(s)
If you don't want to replicate the LUN back to the source site due to factors like -
NOTE: Before proceeding, complete steps 1–6 under RECOVERY SITE (skip any that don’t apply). At the PRODUCTION SITE, promote all LUNs that were failed over to the recovery site to RW mode. If you’re not confident performing this task correctly, VMware by Broadcom recommends involving the storage vendor. Depending on the array, you may then need to break and re-establish the replication pairs.
PRODUCTION SITE
When replicating VMs to a recovery site that consists of many vSAN clusters/datastores, you must pay attention to the vSAN datastore you are replicating to and to the resource mappings. You will encounter this error if -
1. The target vSAN datastore you are replicating to belongs to a different host cluster than the one added to resource mappings.
2. The resource mappings are wrong, causing the placeholder VMs to be created on a different vSAN cluster that is not connected to the vSAN datastore containing the replicas.
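Both conditions above come down to the same check: the datastore holding a VM's replica must be accessible from the cluster that the resource mapping selects for that VM. The following is a minimal sketch of that check over hand-built dictionaries. The function name, the input shapes, and all VM, cluster, and datastore names are hypothetical; it does not query vCenter or SRM.

```python
def find_misplaced_vms(replica_datastore, mapped_cluster, cluster_datastores):
    """Return VMs whose replica datastore is not accessible from their mapped cluster.

    replica_datastore:  VM name -> vSAN datastore holding its replica
    mapped_cluster:     VM name -> recovery cluster selected by the resource mapping
    cluster_datastores: cluster name -> set of datastores that cluster can access
    All inputs are hypothetical, manually collected inventory data.
    """
    misplaced = []
    for vm, datastore in sorted(replica_datastore.items()):
        cluster = mapped_cluster.get(vm)
        reachable = cluster_datastores.get(cluster, set())
        if datastore not in reachable:
            misplaced.append((vm, datastore, cluster))
    return misplaced


# Example inventory (made-up names): app01 replicates to a datastore
# that its mapped cluster cannot see, so it is reported.
replica_datastore = {"app01": "vsanDatastore-B", "db01": "vsanDatastore-A"}
mapped_cluster = {"app01": "Cluster-A", "db01": "Cluster-A"}
cluster_datastores = {
    "Cluster-A": {"vsanDatastore-A"},
    "Cluster-B": {"vsanDatastore-B"},
}

for vm, datastore, cluster in find_misplaced_vms(
    replica_datastore, mapped_cluster, cluster_datastores
):
    print(f"{vm}: replica on {datastore}, but mapped to {cluster}")
```

A VM flagged by a check like this needs either its replication target or its resource mapping corrected before a recovery is attempted.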
TEST Recovery
After a TEST recovery, perform the following steps -
1. Run a cleanup.
2. Gracefully remove the VMs from the replication tab. DO NOT retain the seeds, because the replica seeds are sitting on the wrong vSAN datastore.
3. Reconfigure the VMs for replication and point them to the correct vSAN datastore.
4. Check and correct the resource mappings.
5. Unprotect the VMs from the protection group and restore all placeholder VMs.
6. Run a TEST and a PLANNED MIGRATION to check that everything works as expected.
To manually recover VMs from a PLANNED MIGRATION, refer to the steps under "vSphere Replication manual recovery steps".
VIRTUAL VOLUMES (vVols)
Failback on Virtual Volumes (vVol) datastores may fail with this error:
No hosts with hardware version '19' and datastore(s) '"xxx"' which are powered on and not in maintenance mode are available.
This issue is fixed in Site Recovery Manager 8.7.0.3 (released 05 SEP 2023, build 22359471).
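When triaging logs, it can help to pull the hardware version and datastore name out of the error message quoted above. The sketch below is one way to do that, assuming the message text matches the sample exactly. The function name and the regular expression are illustrative and not part of any SRM tooling, and the format used when several datastores are listed is not known here, so the datastore field is returned as a single string.

```python
import re

# Pattern built from the sample error text in this article (an assumption:
# real log lines may carry prefixes or differ in wording).
NO_HOSTS_PATTERN = re.compile(
    r"No hosts with hardware version '(?P<hw>\d+)' "
    r"and datastore\(s\) '(?P<ds>.*?)' which are powered on"
)


def parse_no_hosts_error(message):
    """Return (hardware_version, datastore_field), or None if the text does not match."""
    match = NO_HOSTS_PATTERN.search(message)
    if match is None:
        return None
    # The sample wraps the datastore name in double quotes; strip them.
    return int(match.group("hw")), match.group("ds").strip('"')


sample = (
    "No hosts with hardware version '19' and datastore(s) '\"xxx\"' "
    "which are powered on and not in maintenance mode are available."
)
print(parse_no_hosts_error(sample))
```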