ESXi Hosts across specific clusters may get into Non-Responsive state during SRM Test Recovery Plan Cleanup Task
Article ID: 312757
Products
VMware Live Recovery
Issue/Introduction
This article covers a specific scenario in which storage devices have been presented to all ESXi hosts without regard to the SRM cluster-level mappings for the VMs to be recovered. Validate the symptoms before proceeding with the solution provided in this article.
Symptoms:
Hosts that become non-responsive report loss of heartbeat or connectivity to snapshot volumes that were presented during the SRM test recovery.
Example of entries seen in the ESXi host logs during the time frame of the issue:
2021-02-22T15:28:58.379Z info hostd[2101930] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 5155 : Lost access to volume 6033c049-fabc4d4a-55f3-48df377555b0 (snap-567d4f75-VMFS-Datastore1) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
2021-02-22T15:28:58.380Z info hostd[2101092] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 5156 : Lost access to volume 6033c04a-edc5d9a2-4196-48df37800ae0 (snap-5c713ae1-VMFS-Datastore2) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
The volumes that report errors during the SRM test cleanup task had no mapping configured in SRM for this cluster, and no VMs were registered on this cluster during the test.
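To help validate the symptoms, the "Lost access to volume" events can be pulled out of a saved copy of the host log with a short script. The following is a minimal sketch, not a supported tool: the regular expression is modeled only on the two sample lines shown in this article, and the function name find_lost_snapshot_volumes is a hypothetical helper introduced for illustration. Other ESXi builds may format the event differently.

```python
import re

# Pattern modeled on the sample hostd lines in this article (assumption:
# other builds may word or lay out the event differently).
LOST_ACCESS = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\S+)\s.*?Lost access to volume\s+"
    r"(?P<uuid>[0-9a-f]{8}-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{12})\s+"
    r"\((?P<label>[^)]+)\)"
)


def find_lost_snapshot_volumes(lines):
    """Return (timestamp, uuid, label) for each lost-access event on a snap- volume.

    Hypothetical helper: only volumes whose label starts with "snap-" are
    reported, since those are the snapshot copies presented during the test.
    """
    hits = []
    for line in lines:
        match = LOST_ACCESS.search(line)
        if match and match.group("label").startswith("snap-"):
            hits.append(
                (match.group("timestamp"), match.group("uuid"), match.group("label"))
            )
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py < saved_hostd_log.txt
    for timestamp, uuid, label in find_lost_snapshot_volumes(sys.stdin):
        print(timestamp, uuid, label)
```

If the reported labels all belong to snapshot volumes from the test recovery and the host sits in a cluster with no SRM mapping, the scenario matches this article.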
Environment
VMware vCenter Site Recovery Manager 8.x
Cause
By design, SRM acts only on the hosts for which mappings have been configured and on which the VMs are registered during a test or failover.
During the test workflow, a snap-device (presented with a copy of the production VMFS volume) is created at the recovery site. SRM then triggers the related hosts to rescan storage so that the snap-device is attached and the volume is mounted.
During the subsequent cleanup workflow, the volume is unmounted and the snap-device is detached from the related hosts, after which the snap-device is removed.
If, between the test and cleanup workflows, the snap-device is mounted on unrelated hosts by a manual rescan or by a rescan triggered by another workflow, the cleanup workflow cannot unmount it from those hosts. These hosts then lose access to the snap-device when it is removed.
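The condition described above comes down to a set difference: the hosts that have a snapshot volume mounted, minus the hosts covered by SRM mappings. The following is a minimal sketch of that comparison. It does not call any SRM or vSphere API; the input data is assumed to have been collected separately, and the function name, host names, and volume labels in the usage section are hypothetical.

```python
def find_unrelated_mounts(mounted_hosts_by_volume, mapped_hosts):
    """Return {volume: sorted list of hosts} for hosts that mounted a
    snapshot volume but are outside the SRM-mapped clusters.

    mounted_hosts_by_volume: dict of volume label -> iterable of host names
    mapped_hosts: iterable of host names in clusters with SRM mappings
    Both inputs are assumed to be gathered by the administrator beforehand.
    """
    mapped = set(mapped_hosts)
    report = {}
    for volume, hosts in mounted_hosts_by_volume.items():
        unrelated = sorted(set(hosts) - mapped)
        if unrelated:
            report[volume] = unrelated
    return report


if __name__ == "__main__":
    # Hypothetical inventory data for illustration only.
    mounts = {
        "snap-567d4f75-VMFS-Datastore1": ["esx-a1", "esx-a2", "esx-b1"],
        "snap-5c713ae1-VMFS-Datastore2": ["esx-a1", "esx-a2"],
    }
    srm_mapped = ["esx-a1", "esx-a2"]
    for volume, hosts in find_unrelated_mounts(mounts, srm_mapped).items():
        print(volume, "is also mounted on unrelated hosts:", ", ".join(hosts))
```

Any host reported by such a check is one that the cleanup workflow would not unmount from, and therefore a candidate for losing access when the snap-device is removed.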
Resolution
Consult your storage vendor and the SRA administration guide before presenting storage to ESXi host clusters when a DR solution is implemented.
Presenting storage to clusters of hosts that do not need it while a DR solution is in place can lead to outages when a recovery is performed.
Workaround:
By design, SRM does not act on hosts where no VM workloads are registered, even if the storage is mounted on them when a master plan or multiple plans are run.
To work around this issue, either correct the storage presentation on the storage side, or
Deploy some VMs and configure SRM cluster-level mappings for the affected cluster so that SRM registers the VMs during the test recovery and performs a graceful cleanup.