ESXi Hosts across specific clusters may get into Non-Responsive state during SRM Test Recovery Plan Cleanup Task

Article ID: 312757



Products

VMware Live Recovery

Issue/Introduction

This article covers a specific scenario in which storage devices have been presented to all ESXi hosts without regard to the SRM cluster-level mappings for the VMs to be recovered. Please validate the symptoms before proceeding with the solution provided in this article.

Symptoms:
  • The hosts that become non-responsive report loss of heartbeat or connectivity to the snapshot volumes that were presented during the SRM Test Recovery

Example of entries seen in the ESXi host logs during the time frame of the issue:

2021-02-22T15:28:58.379Z info hostd[2101930] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 5155 : Lost access to volume 6033c049-fabc4d4a-55f3-48df377555b0 (snap-567d4f75-VMFS-Datastore1) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
2021-02-22T15:28:58.380Z info hostd[2101092] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 5156 : Lost access to volume 6033c04a-edc5d9a2-4196-48df37800ae0 (snap-5c713ae1-VMFS-Datastore2) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
  • The volumes reporting errors during the SRM Test Cleanup task had no mapping configured in SRM, and no VMs were registered on this specific cluster
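
To help validate the first symptom, the following is a minimal Python sketch that scans hostd log lines for the "Lost access to volume" event shown above and keeps only volumes whose label starts with "snap-". The regular expression is inferred solely from the two sample lines in this article, and the function name find_lost_snapshot_volumes is hypothetical. Other ESXi builds may format the event differently, so treat this as a starting point rather than a supported tool.

```python
import re

# Assumption: the pattern is derived only from the two sample lines in this
# article; it is not an official description of the hostd log format.
LOST_ACCESS = re.compile(
    r"^(?P<ts>\S+)\s+info\s+hostd\[\d+\].*?"
    r"Lost access to volume (?P<uuid>[0-9a-f-]+) \((?P<label>[^)]+)\)"
)

# The two example entries quoted in this article.
SAMPLE_LINES = [
    "2021-02-22T15:28:58.379Z info hostd[2101930] [Originator@6876 "
    "sub=Vimsvc.ha-eventmgr] Event 5155 : Lost access to volume "
    "6033c049-fabc4d4a-55f3-48df377555b0 (snap-567d4f75-VMFS-Datastore1) "
    "due to connectivity issues.",
    "2021-02-22T15:28:58.380Z info hostd[2101092] [Originator@6876 "
    "sub=Vimsvc.ha-eventmgr] Event 5156 : Lost access to volume "
    "6033c04a-edc5d9a2-4196-48df37800ae0 (snap-5c713ae1-VMFS-Datastore2) "
    "due to connectivity issues.",
]


def find_lost_snapshot_volumes(lines):
    """Return (timestamp, volume UUID, label) for lost-access events
    that refer to volumes labelled with a 'snap-' prefix."""
    hits = []
    for line in lines:
        match = LOST_ACCESS.search(line)
        if match and match.group("label").startswith("snap-"):
            hits.append(
                (match.group("ts"), match.group("uuid"), match.group("label"))
            )
    return hits


if __name__ == "__main__":
    for ts, uuid, label in find_lost_snapshot_volumes(SAMPLE_LINES):
        print(ts, uuid, label)
```

If the events found this way fall within the time frame of the cleanup task and the host belongs to a cluster with no SRM mappings, the scenario described in this article is a likely match.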


Environment

VMware vCenter Site Recovery Manager 8.x

Cause

  • By design, SRM considers only the hosts for which mappings have been configured and on which the VMs are registered during test or failover
  • During the test workflow, a snap-device (presented with a copy of the production VMFS volume) is created at the recovery site. SRM then triggers the related hosts to rescan storage so that the snap-device is attached and the volume is mounted
  • During the subsequent cleanup workflow, the volume is unmounted and the snap-device is detached from the related hosts, after which the snap-device is removed
  • If, between the test and cleanup workflows, the snap-device is mounted on unrelated hosts by a manual rescan or by a rescan triggered by another workflow, the cleanup workflow cannot unmount it from those hosts. These hosts then lose access to the snap-device when it is removed
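
The sequence above can be illustrated with a small set-based model in Python. This is only a sketch of the reasoning, not SRM code: the function hosts_losing_access and the host names are hypothetical, and the model assumes the snap-device is presented to every host that rescans while it exists.

```python
def hosts_losing_access(mapped_hosts, rescanned_hosts):
    """Model one snap-device across the test and cleanup workflows.

    mapped_hosts:    hosts covered by SRM mappings (rescanned by the test
                     workflow, then unmounted/detached by cleanup).
    rescanned_hosts: hosts that rescanned storage for any other reason
                     while the snap-device existed (assumed to see it).
    Returns the hosts that still have the volume mounted when the
    snap-device is removed, and therefore lose access to it.
    """
    # After the test workflow plus any stray rescans, all of these hosts
    # have the snapshot volume mounted.
    mounted = set(mapped_hosts) | set(rescanned_hosts)
    # The cleanup workflow only handles the hosts SRM considers related.
    cleaned = set(mapped_hosts)
    return sorted(mounted - cleaned)


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    print(hosts_losing_access({"esx-a1", "esx-a2"}, {"esx-a2", "esx-b1"}))
```

In this model the only host returned is the one outside the mapped set, which corresponds to the unrelated hosts that report lost access during cleanup.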

Resolution

  • Consult your storage vendor and the SRA administration guide before presenting storage to ESXi host clusters in an environment where a DR solution is implemented
  • Presenting storage to clusters of hosts that do not need it while a DR solution is in place can lead to outages when a recovery is performed


Workaround:
  • By design, SRM does not consider hosts on which no VM workloads are registered, even if the storage gets mounted to them when a master plan or multiple plans are run
  • To work around this issue, either fix the storage presentation on the storage side, or
  • Deploy some VMs and configure SRM cluster-level mappings for the affected cluster so that SRM can register them during test recovery and perform a graceful cleanup