Array based test recovery is taking time in hours to recover the VMs when ran simultaneous test recovery on multiple protection group
search cancel

Array based test recovery is taking time in hours to recover the VMs when ran simultaneous test recovery on multiple protection group

book

Article ID: 384430

calendar_today

Updated On:

Products

VMware Live Recovery

Issue/Introduction

Symptoms 

  • Running simultaneous Test recovery on the multiple protection group will take time in hours 
  • Maximum time spent on task "Creating Writable Storage Snapshot"  

Environment

Site Recovery Manager 8.x

Cause

Note: In order to troubleshoot the issue please validate /opt/vmware/support/logs/srm/vmware-dr.log 

  • Look for "testFailoverStart" keyword and follow the events

SRA sent the test Failover command to storage

2024-##-##T##:14:41.606+05:30 info vmware-dr[02135] [SRM@6876 sub=Storage.StorageManagerSraQueue opID=182219de-0ca8-4ec4-91dc-c97f58e4f032-test:ee9f:93d7:e408:76e1] Starting task 'Task-212888', queue 'TaskQueue-13' at: ExecuteSraCommand(1433)
2024-##-##T##:14:41.606+05:30 info vmware-dr[02135] [SRM@6876 sub=SraCommand opID=182219de-0ca8-4ec4-91dc-c97f58e4f032-test:ee9f:93d7:e408:76e1] Command line for testFailoverStart: /usr/bin/docker exec -i -u 660:660 6ad87b4f4bba /srm/sra/command
--> <?xml version="1.0" encoding="UTF-8"?>
--> <Command xmlns="http://www.vmware.com/srm/sra/v2">
-->   <Name>testFailoverStart</Name>
-->   <OutputFile>/tmp/sra-output-64-231</OutputFile>
-->   <StatusFile>/tmp/sra-status-65-203</StatusFile>
-->   <LogLevel>verbose</LogLevel>
-->   <LogDirectory>/srm/sra/log</LogDirectory>
-->   <Connections>
[cont..]
-->     </Connection>
-->   </Connections>
-->   <TestFailoverStartParameters>
-->     <ArrayId>320581</ArrayId>
-->     <TargetGroups>
-->       <TargetGroup key="AP######R2"/>
-->     </TargetGroups>
-->   </TestFailoverStartParameters>
--> </Command>
-->

  • Validate the Storage Snapshot creation events and the time taken for the task

--> [RMSRA20][Th# D## 1# 00:14:43 2024]: [testfailover] : Commence: SnapShot Copy [ lunkey=2308, dev_group=AP######R2] is being created for testFailoverStart request on array [ 320581 ].
--> [RMSRA20][Th# D## 1# 00:14:43 2024]: [testfailover] : SnapShot Volume [ SnapShot_Lunwwn=600##########################904 ] is being split and exposed via '/usr/bin/pairsplit -IH -rw -dg 320581 2308 1 1>/dev/null' on 'SnapShot LDEV=2308, VOLST=S-VOL, STAT=PAIR, COPYMD=-'.
--> [RMSRA20][Th# D## 1# 00:15:05 2024]: [testfailover] : Complete : SnapShot Copy [ lunkey=2308, dev_group=AP######R2] has been created and presented for testFailoverStart request on array [ 320581 ].
--> [RMSRA20][Th# D## 1# 00:15:05 2024]: [testfailover] : Commence: SnapShot Copy [ lunkey=2309, dev_group=AP######R2] is being created for testFailoverStart request on array [ 320581 ].
--> [RMSRA20][TTh# D## 1# 00:15:05 2024]: [testfailover] : SnapShot Volume [ SnapShot_Lunwwn=600##########################905 ] will being split and exposed automatically as CT Group on 'SnapShot LDEV=2309, VOLST=S-VOL, STAT=PAIR, COPYMD=-'.
--> [RMSRA20][Th# D## 1# 00:15:05 2024]: [testfailover] : Complete : SnapShot Copy [ lunkey=2309, dev_group=AP######R2] has been created and presented for testFailoverStart request on array [ 320581 ].
--> [RMSRA20][Th# D## 1# 00:15:05 2024]: [testfailover] : Commence: SnapShot Copy [ lunkey=2310, dev_group=AP######R2] is being created for testFailoverStart request on array [ 320581 ].
--> [RMSRA20][Th# D## 1# 00:15:05 2024]: [testfailover] : SnapShot Volume [ SnapShot_Lunwwn=600##########################906 ] will being split and exposed automatically as CT Group on 'SnapShot LDEV=2310, VOLST=S-VOL, STAT=PAIR, COPYMD=-'.
--> [RMSRA20][Th# D## 1# 00:15:05 2024]: [testfailover] : Complete : SnapShot Copy [ lunkey=2310, dev_group=AP######R2] has been created and presented for testFailoverStart request on array [ 320581 ].
--> [RMSRA20][Th# D## 1# 00:15:05 2024]: [testfailover] : Commence: SnapShot Copy [ lunkey=2311, dev_group=AP######R2] is being created for testFailoverStart request on array [ 320581 ].
--> [RMSRA20][Th# D## 1# 00:15:05 2024]: [testfailover] : SnapShot Volume [ SnapShot_Lunwwn=600##########################907 ] will being split and exposed automatically as CT Group on 'SnapShot LDEV=2311, VOLST=S-VOL, STAT=PAIR, COPYMD=-'.
--> [RMSRA20][Th# D## 1# 00:15:05 2024]: [testfailover] : Complete : SnapShot Copy [ lunkey=2311, dev_group=AP######R2] has been created and presented for testFailoverStart request on array [ 320581 ].
--> [RMSRA20][Th# D## 1# 00:15:05 2024]: [exit_main]  : returns with [0]

  • The Storage Writable snapshot creation task has been completed in 1 minute. 

2024-12-12T00:15:03.879+05:30 verbose vmware-dr[01632] [SRM@6876 sub=SraCommand opID=182219de-0ca8-4ec4-91dc-c97f58e4f032-test:ee9f:93d7:e408:76e1] testFailoverStart responded with:

--> <?xml version="1.0" encoding="UTF-8" ?>

--> <Response xmlns="http://www.vmware.com/srm/sra/v2">

-->   <TestFailoverStartResults>

-->     <TargetGroups>

-->       <TargetGroup key="AP######R2">

-->         <Snapshot id="AP######R2">

-->           <RecoveryPoint id="latest">

-->             <Name>APPSERVER2</Name>

-->             <Time>2010-07-10T02:10:00.0</Time>

-->           </RecoveryPoint>

-->           <TargetDevices>

-->             <TargetDevice key="2308">

-->               <Snapshot id="2308">

-->                 <Identity>

-->                   <Wwn>60:##:##:##:##:##:##:##:##:##:##:##:##:##:09:04</Wwn>

-->                 </Identity>

-->               </Snapshot>

-->             </TargetDevice>

-->             <TargetDevice key="2309">

-->               <Snapshot id="2309">

-->                 <Identity>

-->                   <Wwn>60:##:##:##:##:##:##:##:##:##:##:##:##:##:09:05</Wwn>

-->                 </Identity>

-->               </Snapshot>

-->             </TargetDevice>

-->             <TargetDevice key="2310">

-->               <Snapshot id="2310">

-->                 <Identity>

-->                   <Wwn>60:##:##:##:##:##:##:##:##:##:##:##:##:##:09:06</Wwn>

-->                 </Identity>

-->               </Snapshot>

-->             </TargetDevice>

-->             <TargetDevice key="2311">

-->               <Snapshot id="2311">

-->                 <Identity>

-->                   <Wwn>60:##:##:##:##:##:##:##:##:##:##:##:##:##:09:07</Wwn>

-->                 </Identity>

-->               </Snapshot>

-->             </TargetDevice>

-->           </TargetDevices>

-->         </Snapshot>

-->       </TargetGroup>

-->     </TargetGroups>

-->   </TestFailoverStartResults>

--> </Response>

-->

  • List of Snapshots got created during the task

2024-12-12T00:15:21.472+05:30 info vmware-dr[02143] [SRM@6876 sub=AbrRecoveryEngine opID=182219de-0ca8-4ec4-91dc-c97f58e4f032-test:ee9f:93d7:e408:76e1] Completed HBA rescan for host 'host-5964:#####21#0318.####.###############.com'

2024-12-12T00:15:21.643+05:30 info vmware-dr[02188] [SRM@6876 sub=AbrRecoveryEngine opID=182219de-0ca8-4ec4-91dc-c97f58e4f032-test:ee9f:93d7:e408:76e1] Completed HBA rescan for host 'host-5969:#####19#0318.####.###############.com'

  • The HBA rescan occurred on #####21#0318.####.###############.com and #####19c0318.####.###############.com
  • But did not occurred on the host #####20#0318.####.###############.com
  • Post Validating the impacted host logs i.e. #####20#0318.####.###############.com

verbose vmware-dr[02173] [SRM@6876 sub=PerformanceMonitor opID=16273c48-4d31-4a9d-8175-375ec1b250e7-test:5d32:bae5:8bba:c8d8] Completed AttachScsiLuns. 'AttachEx from 'host-5962:#####20#0318.####.###############.com' :

verbose vmware-dr[02184] [SRM@6876 sub=PerformanceMonitor opID=b18c7ab0-e95d-417f-93f5-49cf00edb46b-test:5a8f:ea61:e864:1961] Completed AttachScsiLuns. 'AttachEx from 'host-5962:#####20#0318.####.###############.com' :

error vmware-dr[01639] [SRM@6876 sub=AbrRecoveryEngine opID=71c70e0a-47d9-4c71-a7ea-8c212776da84-test:eae2:b377:8589:f77d] Failed to rescan HBAs on host 'host-5962:#####20#0318.####.###############.com':

-->    hostName = "#####20#0318.####.###############.com",

error vmware-dr[02187] [SRM@6876 sub=AbrRecoveryEngine opID=16273c48-4d31-4a9d-8175-375ec1b250e7-test:5d32:bae5:8bba:c8d8] Failed to rescan HBAs on host 'host-5962:#####20#0318.####.###############.com':

-->    hostName = "#####20#0318.####.###############.com",

verbose vmware-dr[02125] [SRM@6876 sub=PerformanceMonitor opID=71c70e0a-47d9-4c71-a7ea-8c212776da84-test:eae2:b377:8589:f77d] Completed AttachScsiLuns. 'AttachEx from 'host-5962:#####20#0318.####.###############.com' :

-->             hostName = "#####20#0318.####.###############.com",

-->             hostName = "#####20#0318.####.###############.com",

-->                   hostName = "#####20#0318.####.###############.com",

  • On the host we see Failed to rescan the HBAs errors. 

 

Resolution

In the case rebooting the host "#####20#0318.####.###############.com" addressed the issue.