[VMC on AWS] Recovery plan fails for a VM with Operation Time Out

Article ID: 335245

Products

VMware Live Recovery VMware Cloud on AWS

Issue/Introduction

This article provides steps to troubleshoot and resolve a recovery plan failure for a VM caused by an operation timeout error.

Symptoms:

  • While running a recovery plan with multiple VMs, the recovery plan fails due to an operation time out error.
  • When adding the failed VM to a new protection group and attempting to initiate replication, the operation still fails.
  • The VM option to power on is greyed out in the vCenter UI.



Cause

Planned migration for the recovery plan fails at the Configure Storage step for a specific VM with the error "Operation timed out: 900 seconds."

This occurs when the VSR migration is invoked while TRIM/UNMAP is running with large deltas. The TRIM/UNMAP activity triggers a vSAN disk consolidation, and VSR reaches its default timeout period before the consolidation completes.
The following entries can be seen in the HBR logs:

2023-07-22T15:43:13.857Z esx-xx.vmwarevmc.com Hostd: info hostd[2107594] [Originator@6876 sub=Vimsvc.TaskManager opID=hsl-46f677eb-5580 user=vpxuser] Task Created : haTask--vim.VirtualDiskManager.consolidateDisks-2139137619
2023-07-22T16:55:53.663Z esx-xx.vmwarevmc.com Hostd: info hostd[2103242] [Originator@6876 sub=Vimsvc.TaskManager opID=hsl-46f677eb-5580 user=vpxuser] Task Completed : haTask--vim.VirtualDiskManager.consolidateDisks-2139137619 Status success

In this example, a TRIM/UNMAP disk consolidation task started on the VM before the recovery plan failover took place and ran for more than 1 hour (about 1 hour 13 minutes according to the log entries above), causing VSR to time out the recovery plan.
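As a quick check of the timing described above, the following minimal Python sketch computes how long the consolidateDisks task ran from the two timestamps in the log excerpt and compares that to the 900-second timeout reported in the error. The function names are illustrative only and are not part of any VMware tool.

```python
from datetime import datetime

# Timestamps copied from the "Task Created" and "Task Completed" log lines above.
TASK_CREATED = "2023-07-22T15:43:13.857Z"
TASK_COMPLETED = "2023-07-22T16:55:53.663Z"

# Timeout reported in the recovery plan error ("Operation timed out: 900 seconds").
REPORTED_TIMEOUT_SECONDS = 900


def parse_log_timestamp(stamp: str) -> datetime:
    """Parse a UTC timestamp in the format used by the log excerpt."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")


def task_duration_seconds(created: str, completed: str) -> float:
    """Return the elapsed time between two log timestamps, in seconds."""
    delta = parse_log_timestamp(completed) - parse_log_timestamp(created)
    return delta.total_seconds()


duration = task_duration_seconds(TASK_CREATED, TASK_COMPLETED)
print(f"consolidateDisks ran for {duration:.0f} s ({duration / 60:.1f} min)")
print(f"Exceeds reported timeout: {duration > REPORTED_TIMEOUT_SECONDS}")
```

The same comparison can be repeated with timestamps from other consolidation tasks to judge whether a raised timeout would have been sufficient.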

Resolution

Disable TRIM/UNMAP for the Windows VM at the Guest OS level before the recovery plan starts, so that no disk consolidation task begins while the recovery plan is running.
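The article does not name the exact guest command. On Windows, delete notifications (TRIM/UNMAP) are commonly controlled with `fsutil behavior set DisableDeleteNotify`, where `1` disables them and `0` re-enables them. Treating that as the intended mechanism is an assumption. The sketch below builds the command and, by default, only returns it (a dry run). The helper names are hypothetical, and actually applying the change requires an elevated session inside the Windows guest.

```python
import subprocess
import sys

# Assumption: TRIM/UNMAP in the Windows guest is toggled through this fsutil setting.
FSUTIL_KEY = "DisableDeleteNotify"


def build_fsutil_command(disable_trim: bool) -> list:
    """Build the fsutil command: value 1 disables TRIM/UNMAP, 0 re-enables it."""
    value = "1" if disable_trim else "0"
    return ["fsutil", "behavior", "set", FSUTIL_KEY, value]


def set_trim(disable_trim: bool, dry_run: bool = True) -> list:
    """Return the command; execute it only on Windows when dry_run is False."""
    command = build_fsutil_command(disable_trim)
    if not dry_run and sys.platform == "win32":
        # Requires an elevated (administrator) session in the guest.
        subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    # Dry run: print the command that would disable TRIM/UNMAP.
    print(" ".join(set_trim(disable_trim=True)))
```

If TRIM/UNMAP is needed for space reclamation, the same helper with `disable_trim=False` yields the command to re-enable it once the recovery plan has completed.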

Optionally, increase the SRM timeout values in the Advanced Settings of the SRM UI on both sites of the pairing. The maximum values can be set as shown below:
1. remoteManager.defaultTimeout = 3600
2. remoteManager.taskDefaultTimeout = 3600
3. remoteManager.taskProgressDefaultTimeout = 540
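Because the values must match on both sites, it can help to record what each site currently shows and compare it against the list above. The following minimal sketch does only that comparison. It does not read from SRM, the values are entered by hand, and the sample site values are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Target values taken from the list above.
TARGET_TIMEOUTS: Dict[str, int] = {
    "remoteManager.defaultTimeout": 3600,
    "remoteManager.taskDefaultTimeout": 3600,
    "remoteManager.taskProgressDefaultTimeout": 540,
}


def find_mismatches(site_settings: Dict[str, int]) -> Dict[str, Tuple[Optional[int], int]]:
    """Return {setting: (current value or None, target value)} for every deviation."""
    mismatches = {}
    for key, target in TARGET_TIMEOUTS.items():
        current = site_settings.get(key)
        if current != target:
            mismatches[key] = (current, target)
    return mismatches


# Hypothetical values noted manually from each site's Advanced Settings page.
protected_site = dict(TARGET_TIMEOUTS)
recovery_site = {
    "remoteManager.defaultTimeout": 300,
    "remoteManager.taskDefaultTimeout": 3600,
}

for name, settings in (("protected", protected_site), ("recovery", recovery_site)):
    for key, (current, target) in find_mismatches(settings).items():
        print(f"{name} site: {key} is {current}, expected {target}")
```

A setting that is missing from the recorded values is reported with a current value of `None`, which flags entries that still need to be checked in the UI.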

Workaround:
Engage VMware Support to have KB1029926 executed on the VM in question. After this is completed, reconfigure the VM for replication.

Additional Information

Impact/Risks:
The customer will be unable to execute the recovery plan in full.