"Could not obtain exclusive access to the vApp for replication..." error when failing over multiple VMs concurrently in Cloud Director Availability



Article ID: 315161


Updated On:

Products

VMware Cloud Director

Issue/Introduction

Symptoms:

Performing a failover, test failover, or test cleanup for multiple VMs in the same protection group or vApp simultaneously fails.
In /opt/vmware/h4/cloud/log/cloud.log on the Cloud Replication Management Appliance, you see an error similar to the following:

2020-05-18 09:02:17.005 ERROR - [UI-6d06ab27-####-####-####-########077-ro-iJ_n93] [c4-scheduler-1] com.vmware.h4.jobengine.JobExecution : Task ########-####-####-####-########93ec (WorkflowInfo{type='failoverTest', resourceType='vmReplication', resourceId='C4-cbe6d486-####-####-####-########2be', isPrivate=false, resourceName='Test8RHE'}) has failed

com.vmware.h4.cloud.api.exceptions.VappLockedException: Could not obtain exclusive access to the vApp for replication 'C4VAPP-########-####-####-####-########dcce' because another failover for a vm from the same vApp has locked it.

at com.vmware.h4.cloud.job.VmFailoverJob.lambda$importIntoVcd$6(VmFailoverJob.java:350)

at com.vmware.h4.jobengine.lock.JobLock.lambda$lock$2(JobLock.java:92)

at com.vmware.h4.jobengine.lock.LockManager.invokeHandler(LockManager.java:286)

at com.vmware.h4.jobengine.lock.LockManager.expire(LockManager.java:269)

at com.vmware.h4.jobengine.lock.LockManager.lambda$obtain$1(LockManager.java:179)

at com.vmware.h4.common.mdc.MDCRunnableWrapper.run(MDCRunnableWrapper.java:30)

at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)

at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)

at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)

at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)

at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)

at java.base/java.lang.Thread.run(Thread.java:834)

Note: The preceding log excerpt is only an example. The date, time, and environment-specific values may differ in your environment.
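To check how many tasks hit this error and for which vApp replications, the log can be scanned for the exception message. The following is a minimal sketch, not a supported tool: the function name `find_locked_vapps` and the sample ID `C4VAPP-example-1` are made up for illustration, and the pattern assumes the message text matches the excerpt above.

```python
import re

# Pattern built from the exception message in the log excerpt above.
# Assumption: your log lines use the same wording.
VAPP_LOCKED = re.compile(
    r"VappLockedException: Could not obtain exclusive access to the vApp "
    r"for replication '([^']+)'"
)


def find_locked_vapps(lines):
    """Count VappLockedException occurrences per vApp replication ID."""
    counts = {}
    for line in lines:
        match = VAPP_LOCKED.search(line)
        if match:
            vapp_id = match.group(1)
            counts[vapp_id] = counts.get(vapp_id, 0) + 1
    return counts


# Illustrative input only; the replication ID below is a placeholder.
sample = [
    "com.vmware.h4.cloud.api.exceptions.VappLockedException: Could not "
    "obtain exclusive access to the vApp for replication "
    "'C4VAPP-example-1' because another failover for a vm from the same "
    "vApp has locked it.",
    "at com.vmware.h4.jobengine.lock.JobLock.lambda$lock$2(JobLock.java:92)",
]
print(find_locked_vapps(sample))
```

On the appliance, the same function can be given an open file object for /opt/vmware/h4/cloud/log/cloud.log instead of the sample list, because it only iterates over lines.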

Environment

VMware vCloud Availability 3.0.x
VMware vCloud Availability 3.5.x
VMware Cloud Director Availability 4.x

Cause

This issue occurs because Cloud Director Availability starts all VM failover jobs in parallel, but the tasks become serialized in Cloud Director as each one attempts to obtain the lock for the same target vApp.

The first failover job that obtains that lock does so for the whole failover process and the rest of the VMs wait. Once the lock is released, the next job that manages to obtain it proceeds with its failover and so on until all VMs have been failed over.

The default timeout for obtaining a vApp lock is 10 minutes. If a VM's failover job does not acquire the lock within these 10 minutes, the failover task fails.
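The effect of the timeout can be illustrated with a simplified model. This is a sketch under stated assumptions, not the product's scheduler: all jobs start at time 0, each job holds the lock for a fixed duration, jobs obtain the lock in list order (the real order is not guaranteed), and a job fails if it is still waiting once the timeout has passed. The function name `simulate_failovers` and the example durations are hypothetical.

```python
def simulate_failovers(durations_min, lock_timeout_min=10.0):
    """Model parallel failover jobs that serialize on one vApp lock.

    durations_min: how long each job holds the lock, in minutes (assumed).
    Returns one "ok" or "failed" entry per job, in list order.
    """
    lock_free_at = 0.0  # time at which the lock next becomes available
    results = []
    for duration in durations_min:
        if lock_free_at > lock_timeout_min:
            # The job waited longer than the timeout and never got the lock.
            results.append("failed")
        else:
            # The job obtains the lock and holds it for its whole failover.
            results.append("ok")
            lock_free_at += duration
    return results


# Four VMs in one vApp, each assumed to take 4 minutes to fail over.
print(simulate_failovers([4, 4, 4, 4]))
```

In this model the first three jobs obtain the lock at 0, 4, and 8 minutes, while the fourth would have to wait 12 minutes and therefore fails. The more VMs a vApp contains and the longer each failover takes, the more jobs end up past the 10-minute limit.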

Resolution

This is a known issue affecting Cloud Director Availability 4.x.
Currently, there is no resolution.

Workaround:
To work around this issue, perform the failover or test failover one VM at a time for each of the failed VMs.

Note: Do not retry the failover at the vApp level, because doing so deletes the failovers that already succeeded.
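If the retries are scripted, the key point is to process only the failed VMs and to run them strictly one after another. The following sketch shows that pattern only. The callable `failover_vm` is a hypothetical placeholder for whatever mechanism you use to trigger a single-VM failover; it is not a Cloud Director Availability API, and the VM names are invented.

```python
def failover_one_at_a_time(failed_vms, failover_vm):
    """Call failover_vm for each failed VM in turn, never in parallel.

    failover_vm is a caller-supplied function (hypothetical here) that
    blocks until the failover of a single VM has finished.
    """
    outcomes = {}
    for vm in failed_vms:
        try:
            failover_vm(vm)  # the next VM starts only after this returns
            outcomes[vm] = "ok"
        except Exception as exc:
            # Record the failure and continue with the remaining VMs.
            outcomes[vm] = f"failed: {exc}"
    return outcomes


# Stand-in for a real trigger, used only to demonstrate the call order.
call_order = []


def fake_failover(vm_name):
    call_order.append(vm_name)


print(failover_one_at_a_time(["vm-a", "vm-b"], fake_failover))
```

Because each call completes before the next begins, only one job at a time requests the vApp lock, so no job waits on another.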
