Symptoms:
2020-05-18 09:02:17.005 ERROR - [UI-6d06ab27-####-####-####-########077-ro-iJ_n93] [c4-scheduler-1] com.vmware.h4.jobengine.JobExecution : Task ########-####-####-####-########93ec (WorkflowInfo{type='failoverTest', resourceType='vmReplication', resourceId='C4-cbe6d486-####-####-####-########2be', isPrivate=false, resourceName='Test8RHE'}) has failed
com.vmware.h4.cloud.api.exceptions.VappLockedException: Could not obtain exclusive access to the vApp for replication 'C4VAPP-########-####-####-####-########dcce' because another failover for a vm from the same vApp has locked it.
at com.vmware.h4.cloud.job.VmFailoverJob.lambda$importIntoVcd$6(VmFailoverJob.java:350)
at com.vmware.h4.jobengine.lock.JobLock.lambda$lock$2(JobLock.java:92)
at com.vmware.h4.jobengine.lock.LockManager.invokeHandler(LockManager.java:286)
at com.vmware.h4.jobengine.lock.LockManager.expire(LockManager.java:269)
at com.vmware.h4.jobengine.lock.LockManager.lambda$obtain$1(LockManager.java:179)
at com.vmware.h4.common.mdc.MDCRunnableWrapper.run(MDCRunnableWrapper.java:30)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
Cause:
This issue occurs because Cloud Director Availability starts all VM failover jobs in parallel, but the tasks are serialized in Cloud Director as each one attempts to obtain a lock on the same target vApp.
The first failover job to obtain the lock holds it for its entire failover process while the remaining VMs wait. Once the lock is released, the next job that obtains it proceeds with its failover, and so on until all VMs have been failed over.
The default timeout for obtaining a vApp lock is 10 minutes. If a VM fails to acquire the lock within these 10 minutes, its failover task fails.
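The effect of this serialization can be illustrated with a small model. The following Python sketch is not product code: the function name, the assumption that every job starts waiting at the same moment, and the assumption that the lock is granted in list order are all simplifications made for illustration. Only the 10-minute timeout comes from the description above.

```python
from typing import List

LOCK_TIMEOUT_MIN = 10  # default vApp lock timeout described above


def simulate_vapp_failover(durations_min: List[float],
                           timeout_min: float = LOCK_TIMEOUT_MIN) -> List[bool]:
    """Return, per VM, whether its failover job would obtain the vApp lock.

    Simplified model (assumptions): all jobs start waiting at t=0, the lock
    is granted in list order, and a job that times out never holds the lock.
    """
    results = []
    elapsed = 0.0  # minutes until the lock next becomes free
    for duration in durations_min:
        if elapsed < timeout_min:
            # The lock becomes free before this job's wait expires.
            results.append(True)
            elapsed += duration  # the job holds the lock for its whole failover
        else:
            # The job waited the full timeout without obtaining the lock.
            results.append(False)
    return results


# Four VMs in one vApp, each failover taking about 4 minutes:
# the fourth job would only get the lock at minute 12, after its timeout.
print(simulate_vapp_failover([4, 4, 4, 4]))  # [True, True, True, False]
```

In this model, the VMs at the end of the queue fail whenever the failovers ahead of them together take 10 minutes or longer, which matches the behavior described above for vApps with many or slow-to-import VMs.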
Resolution:
This is a known issue affecting Cloud Director Availability 4.x.
Currently, there is no resolution.
Workaround:
To work around this issue, perform the failover or test failover one VM at a time for each of the failed VMs.
Note: Do not retry the failover at the vApp level, because doing so deletes the failovers that have already succeeded.
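To identify which VMs need an individual retry, the log can be scanned for failed tasks that are immediately followed by a VappLockedException. The following Python sketch assumes the log lines have the same layout as the excerpt under Symptoms (a "has failed" line containing resourceName='...', with the exception on the next line). The function name and regular expression are illustrative and not part of the product.

```python
import re
from typing import List, Optional

# Matches the "Task ... has failed" line and captures the VM name,
# assuming the layout shown in the log excerpt under Symptoms.
TASK_FAILED = re.compile(r"resourceName='(?P<name>[^']*)'.*\bhas failed\s*$")
LOCK_ERROR = "VappLockedException"


def vms_failed_on_vapp_lock(log_lines: List[str]) -> List[str]:
    """Return names of VMs whose task failed with a VappLockedException."""
    names: List[str] = []
    pending: Optional[str] = None  # VM named on the preceding "has failed" line
    for line in log_lines:
        match = TASK_FAILED.search(line)
        if match:
            pending = match.group("name")
        elif pending is not None:
            # Only the line directly after "has failed" is inspected.
            if LOCK_ERROR in line and pending not in names:
                names.append(pending)
            pending = None
    return names
```

Passing the lines of the log file to `vms_failed_on_vapp_lock` returns the VM names in the order they first appear. Each of these VMs can then be failed over one at a time, as described in the workaround.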