Migration and replication tasks fail because the task status did not update in a timely fashion in Cloud Director Availability 4.x

Products

VMware Cloud Director

Issue/Introduction

Symptoms:

Tasks for managing existing or creating new replications and migrations fail with an error similar to:

Assuming task '4e2f98ec-####-####-####-########064' failed, because its status did not update in a timely fashion.

This issue occurs when one or more remote Replicators are offline or inaccessible.
In the /opt/vmware/h4/cloud/log/cloud.log file on the Cloud Replication Management Appliance of the destination site, you see entries similar to:

2020-07-12 01:07:51.174 ERROR - [UI-181f7e8f-####-####-####-########47d-JG] [task-poller-4] com.vmware.h4.jobengine.JobExecution     : Task 94b9abd4-####-####-####-########82f (WorkflowInfo{type='migrate', resourceType='vmReplication', resourceId='C4-372d3c94-####-####-####-########2f2', isPrivate=false, resourceName='MyApplication'}) has failed

com.vmware.vdr.error.exceptions.TaskMonitoringTimeOutException: Assuming task 'c8dc12bf-####-####-####-########e9e' failed, because its status did not update in a timely fashion.
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
    at com.vmware.h4.exceptions.GenericServerExceptionProvider.get(GenericServerExceptionProvider.java:120)
    at com.vmware.h4.exceptions.GenericServerExceptionProvider.get(GenericServerExceptionProvider.java:97)
    at com.vmware.h4.common.task.H4ApiTaskToTaskConverter.toTask(H4ApiTaskToTaskConverter.java:31)
    at com.vmware.task.rest.client.TaskMonitor.lambda$workImpl$0(TaskMonitor.java:191)
    at com.vmware.task.rest.client.TaskMonitor.notifyListener(TaskMonitor.java:213)
    at com.vmware.task.rest.client.TaskMonitor.workImpl(TaskMonitor.java:190)
    at com.vmware.task.rest.client.TaskMonitor.work(TaskMonitor.java:122)
    at com.vmware.h4.cloud.service.ManagerTaskMonitorService.lambda$taskMonitor$0(ManagerTaskMonitorService.java:107)
    at com.vmware.h4.common.mdc.MDCRunnableWrapper.run(MDCRunnableWrapper.java:30)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)

Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
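
To confirm which tasks were affected, the log can be searched for the exception shown above. The following is a minimal Python sketch, assuming the exception line keeps the format shown in the excerpt. The function name find_timed_out_tasks and the regular expression are illustrative and are not part of Cloud Director Availability.

```python
import re
from typing import Iterable, List

# Matches the exception line from the log excerpt and captures the task ID.
# Assumption: the message wording is the same as in the excerpt above.
TIMEOUT_PATTERN = re.compile(
    r"TaskMonitoringTimeOutException: Assuming task '([^']+)' failed"
)


def find_timed_out_tasks(lines: Iterable[str]) -> List[str]:
    """Return the IDs of tasks reported as timed out, in order of appearance."""
    task_ids = []
    for line in lines:
        match = TIMEOUT_PATTERN.search(line)
        if match:
            task_ids.append(match.group(1))
    return task_ids
```

To use it, open /opt/vmware/h4/cloud/log/cloud.log (or a copy of it) as a text file and pass the file object to find_timed_out_tasks.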

Environment

VMware Cloud Director Availability 4.x

Cause

This issue occurs when the threads used by the scheduler become saturated while it waits for responses from the offline or inaccessible Replicators, which prevents it from updating items in the expected timeframe.
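
The general mechanism can be illustrated with a small, self-contained Python sketch. It does not reproduce the Cloud Director Availability scheduler; the pool size, the wait time, and the function names are assumptions chosen only to show how a fixed-size thread pool delays a healthy poll when every worker is waiting on an unresponsive endpoint.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def poll(wait_s: float) -> float:
    """Stand-in for polling one Replicator; wait_s models the time spent waiting."""
    time.sleep(wait_s)
    return time.monotonic()


def healthy_poll_latency(pool_size: int, offline_polls: int, offline_wait_s: float) -> float:
    """Seconds until a poll of a healthy Replicator completes when it is
    submitted after offline_polls polls that each block for offline_wait_s."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        for _ in range(offline_polls):
            pool.submit(poll, offline_wait_s)
        # The healthy poll itself returns immediately once a worker is free.
        finished = pool.submit(poll, 0.0).result()
    return finished - start
```

With two workers and two blocked polls, the healthy poll cannot start until a blocked poll returns. With spare workers it completes almost immediately. In the same way, a status update that arrives late can be treated as a failed task.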

Resolution

To resolve this issue, if a remote site is not currently needed for active protections or migrations, either leave it running to maintain cross-site connectivity or unpair it before powering it down.

For more information, see the Unpair Paired Sites section of the Cloud Director Availability documentation.

If a site is needed for active protections or migrations, then all Replicators in that site should be online and accessible.
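
A basic TCP connection test is one way to check whether each Replicator is accessible from the network location where the test runs. The following is a minimal Python sketch. The host names and ports are placeholders that you must supply for your environment, and the function names are illustrative. A successful connection shows only that the port accepts connections, not that the Replicator service is healthy.

```python
import socket
from typing import Iterable, List, Tuple


def is_reachable(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def unreachable_endpoints(
    endpoints: Iterable[Tuple[str, int]], timeout_s: float = 5.0
) -> List[Tuple[str, int]]:
    """Return the (host, port) pairs that did not accept a connection."""
    return [(h, p) for h, p in endpoints if not is_reachable(h, p, timeout_s)]
```

Pass unreachable_endpoints a list of (host, port) pairs for the Replicators in the paired site. An empty result means that every listed endpoint accepted a connection.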

Additional Information

This error can also occur when the Cloud Director Availability appliances and the virtual infrastructure do not have their time synchronized, or when there is a delay in interacting with and processing API calls to the destination vSphere environment.
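
To get a rough indication of clock drift, a machine's clock can be compared with an NTP server by using a simple SNTP query. The following is a minimal Python sketch that uses only the standard library. The NTP host is a placeholder, the function names are illustrative, and the result ignores network delay, so treat the time configuration of the appliances and the virtual infrastructure as the authoritative source.

```python
import socket
import struct
import time

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_OFFSET = 2208988800


def server_time_from_response(response: bytes) -> float:
    """Extract the transmit timestamp from a 48-byte NTP response as Unix time."""
    fields = struct.unpack("!12I", response[:48])
    seconds, fraction = fields[10], fields[11]
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def clock_offset_seconds(ntp_host: str, timeout_s: float = 3.0) -> float:
    """Return (server time - local time) in seconds; network delay is ignored."""
    request = b"\x1b" + 47 * b"\0"  # LI=0, version 3, mode 3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        sock.sendto(request, (ntp_host, 123))
        response, _ = sock.recvfrom(512)
    return server_time_from_response(response) - time.time()
```

Run clock_offset_seconds against the same NTP server from each system that you want to compare. A large difference between the reported offsets suggests that the clocks are not synchronized.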

For more information, see "Assuming task failed, because it's status did not update in a timely fashion" error when configuring replications in vCloud Availability.