Error: "Assuming task ... failed, because its status did not update in a timely fashion." when failing over large vApp Replications

Products

VMware Cloud Director

Issue/Introduction

When failing over a vApp replication containing more than 10 VM replications in VMware Cloud Director Availability (VCDA), the failover operation for some of the VM replications fails with an error similar to:

Assuming task '########-####-####-####-########0230' failed, because its status did not update in a timely fashion.

In the /opt/vmware/h4/manager/log/manager.log file on the destination Cloud Director Replication Management Appliance, you see messages related to the failure similar to:

2025-05-10 05:50:16.221 WARN - [########-####-####-####-########54fa] [task-poller-3] com.vmware.task.rest.client.TaskMonitor : Task ########-####-####-####-########0230 has timed out (it hasn't been updated since 1746856156203, in 60000 msec)
2025-05-10 05:50:16.221 ERROR - [UI-########-####-####-####-########efd8-r955-AZ-NYE-9W] [task-poller-3] com.vmware.h4.jobengine.JobExecution : Task ########-####-####-####-########06cf (WorkflowInfo{type='failover', resourceType='replication', resourceId='H4-########-####-####-####-########71ed', isPrivate=false, resourceName='<VM Name>'}) has failed

com.vmware.h4.api.error.exceptions.TaskMonitoringTimeOutException: Assuming task '########-####-####-####-########0230' failed, because its status did not update in a timely fashion.

In the /opt/vmware/h4/replicator/log/replicator.log file on the destination Replicator Appliance, you see the failover task eventually run and complete after the above timeout:

2025-05-10 06:17:24.835 DEBUG - [UI-########-####-####-####-########efd8-r955-AZ-NYE-9W-GO] [job-48] com.vmware.h4.jobengine.JobExecution : Task ########-####-####-####-########0230 (WorkflowInfo{type='failover', resourceType='replication', resourceId='H4-########-####-####-####-########71ed', isPrivate=false, resourceName='null'}) completed with result DestinationReplicationState{id='H4-########-####-####-####-########71ed', currentRpoViolation=-1, latestInstance=null, state=null, recoveryInfo=RecoveryInfo{recoveryState=COMPLETE, vcId='########-####-####-####-########dc51', vmId='vm-###', vmName='<VM Name>', optimizeUntil=null, isMigration=null}, lastError=null, replicatedDisks=[ReplicatedDiskInfo{diskKey=2001, uuid=########-####-####-####-########2adf, diskType='thin', baseName='<VM Disk>.vmdk', capacityBytes=375809638400, isSeed=false, isReplicated=true, spaceRequirement=-1}, ReplicatedDiskInfo{diskKey=2000, uuid=########-####-####-####-########d634, diskType='thin', baseName='<VM Disk>.vmdk', capacityBytes=42949672960, isSeed=false, isReplicated=true, spaceRequirement=-1}], spaceRequirement=-1, isMovingReplica=false}

This issue can occur when the VM replications have multiple disks, use multiple point-in-time instances, and have consolidation enabled for failover.
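
To check whether this timeout is present in the Manager log, you can count the lines that contain the exception name shown above. The following is a minimal sketch, not a supported tool: the function name count_task_timeouts is hypothetical, and the default path is the manager.log location given in this article.

```shell
# Hypothetical helper: count log lines that contain the timeout exception.
# The default path is the manager.log location named in this article.
count_task_timeouts() {
    log_file="${1:-/opt/vmware/h4/manager/log/manager.log}"
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c 'TaskMonitoringTimeOutException' "$log_file"
}
```

Run count_task_timeouts with no argument on the destination Cloud Director Replication Management Appliance, or pass the path to a copy of the log. A non-zero count only indicates that the exception was logged; compare the timestamps and task IDs with the failed failover to confirm that they match.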

Environment

VMware Cloud Director Availability 4.7.x

Cause

This issue occurs in vApp replications that contain more than 10 VM replications, where the individual VM replication failover tasks take longer to complete than expected, resulting in the Manager service timing out while waiting for an update on the Replicator service tasks.

Resolution

VMware Cloud Director Availability Replicator appliances, by default, can process up to 10 threads concurrently per user. If the per-VM failover is taking longer than the expected 60 seconds due to tasks such as instance consolidation, you can increase the number of concurrent threads based on the size of the largest vApp in your environment.

Note: This procedure modifies a configuration file. Take a backup of the file before proceeding.

1. SSH to one of the Replicator Appliances and log in as root.
2. Edit the Replicator service's application.properties file:

   vi /opt/vmware/h4/replicator/config/application.properties

3. Uncomment the jobengine.threads.per.user line and increase the value to the number of VMs in the largest vApp:

   Example: If the largest protected vApp in your environment contains 20 VMs, you can set the value to 20.

   jobengine.threads.per.user=20

4. Restart the Replicator Appliance.
5. Repeat steps 1-4 for every other Replicator Appliance at both the source and destination sites.
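
The backup and edit can also be scripted instead of done by hand in vi. The following is a minimal sketch under stated assumptions, not a supported procedure: the function name set_threads_per_user is hypothetical, the commented line is assumed to start with # (optionally followed by spaces) and then the property name, and sed -i -E assumes GNU sed. Review the file afterwards to confirm the result.

```shell
# Hypothetical helper: back up a properties file, then uncomment and set
# jobengine.threads.per.user. Assumes the line, commented or not, has the
# form "#jobengine.threads.per.user=<value>"; this format is an assumption.
set_threads_per_user() {
    props_file="$1"
    threads="$2"
    # Keep a copy of the original file next to it before editing.
    cp -p "$props_file" "${props_file}.bak" || return 1
    # Strip any leading '#' or whitespace and replace the value (GNU sed).
    sed -i -E "s/^[#[:space:]]*jobengine\.threads\.per\.user[[:space:]]*=.*/jobengine.threads.per.user=${threads}/" "$props_file"
    # Print the resulting line so the change can be verified.
    grep '^jobengine\.threads\.per\.user=' "$props_file"
}
```

For example, set_threads_per_user /opt/vmware/h4/replicator/config/application.properties 20 would correspond to the 20-VM example above. The sketch overwrites an existing .bak copy on each run and does not restart anything, so the Replicator Appliance still needs to be restarted afterwards. If grep prints nothing, the property line was not found in the assumed format and the file should be edited manually.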

Additional Information

For more information on backing up a Replicator Appliance, see Backing up and restoring in the Cloud Director site.