Failover erroring out in vCDA with error "Assuming task ... failed, because its status did not update in a timely fashion."



Article ID: 442252



Products

VMware Cloud Director

Issue/Introduction

  • When failing over a vApp/VM replication in VMware Cloud Director Availability (VCDA), the failover operation fails with the error:
    "Assuming task ... failed, because its status did not update in a timely fashion."
  • The following entry is logged in /opt/vmware/h4/manager/log/manager.log on the destination Cloud Director Replication Management Appliance:
    ####-##-## ##:##:##.##  WARN - [#######-####-####-####-########] [hbr-poller1] c.v.h.c.e.ExceptionConversionService: Unable to convert exception. Using fallback exception instead.
    com.vmware.vim.binding.hbr.replica.fault.HbrRuntimeFault: Error for (datastoreUUID: "<datastore_UUID>"), (diskId: "Disk_UUID"), (hostId: "<Host-ID>:host-######"), (pathname: "##-########-#####-####-#####-#####/#####.#####-##########-##########-####-####-####-##########.########.#### #####.vmdk"), (flags: nfc-error, retriable): Class: NFC Code: 10; NFC error: NFC_DISKLIB_ERROR (Connection timed out); Set error flag: retriable; Set error flag: nfc-error; Can't write (multiEx) to remote disk; Can't write (multi) to remote disk
            at jdk.internal.reflect.GeneratedConstructorAccessor434.newInstance(Unknown Source)
            at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
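
To confirm that a given failover matches this signature, the manager.log file can be searched for the NFC timeout string. The following is a minimal sketch, not a supported tool. It assumes the log entry format shown above, and the function names find_nfc_timeouts and scan_file are hypothetical helpers introduced only for illustration.

```python
import re

# Error signature copied from the log entry shown above.
NFC_TIMEOUT = "NFC_DISKLIB_ERROR (Connection timed out)"

# Extracts the (name: "value") pairs that the HbrRuntimeFault message carries.
# Assumption: the fields keep the exact names seen in the sample entry.
FIELD_RE = re.compile(r'\((datastoreUUID|diskId|hostId): "([^"]*)"\)')


def find_nfc_timeouts(lines):
    """Return one dict of fault fields per line that carries the NFC timeout."""
    hits = []
    for line in lines:
        if NFC_TIMEOUT not in line:
            continue
        hits.append(dict(FIELD_RE.findall(line)))
    return hits


def scan_file(path="/opt/vmware/h4/manager/log/manager.log"):
    """Scan the manager log on the destination appliance (path from this article)."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return find_nfc_timeouts(fh)
```

Calling scan_file() on the destination appliance returns the datastore, disk, and host identifiers for each matching entry. If the same datastore or host appears repeatedly, that points toward where the storage investigation should start.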

Environment

VMware Cloud Director Availability 4.7.x

Cause

VMware Cloud Director Availability utilizes the Network File Copy (NFC) protocol to instruct the target ESXi host to write replicated data directly to the destination datastore. 

When the underlying storage takes longer than approximately 60 seconds to acknowledge these writes, the ESXi host aborts the operation.

The ESXi host is successfully issuing I/O requests, but the storage backend (SAN/Array) or the physical fabric (FC/iSCSI) is failing to acknowledge or complete these requests within the default SCSI timeout periods.

These storage-level timeouts cascade to the replication layer, causing the NFC_DISKLIB_ERROR and subsequent replication failure.

Resolution

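Engage your storage team or vendor to investigate the latency and timeouts on the storage backend (SAN/Array) and the physical fabric (FC/iSCSI) that serve the destination datastore.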

Additional Information

Refer to the following related Knowledge Base articles:
  • vSphere Replication: NFC_DISKLIB_ERROR (Connection timed out) due to Storage Latency
  • VMs have high latency and may freeze due to repeated D:0x28 (TASK_SET_FULL)
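
When gathering evidence for the storage team, it can help to count how often the target ESXi host logs the D:0x28 (TASK_SET_FULL) device status per storage device. The sketch below is an illustration under stated assumptions, not a supported tool. It assumes vmkernel log lines of the usual form `... to dev "naa.xxx" failed H:0x0 D:0x28 P:0x0`, and count_device_status is a hypothetical helper name.

```python
import re
from collections import Counter

# Host/device/plugin status triplet printed for failed SCSI commands,
# for example "H:0x0 D:0x28 P:0x0". Assumption: the log uses this layout.
STATUS_RE = re.compile(r"H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+)")

# Device identifier as it appears in the same line, for example dev "naa.600...".
DEVICE_RE = re.compile(r'dev "([^"]+)"')


def count_device_status(lines, device_status=0x28):
    """Count lines per device whose SCSI device status equals device_status."""
    counts = Counter()
    for line in lines:
        status = STATUS_RE.search(line)
        if status is None or int(status.group(2), 16) != device_status:
            continue
        device = DEVICE_RE.search(line)
        counts[device.group(1) if device else "unknown"] += 1
    return counts
```

The function takes any iterable of log lines, so it can be fed an open file handle for a copy of the host's vmkernel log. A device with a high count is a candidate for the queue-depth or array-side review described in the related article.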