Replication delays and queuing behavior in VMware Cloud Director Availability

Article ID: 414717


Updated On:

Products

VMware Cloud Director

Issue/Introduction

In VMware Cloud Director Availability, administrators may observe replication delays, queuing behavior, or RPO violations, particularly when replicating large VMs configured with aggressive RPO settings (e.g., 30 minutes).

These symptoms are often accompanied by failed or stalled replication tasks due to storage contention or network configuration mismatches.

/var/log/vmware/hbrsrv.log shows messages similar to the following:

NfcFssrvrWriteCB: Failed to write 131072 bytes … : Device or resource busy
NFC_DISKLIB_ERROR (Device or resource busy)
Storage is locked
REPLICA_UPDATE … took 19s to complete (Expected <10s).

29-09-2025T01:32:58.913Z info hbrsrv[2748451] [Originator@6876 sub=Libs] [NFC ERROR] 
NfcFssrvrProcessErrorMsg: received diskLib error 1048585 from server:
NfcFssrvrWriteCB: Failed to write 24576 bytes @ 44800761856 : Device or resource busy
RemoteDisk: Closing path … hbrdisk…vmdk
Destroying NFC connection to host-#####.
Nfc_CloseSessionEx: session=7F6ED802B500
NfcSessionStats: sessionDurationUs=25367681, attemptedFileTransfers=0, successfulFileTransfers=0, totalBytesTransferred=0
netRecvLatencyStats: count 9 min/max/avg 54/25179864/2818385

Environment

VMware Cloud Director Availability 4.7.3 

Cause

Replication delays and queuing occur due to a combination of storage contention, aggressive RPO configuration, and non-uniform MTU settings between ESXi hosts and VCDA appliances.

Log Indicators

Log Message | Meaning
NfcFssrvrWriteCB: Failed to write … : Device or resource busy | Indicates storage contention or locking during replication writes.
REPLICA_UPDATE … took 19s to complete (Expected <10s) | Indicates a replication task exceeding its expected completion window.
netRecvLatencyStats: count 9 min/max/avg 54/25179864/2818385 | Reflects network latency spikes impacting replication throughput.
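
To confirm which of these indicators are present, the log can be scanned directly. The following is a minimal Python sketch, assuming access to a copy of /var/log/vmware/hbrsrv.log; the patterns mirror the table above, and the log path can be overridden on the command line.

#!/usr/bin/env python3
"""Count occurrences of the replication-delay indicators in hbrsrv.log."""
import re
import sys

# Indicator patterns taken from the Log Indicators table above.
INDICATORS = {
    "storage contention / lock": re.compile(r"Device or resource busy|Storage is locked"),
    "slow REPLICA_UPDATE": re.compile(r"REPLICA_UPDATE .* took \d+s to complete"),
    "network latency stats": re.compile(r"netRecvLatencyStats"),
}

def scan(path: str) -> None:
    counts = {name: 0 for name in INDICATORS}
    with open(path, errors="replace") as log:
        for line in log:
            for name, pattern in INDICATORS.items():
                if pattern.search(line):
                    counts[name] += 1
    for name, count in counts.items():
        print(f"{name}: {count} matching line(s)")

if __name__ == "__main__":
    scan(sys.argv[1] if len(sys.argv) > 1 else "/var/log/vmware/hbrsrv.log")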

Resolution

Replication delays and queuing in VMware Cloud Director Availability result from a combination of aggressive RPO settings, large VM sizes, and non-uniform network configuration between ESXi hosts and the VCDA appliances. Address each of the contributing factors described below.

Aggressive RPO and Retention Policy Misalignment

When the configured Recovery Point Objective is shorter than the time required to complete a replication cycle, replication jobs can begin to overlap. This often occurs when the retention policy is set to maintain multiple instances over long intervals while using an aggressive (short) RPO. As a result, replication tasks may queue up, increasing system load and causing potential RPO violations.

Adjust the RPO to a value that aligns with the replication duration and retention interval.
For example, increasing the RPO to 60 minutes when retaining hourly instances allows each cycle to complete fully before the next begins. This helps prevent overlap, queuing, and delayed replication tasks.
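
As a quick sanity check, the relationship between RPO and cycle time can be expressed directly. The sketch below uses hypothetical figures (a 45-minute sync cycle and a 20% contention headroom) purely to illustrate why a 30-minute RPO leads to queuing while a 60-minute RPO does not.

# Back-of-the-envelope check with hypothetical figures: if a full sync cycle
# (plus some headroom for contention) does not fit inside the RPO window, the
# next scheduled sync starts before the previous one finishes and jobs queue up.

def rpo_is_feasible(rpo_minutes: float, cycle_minutes: float, headroom: float = 1.2) -> bool:
    """Return True when one replication cycle fits inside the RPO window."""
    return cycle_minutes * headroom <= rpo_minutes

# Example: a large VM whose delta sync takes about 45 minutes under load.
for rpo in (30, 60, 120):
    verdict = "OK" if rpo_is_feasible(rpo, cycle_minutes=45) else "will overlap and queue"
    print(f"RPO {rpo} min with a 45 min cycle: {verdict}")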

Large Virtual Machines and Concurrent Replications

Replicating multiple large or high I/O virtual machines concurrently can lead to storage and network contention. This may result in replication delays or errors such as “device or resource busy” in the logs. Each replication process consumes bandwidth and storage I/O resources, and when many large replications run simultaneously, they can exceed available capacity.

  • Stagger replication schedules so that multiple large VMs do not replicate at the same time (see the scheduling sketch after this list).

  • Ensure that the configured RPO provides sufficient time for replication jobs to complete (RPO ≥ backup or instance interval).

  • If feasible, allocate dedicated network resources or bandwidth for replication traffic to minimize congestion and improve performance.
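
The sketch below illustrates the staggering idea using hypothetical VM names and sync durations. VMware Cloud Director Availability derives sync timing from the configured RPO, so in practice staggering is typically achieved by offsetting when replications are configured or by assigning different RPO windows; the sketch only shows how to compute non-overlapping start times.

from datetime import datetime, timedelta

# Hypothetical VMs with rough sync durations in minutes.
vms = [("db-vm-01", 50), ("app-vm-02", 35), ("file-vm-03", 40)]

def stagger(start: datetime, jobs, gap_minutes: int = 10):
    """Assign start times so that large syncs do not run concurrently."""
    schedule = []
    cursor = start
    for name, duration in jobs:
        schedule.append((name, cursor))
        # The next job starts only after this one is expected to finish, plus a gap.
        cursor += timedelta(minutes=duration + gap_minutes)
    return schedule

for name, when in stagger(datetime(2025, 9, 29, 1, 0), vms):
    print(f"{name}: start sync at {when:%H:%M}")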

Non-Uniform MTU Configuration

Inconsistent MTU (Maximum Transmission Unit) settings across components — such as ESXi hosts, VCDA appliances, and network devices — can lead to packet fragmentation and reassembly overhead. This inconsistency increases network latency and can slow down replication traffic, especially during high data transfer operations.

Ensure consistent MTU settings across the entire replication path.

Use a uniform MTU size (e.g., 1500 for standard frames or 9000 for jumbo frames) across ESXi hosts, VCDA appliances, and intermediate network devices. Consistent configuration prevents fragmentation and improves overall network throughput and replication stability.
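
End-to-end MTU consistency can be verified with don't-fragment probes before replication is rescheduled. The sketch below assumes a Linux shell (for example, the VCDA appliance console) and a reachable peer in the replication path; the hostname is hypothetical. On ESXi hosts, the equivalent test is performed with vmkping.

import subprocess

# Send don't-fragment ICMP probes sized for standard (1500) and jumbo (9000)
# frames. 28 bytes are subtracted for the IP and ICMP headers. If the jumbo
# probe fails while the standard one succeeds, a hop in the replication path
# is not configured for jumbo frames.

def path_supports_mtu(target: str, mtu: int) -> bool:
    payload = mtu - 28  # IP (20 bytes) + ICMP (8 bytes) header overhead
    result = subprocess.run(
        ["ping", "-M", "do", "-s", str(payload), "-c", "3", target],
        capture_output=True,
    )
    return result.returncode == 0

target = "replicator.example.com"  # hypothetical replication peer
for mtu in (1500, 9000):
    status = "passes" if path_supports_mtu(target, mtu) else "fails (fragmentation required)"
    print(f"MTU {mtu} probe to {target}: {status}")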

 
Issue | Resolution | Benefit
Aggressive RPO (for example, 30 minutes) | Increase the RPO to 60 minutes | Prevents replication overlap and queuing
Large VMs (for example, 80 GB+) | Stagger replications or increase the RPO | Reduces I/O contention and resource lock errors
MTU mismatch (for example, 9000 vs. 1500) | Standardize the MTU across all components | Eliminates packet fragmentation and latency

 

To increase the RPO, follow these steps:

1. Log in to the VMware Cloud Director Availability UI (on either the on-premises or the cloud site).

  • Example: https://<vcda-manager-fqdn>/ui/

2. Go to Replications → Outgoing Replications (for the source site) or Incoming Replications (for the target site).

3. Locate the replication you want to modify.

  • You can use the search bar or filter by status, source VM, or organization.

4. Click the replication name to open its details view.

5. From the Actions menu, open the Recovery Settings section and find the field labeled RPO (Recovery Point Objective).

6. Increase the RPO value to a higher interval.

  • For example, change 30 minutes → 60 minutes (or higher, based on your retention policy and network/storage capacity).

7. Review the retention policy (number of instances) to ensure it aligns with the new RPO.

  • Example: If you retain 24 hourly instances, the RPO should be ≥ 60 minutes.

8. Save or apply the changes.

The replication now follows the new RPO schedule, and the next sync occurs accordingly.

Additional Information

Best Practices When Increasing RPO

  • Align RPO with your retention policy to prevent overlap (e.g., hourly retention → 60 min RPO).
  • Avoid overly aggressive RPOs for large VMs (>50–80 GB) or high I/O workloads.
  • Ensure network bandwidth and storage performance can handle concurrent replication jobs (a rough sizing sketch follows this list).
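
The sketch below shows one way to estimate that headroom, using hypothetical change rates and VM counts: it converts the changed data that must be shipped per RPO window into the sustained bandwidth the replication network has to provide.

# Rough sizing with hypothetical numbers: sustained bandwidth needed so that
# all concurrent delta syncs finish inside one RPO window.

def required_mbps(changed_gb_per_vm: float, vm_count: int, rpo_minutes: int) -> float:
    """Megabits per second needed to ship all changed data within the RPO window."""
    total_bits = changed_gb_per_vm * vm_count * 8 * 1024**3
    return total_bits / (rpo_minutes * 60) / 1_000_000

# Example: five VMs, each changing about 4 GB per hour, replicated with a 60-minute RPO.
print(f"Sustained bandwidth needed: ~{required_mbps(4, 5, 60):.0f} Mbit/s")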