In VMware Cloud Director Availability, administrators may observe replication delays, queuing behavior, or RPO violations, particularly when replicating large VMs configured with aggressive RPO settings (e.g., 30 minutes).
These symptoms are often accompanied by failed or stalled replication tasks due to storage contention or network configuration mismatches.
The /var/log/vmware/hbrsrv.log file contains messages similar to the following:
NfcFssrvrWriteCB: Failed to write 131072 bytes … : Device or resource busy
NFC_DISKLIB_ERROR (Device or resource busy)
Storage is locked
REPLICA_UPDATE … took 19s to complete (Expected <10s).
29-09-2025T01:32:58.913Z info hbrsrv[2748451] [Originator@6876 sub=Libs] [NFC ERROR] NfcFssrvrProcessErrorMsg: received diskLib error 1048585 from server: NfcFssrvrWriteCB: Failed to write 24576 bytes @ 44800761856 : Device or resource busy
RemoteDisk: Closing path … hbrdisk…vmdk
Destroying NFC connection to host-#####.
Nfc_CloseSessionEx: session=7F6ED802B500
NfcSessionStats: sessionDurationUs=25367681, attemptedFileTransfers=0, successfulFileTransfers=0, totalBytesTransferred=0
netRecvLatencyStats: count 9 min/max/avg 54/25179864/2818385
VMware Cloud Director Availability 4.7.3
Replication delays and queuing occur due to a combination of storage contention, aggressive RPO configuration, and non-uniform MTU settings between ESXi hosts and VCDA appliances.
Log Indicators
| Log Message | Meaning |
|---|---|
| NfcFssrvrWriteCB: Failed to write … : Device or resource busy | Indicates storage contention or locking during replication writes. |
| REPLICA_UPDATE … took 19s to complete (Expected <10s) | Suggests that a replication task exceeded its expected completion window. |
| netRecvLatencyStats: count 9 min/max/avg 54/25179864/2818385 | Reflects network latency spikes that reduce replication throughput. |
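To see how often these indicators occur, the log can be scanned with a short script. The following is a minimal sketch, not a supported tool: the regular expressions are derived only from the messages quoted in this article, and the labels (`storage_busy`, `slow_replica_update`, `net_recv_latency`) and function names are hypothetical. Actual log lines may differ between versions, so treat the counts as a rough signal.

```python
import re
from collections import Counter

# Patterns derived from the messages quoted in this article.
# The label names are illustrative, not product terminology.
INDICATORS = {
    "storage_busy": re.compile(
        r"Failed to write \d+ bytes .*Device or resource busy"),
    "slow_replica_update": re.compile(
        r"REPLICA_UPDATE.*took (\d+)s to complete"),
    "net_recv_latency": re.compile(
        r"netRecvLatencyStats: count \d+ min/max/avg (\d+)/(\d+)/(\d+)"),
}


def count_indicators(lines):
    """Count how many lines match each indicator pattern."""
    counts = Counter()
    for line in lines:
        for label, pattern in INDICATORS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


def scan_log(path="/var/log/vmware/hbrsrv.log"):
    """Scan a log file; the default path is the one named in this article."""
    with open(path, errors="replace") as handle:
        return count_indicators(handle)
```

Calling `scan_log()` on the appliance that holds the log returns a `Counter` keyed by label. A steadily growing `storage_busy` count during replication windows would be consistent with the contention described above.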
Three factors contribute to replication delays and queuing in VMware Cloud Director Availability: aggressive RPO settings, large VM size, and non-uniform network configuration between ESXi hosts and VCDA appliances. Each factor and its resolution is described below.
When the configured Recovery Point Objective is shorter than the time required to complete a replication cycle, replication jobs can begin to overlap. This often occurs when the retention policy is set to maintain multiple instances over long intervals while using an aggressive (short) RPO. As a result, replication tasks may queue up, increasing system load and causing potential RPO violations.
Adjust the RPO to a value that aligns with the replication duration and retention interval.
For example, increasing the RPO to 60 minutes when retaining hourly instances allows each cycle to complete fully before the next begins. This helps prevent overlap, queuing, and delayed replication tasks.
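The overlap condition can be expressed as a simple comparison between the RPO and the observed duration of one replication cycle. The sketch below is illustrative only: the function names and the `step_minutes` rounding granularity are assumptions, and the cycle duration must come from your own measurements (for example, from task durations in the UI or logs).

```python
import math


def cycles_overlap(rpo_minutes, cycle_minutes):
    """Return True when one replication cycle takes longer than the RPO,
    so the next cycle is due before the current one has finished."""
    return cycle_minutes > rpo_minutes


def smallest_safe_rpo(cycle_minutes, step_minutes=30):
    """Round the measured cycle duration up to the next multiple of
    step_minutes (an assumed granularity, not a product constraint)."""
    return math.ceil(cycle_minutes / step_minutes) * step_minutes


# A cycle measured at 45 minutes overlaps with a 30-minute RPO
# but fits inside a 60-minute RPO.
print(cycles_overlap(30, 45))   # True
print(cycles_overlap(60, 45))   # False
print(smallest_safe_rpo(45))    # 60
```

In practice it is worth leaving headroom above the measured duration, because cycle time varies with the change rate of the VM.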
Replicating multiple large or high I/O virtual machines concurrently can lead to storage and network contention. This may result in replication delays or errors such as “device or resource busy” in the logs. Each replication process consumes bandwidth and storage I/O resources, and when many large replications run simultaneously, they can exceed available capacity.
Stagger replication schedules to avoid multiple large VMs replicating at the same time.
Ensure that the configured RPO provides sufficient time for replication jobs to complete (RPO ≥ backup or instance interval).
If feasible, allocate dedicated network resources or bandwidth for replication traffic to minimize congestion and improve performance.
Inconsistent MTU (Maximum Transmission Unit) settings across components — such as ESXi hosts, VCDA appliances, and network devices — can lead to packet fragmentation and reassembly overhead. This inconsistency increases network latency and can slow down replication traffic, especially during high data transfer operations.
Ensure consistent MTU settings across the entire replication path.
Use a uniform MTU size (e.g., 1500 for standard frames or 9000 for jumbo frames) across ESXi hosts, VCDA appliances, and intermediate network devices. Consistent configuration prevents fragmentation and improves overall network throughput and replication stability.
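A common way to verify the MTU along a path is to send a ping with the "don't fragment" bit set and a payload sized to fill the frame exactly. For IPv4, the largest ICMP payload is the MTU minus 28 bytes (20-byte IP header plus 8-byte ICMP header). The sketch below computes that size and builds a `vmkping` command line to run on an ESXi host. The helper names, the `vmk1` interface, and the `192.0.2.10` target are placeholders, not values from this environment.

```python
IP_HEADER_BYTES = 20    # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_ping_payload(mtu):
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IP_HEADER_BYTES - ICMP_HEADER_BYTES


def vmkping_command(target, mtu, vmk=None):
    """Build a vmkping command with the don't-fragment bit (-d) set
    and the payload size (-s) matched to the given MTU."""
    parts = ["vmkping", "-d", "-s", str(max_ping_payload(mtu))]
    if vmk:
        parts += ["-I", vmk]  # send from a specific VMkernel interface
    parts.append(target)
    return " ".join(parts)


def mtu_is_uniform(mtus):
    """Return True when every component reports the same MTU.
    `mtus` maps a component name to its configured MTU."""
    return len(set(mtus.values())) <= 1


# Placeholder interface and documentation-range address:
print(vmkping_command("192.0.2.10", 9000, vmk="vmk1"))
# vmkping -d -s 8972 -I vmk1 192.0.2.10
```

If the ping with the full-size payload fails while a smaller one succeeds, some device on the path is likely configured with a smaller MTU.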
| Issue | Resolution | Benefit |
|---|---|---|
| Aggressive RPO (e.g., 30 minutes) | Increase RPO to 60 minutes | Prevents replication overlap and queuing |
| Large VMs (e.g., 80 GB or more) | Stagger replication or increase RPO | Reduces I/O contention and resource lock errors |
| MTU mismatch (e.g., 9000 vs. 1500) | Standardize MTU across all components | Avoids packet fragmentation and the latency it adds |
Log in to the VMware Cloud Director Availability UI (either the On-Prem or Cloud site).
Example: https://<vcda-manager-fqdn>/ui/
Go to Replications → Outgoing Replications (for source site) or Incoming Replications (for target site).
Locate the replication you want to modify.
You can use the search bar or filter by status, source VM, or organization.
Click on the replication name to open its details view.
From the "Actions" menu, open the Recovery Settings section and find the field labeled RPO (Recovery Point Objective).
Increase the RPO value to a higher interval.
For example, change 30 minutes → 60 minutes (or higher, based on your retention and network/storage capacity).
Review the retention policy (number of instances) to ensure it aligns with the new RPO.
Example: If you retain 24 hourly instances, the RPO should be ≥ 60 minutes.
Save/Apply the changes.
The replication will now follow the new RPO schedule, and the next sync will occur accordingly.
Reference Articles
Best Practices When Increasing RPO