ERROR RPO violation - Disk consolidation in progress 0% done

Products

VMware Live Recovery

Issue/Introduction

Symptoms

The replication job status no longer cycles through the consolidation process and remains at 0% progress.

The ESXi host log shows repeated "REPLICA_UPDATE" failures:

2025-04-22T15:19:02.237Z Wa(180) vmkwarning: cpu68:15613361)WARNING: Hbr: 3649: Command REPLICA_UPDATE failed (result=Failed) (isFatal=FALSE) (Id=309567938) (GroupID=GID-72c9edf1-34b0-4448-aa28-1a36d62cfd75)
2025-04-22T15:19:02.639Z Wa(180) vmkwarning: cpu38:15613361)WARNING: Hbr: 3649: Command REPLICA_UPDATE failed (result=Failed) (isFatal=FALSE) (Id=309568098) (GroupID=GID-72c9edf1-34b0-4448-aa28-1a36d62cfd75)
2025-04-22T15:19:07.911Z Wa(180) vmkwarning: cpu62:15613361)WARNING: Hbr: 3649: Command REPLICA_UPDATE failed (result=Failed) (isFatal=FALSE) (Id=309568241) (GroupID=GID-72c9edf1-34b0-4448-aa28-1a36d62cfd75)
2025-04-22T15:19:08.282Z Wa(180) vmkwarning: cpu62:15613361)WARNING: Hbr: 3649: Command REPLICA_UPDATE failed (result=Failed) (isFatal=FALSE) (Id=309568347) (GroupID=GID-72c9edf1-34b0-4448-aa28-1a36d62cfd75)
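
To gauge how often the failure recurs, warnings like the ones above can be tallied per replication group. The following is a minimal Python sketch, assuming the log lines follow the layout of the excerpt above. The function name count_replica_update_failures and the regular expression are illustrative and are not part of any VMware tool.

```python
import re
from collections import Counter

# Pattern derived from the excerpt above (assumption: other builds may
# format these warnings differently).
FAILURE_RE = re.compile(
    r"Command REPLICA_UPDATE failed .*?\(GroupID=(?P<group>GID-[0-9a-f-]+)\)"
)


def count_replica_update_failures(lines):
    """Return a Counter mapping GroupID to the number of REPLICA_UPDATE failures."""
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            counts[match.group("group")] += 1
    return counts


# Two lines copied from the excerpt above, used as sample input.
SAMPLE_LINES = [
    "2025-04-22T15:19:02.237Z Wa(180) vmkwarning: cpu68:15613361)WARNING: Hbr: 3649: "
    "Command REPLICA_UPDATE failed (result=Failed) (isFatal=FALSE) (Id=309567938) "
    "(GroupID=GID-72c9edf1-34b0-4448-aa28-1a36d62cfd75)",
    "2025-04-22T15:19:02.639Z Wa(180) vmkwarning: cpu38:15613361)WARNING: Hbr: 3649: "
    "Command REPLICA_UPDATE failed (result=Failed) (isFatal=FALSE) (Id=309568098) "
    "(GroupID=GID-72c9edf1-34b0-4448-aa28-1a36d62cfd75)",
]

if __name__ == "__main__":
    for group, count in count_replica_update_failures(SAMPLE_LINES).items():
        print(group, count)
```

A steadily growing count for a single GroupID suggests that one replication is stuck rather than the host as a whole.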

On the target ESXi host, /var/run/log/hbrsrv.log reports a DiskLib error caused by insufficient storage space:

2025-04-22T14:25:16.992Z Er(163) hbrsrv[5020520]: [Originator@6876 sub=Main] HbrError for (datastoreUUID: "67ffd464-0db69ec4-7d71-0025b5696953"), (pathname: "TestVM/hbrdisk.RDID-caca3e0f-7c5a-4eff-b866-b25e1b60b0d3.2108.72541692687570.vmdk"), (flags: disklib-error) stack:
2025-04-22T14:25:16.992Z Er(163) hbrsrv[5020520]: [Originator@6876 sub=Main]    [0] Set error flag: disklib-error
2025-04-22T14:25:16.992Z Er(163) hbrsrv[5020520]: [Originator@6876 sub=Main]    [1] Class: DiskLib Code: 13
2025-04-22T14:25:16.992Z Er(163) hbrsrv[5020520]: [Originator@6876 sub=Main]    [2] DiskLib error: There is not enough space on the file system for the selected operation
2025-04-22T14:25:16.992Z Er(163) hbrsrv[5020520]: [Originator@6876 sub=Main]    [3] Code set to: Insufficient storage space.
2025-04-22T14:25:16.992Z Er(163) hbrsrv[5020520]: [Originator@6876 sub=Main]    [4] DiskLib operation (DiskLib_Write async) failed
2025-04-22T14:25:16.992Z Er(163) hbrsrv[5020520]: [Originator@6876 sub=Main]    [5] Converting error to wire failure
2025-04-22T14:25:16.993Z Db(167) hbrsrv[5020555]: [Originator@6876 sub=LocalDisk] CID changed for disk (/vmfs/volumes/67ffd464-0db69ec4-7d71-0025b5696953/TestVM/hbrdisk.RDID-caca3e0f-7c5a-4eff-b866-b25e1b60b0d3.2108.72541692687570.vmdk) (prev='0d30b543693f90aef5513b4eb3b7864f') (current=0d30b543693f90aef5513b4eb3b7864f)
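
The affected datastore and replica disk can be pulled out of entries like these programmatically. The sketch below is a hedged example that assumes the HbrError header and stack lines have the same shape as in the excerpt above. The function name summarize_hbr_errors and the dictionary keys are hypothetical.

```python
import re

# Both patterns are derived from the excerpt above (assumption: the message
# layout is stable across hbrsrv versions).
HBR_ERROR_RE = re.compile(
    r'HbrError for \(datastoreUUID: "(?P<uuid>[^"]+)"\), '
    r'\(pathname: "(?P<path>[^"]+)"\), \(flags: (?P<flags>[^)]+)\)'
)
DISKLIB_MSG_RE = re.compile(r"DiskLib error: (?P<msg>.+)$")


def summarize_hbr_errors(lines):
    """Collect datastore UUID, disk path, flags, and DiskLib message per HbrError."""
    errors = []
    for line in lines:
        header = HBR_ERROR_RE.search(line)
        if header:
            errors.append({
                "datastore_uuid": header.group("uuid"),
                "pathname": header.group("path"),
                "flags": header.group("flags"),
                "disklib_error": None,
            })
            continue
        detail = DISKLIB_MSG_RE.search(line)
        # Attach the first DiskLib message that follows the most recent header.
        if detail and errors and errors[-1]["disklib_error"] is None:
            errors[-1]["disklib_error"] = detail.group("msg").strip()
    return errors


# Lines copied from the excerpt above, used as sample input.
SAMPLE_LINES = [
    '2025-04-22T14:25:16.992Z Er(163) hbrsrv[5020520]: [Originator@6876 sub=Main] '
    'HbrError for (datastoreUUID: "67ffd464-0db69ec4-7d71-0025b5696953"), '
    '(pathname: "TestVM/hbrdisk.RDID-caca3e0f-7c5a-4eff-b866-b25e1b60b0d3.2108.72541692687570.vmdk"), '
    '(flags: disklib-error) stack:',
    '2025-04-22T14:25:16.992Z Er(163) hbrsrv[5020520]: [Originator@6876 sub=Main]    '
    '[0] Set error flag: disklib-error',
    '2025-04-22T14:25:16.992Z Er(163) hbrsrv[5020520]: [Originator@6876 sub=Main]    '
    '[2] DiskLib error: There is not enough space on the file system for the selected operation',
]

if __name__ == "__main__":
    for error in summarize_hbr_errors(SAMPLE_LINES):
        print(error)
```

The datastore UUID in the result corresponds to the directory under /vmfs/volumes/ shown in the last line of the excerpt, which identifies the datastore whose free space should be checked.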

Environment

VMware vSphere Replication 8.x.x
VMware vSphere Replication 9.x.x

Cause

vSphere Replication requires enough disk space at the target site to replicate a VM.

If the target datastore does not have enough free space to store the replication files, the replication might fail.

Resolution

Recovery Point Objective (RPO) violation means that the replication process is not keeping up with changes on the source VM.

The stalled progress indicates that the disk consolidation, and potentially the replication process, is not working correctly.

• Examine the logs on the source ESXi host (/var/run/log/vmkernel.log) and on the vSphere Replication server (/var/log/vmware/hbrsrv.log).

• Examine the logs on the target ESXi hosts (/var/run/log/hbrsrv.log) for the tasks "consolidateStart", "consolidateProgress", and "consolidateComplete".

• Ensure the network connection between the source ESXi host and the target vSphere Replication server (or datastore) is working correctly.

• Ensure that the datastore on the target vSphere Replication server has sufficient free space.

• Disk consolidation can sometimes fail if there isn't enough space to merge the virtual disks.
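
The free-space check in the steps above can be sketched in Python as follows. This is a minimal illustration, not a VMware-provided procedure: the function names, the required_bytes estimate, and the 10% headroom default are assumptions chosen for the example, and whether shutil.disk_usage reports accurate figures for a given datastore mount is also an assumption that should be verified in your environment.

```python
import shutil


def has_enough_space(free_bytes, total_bytes, required_bytes, headroom=0.10):
    """Return True if required_bytes fits while keeping a fraction of capacity free.

    headroom is an illustrative safety margin, not a VMware recommendation.
    """
    reserve = int(total_bytes * headroom)
    return free_bytes - reserve >= required_bytes


def check_path(path, required_bytes, headroom=0.10):
    """Apply has_enough_space to the file system that contains path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return has_enough_space(usage.free, usage.total, required_bytes, headroom)


if __name__ == "__main__":
    # Hypothetical example: does the current directory's file system have
    # room for 1 GiB of replica data plus the headroom?
    print(check_path(".", 1024 ** 3))
```

In practice, path would point at the mount of the target datastore and required_bytes would be an estimate of the replica disk size plus the redo logs expected during consolidation.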

Additional Information

When replication is first configured, vSphere Replication performs a full sync – it sends all of the data that makes up the virtual machine to the target location to create the base disk of the replica.

After the initial full sync, only changed data is replicated – this process is typically called a delta sync. While a delta sync is in progress, the replicated data is stored in one or more redo logs at the target location. Redo logs are used to preserve the integrity of the replica. Once replication is complete, a new redo log is created for the next replication cycle. The old redo log is consolidated into the base disk.
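
The cycle described above can be illustrated with a toy model. The sketch below is purely conceptual and does not reflect the actual on-disk format or the vSphere Replication implementation: disks are modeled as Python dictionaries that map a block number to its contents, and the names consolidate and consolidate_chain are hypothetical.

```python
def consolidate(base_disk, redo_log):
    """Return a new base disk with the redo log's changed blocks applied."""
    merged = dict(base_disk)   # copy so the original base stays untouched
    merged.update(redo_log)    # blocks in the redo log overwrite base blocks
    return merged


def consolidate_chain(base_disk, redo_logs):
    """Apply a chain of redo logs, oldest first, to the base disk."""
    for redo_log in redo_logs:
        base_disk = consolidate(base_disk, redo_log)
    return base_disk


if __name__ == "__main__":
    base = {0: "A0", 1: "B0", 2: "C0"}   # result of the initial full sync
    first_delta = {1: "B1"}              # delta sync: only block 1 changed
    second_delta = {1: "B2", 2: "C1"}    # next cycle: blocks 1 and 2 changed
    print(consolidate_chain(base, [first_delta, second_delta]))
```

Even in this simplified model, the amount of work grows with the number of changed blocks and the number of redo logs in the chain, and the merged result needs room to be written, which is why both consolidation time and free space matter.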

Consolidation may take a very long time for very large disks.

The duration depends on how the changed data is spread across the disk and across the layers of the redo-log chain.

vSphere Replication itself handles the consolidation of redo logs into the base disk or into other redo logs, especially when multiple points in time (MPIT) recovery is enabled.