Replication intermittently fails because replicated data grows unexpectedly and fills the destination datastore

Products

VMware Live Recovery
VMware vSphere ESXi

Issue/Introduction

Symptoms:

After replication is enabled for a VM to the same site or cluster, space usage on the destination (replicated) datastore grows until the datastore is full, and replication fails with an error.
In the hostd.log file, you see entries similar to:

[YYYY-MM-DDTHH:MM:SS] error hostd[73940B70] [Originator@6876 sub=Hbrsvc] ReplicatedDisk: DiskLib failed to open path /vmfs/volumes/55342262-73e7cf76-####-##########78/VM/VM-01.vmdk(diskID=RDID-0f6caf00-0177-4af0-b46a-282c36461f57) (vmID=4) (groupID=GID-4d485f26-c579-4b06-8d31-85b28e09c0f4): Failed to lock the file. retry open disk, passed retry time=10 seconds
…..
[YYYY-MM-DDTHH:MM:SS] error hostd[73940B70] [Originator@6876 sub=Hbrsvc] ReplicatedDisk: DiskLib failed to open path /vmfs/volumes/55342262-73e7cf76-####-##########78/VM/VM-01.vmdk(diskID=RDID-0f6caf00-0177-4af0-b46a-282c36461f57) (vmID=4) (groupID=GID-4d485f26-c579-4b06-8d31-85b28e09c0f4): Failed to lock the file. retry open disk, passed retry time=40 seconds
[YYYY-MM-DDTHH:MM:SS] info hostd[73940B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 1378 : Sync started by VR Scheduler for virtual machine VM on host HOST.example.com in cluster HOST.example.com in ha-datacenter.

[YYYY-MM-DDTHH:MM:SS] info hostd[720C1B70] [Originator@6876 sub=Hbrsvc opID=cffd3655 user=System] HbrReconfigureInterceptor checking HBR-enabled config for VM 4 (VM)
[YYYY-MM-DDTHH:MM:SS] info hostd[737C2B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/55342262-73e7cf76-####-##########78/VM/VM-01.vmx opID=cffd3655 user=System] State Transition (VM_STATE_RECONFIGURING -> VM_STATE_ON)
[YYYY-MM-DDTHH:MM:SS] info hostd[737C2B70] [Originator@6876 sub=Vimsvc.ha-eventmgr opID=cffd3655 user=System] Event 1397 : Reconfigured VM on HOST.example.com in ha-datacenter
[YYYY-MM-DDTHH:MM:SS] info hostd[737C2B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/55342262-73e7cf76-####-##########78/VM/VM-01.vmx opID=cffd3655 user=System] Send config update invoked
[YYYY-MM-DDTHH:MM:SS] error hostd[72BCDB70] [Originator@6876 sub=Hbrsvc opID=ae930a55-d57d-48c5-aeeb-4c8fc0b9722d-HMSINT-18137-29-87-368f user=vpxuser:com.vmware.vcHms] Failed to retrieve replication configuration for VM 4 (VM): replication not enabled
[YYYY-MM-DDTHH:MM:SS] error hostd[FFA7AAE0] [Originator@6876 sub=Hbrsvc opID=7f6570ad-9af8-4957-9b46-e6f7e6e52da7-HMSINT-30-c2-96-d1aa user=vpxuser:com.vmware.vcHms] Failed to retrieve replication configuration for VM 4 (VM): replication not enabled.

Note: This log excerpt is an example. Date, time, and environmental variables may vary depending on your environment.
The preceding log entries show that replication for the VM failed because of a file lock. However, a check of the datastore size shows that there is no free space left to write to on that datastore.
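When the log is long, the lock failures can be picked out with a short script. The following is a minimal sketch in Python, assuming the entries follow the same layout as the excerpt above; the function name find_lock_failures and the regular expression are illustrative and not part of any VMware tool. Paths that contain spaces are not handled.

```python
import re

# Pattern derived from the hostd.log excerpt above. The field layout is an
# assumption based on that excerpt and may differ between builds.
LOCK_FAILURE = re.compile(
    r"sub=Hbrsvc\] ReplicatedDisk: DiskLib failed to open path (?P<path>\S+?)"
    r"\(diskID=(?P<disk_id>[^)]+)\).*?"
    r"Failed to lock the file\. retry open disk, "
    r"passed retry time=(?P<seconds>\d+) seconds"
)


def find_lock_failures(lines):
    """Return (path, disk_id, retry_seconds) for each lock-failure entry."""
    hits = []
    for line in lines:
        match = LOCK_FAILURE.search(line)
        if match:
            hits.append(
                (match["path"], match["disk_id"], int(match["seconds"]))
            )
    return hits
```

The function accepts any iterable of lines, so an open file object for a copy of hostd.log can be passed directly. A growing retry time for the same disk ID, as in the excerpt (10 seconds, then 40 seconds), suggests the lock is not being released.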

Environment

VMware vSphere Replication 6.5.x
VMware vSphere Replication 6.1.x
VMware vSphere Replication 6.0.x

Cause

This issue occurs when replication is enabled for a VM to the same site. It is widely observed when the source VM has a thin-provisioned disk.

How quickly the replication datastore runs out of space depends on the actual data on the source disk. It may take a few days to weeks to fill the destination datastore.

Changing the datastore or the datastore type does not help. For example, moving from a local datastore to iSCSI or Fibre Channel storage does not change the result.

Even though the source base disk is only 20 or 30 GB, the hbr disk on the destination keeps growing until the datastore is full. The cause was traced to the Unmap command issued in the guest OS, which attempts to fill all the disks.

Resolution

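To resolve this issue, disable Unmap in the guest OS by applying the following setting (in a Windows guest, this is typically done by running fsutil behavior set DisableDeleteNotify 1):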

DisableDeleteNotify=1

Where:

0 - indicates that the Trim and Unmap feature is on (enabled)
1 - indicates that the Trim and Unmap feature is off (disabled)
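To confirm the setting afterwards, the reported value can be interpreted with a small helper. This is a minimal sketch in Python, assuming a Windows guest where the value is read with fsutil behavior query DisableDeleteNotify and that the output contains text of the form DisableDeleteNotify = N. The exact output varies between Windows versions (newer versions report one line per file system), and the function name unmap_state is hypothetical.

```python
import re

# Meaning of each value, as listed in the Resolution section above.
STATES = {0: "enabled", 1: "disabled"}


def unmap_state(query_output):
    """Return the Trim/Unmap state for each DisableDeleteNotify value found.

    Assumes each value appears as 'DisableDeleteNotify = N' (spacing is
    optional). An empty list means no value was found in the text.
    """
    values = re.findall(r"DisableDeleteNotify\s*=\s*([01])", query_output)
    return [STATES[int(value)] for value in values]
```

The helper only interprets text that is passed to it and does not run any command itself. After the resolution is applied, every entry in the returned list would be expected to read "disabled".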

To work around this issue, stop replication, ensure that the replicated data on the remote site is deleted, and then reconfigure replication for the VM.

Additional Information

Impact/Risks:
The destination datastore runs out of free space.