Replication sometimes failed due to disk full by unexpectedly increasing replicated data

Article ID: 342605


Products

VMware Live Recovery, VMware vSphere ESXi

Issue/Introduction

Symptoms:
  • When replication is enabled for a VM to the same site or cluster, the destination (replicated) datastore fills up completely and replication fails with an error.
  • In the hostd.log file, you see entries similar to:

    2016-08-31T03:42:06.389Z error hostd[73940B70] [Originator@6876 sub=Hbrsvc] ReplicatedDisk: DiskLib failed to open path /vmfs/volumes/55342262-73e7cf76-efa3-000af7727378/SILV-MENTUM-01/SILV-MENTUM-01.vmdk(diskID=RDID-0f6caf00-0177-4af0-b46a-282c36461f57) (vmID=4) (groupID=GID-4d485f26-c579-4b06-8d31-85b28e09c0f4): Failed to lock the file. retry open disk, passed retry time=10 seconds
    …..
    2016-08-31T03:42:41.042Z error hostd[73940B70] [Originator@6876 sub=Hbrsvc] ReplicatedDisk: DiskLib failed to open path /vmfs/volumes/55342262-73e7cf76-efa3-000af7727378/SILV-MENTUM-01/SILV-MENTUM-01.vmdk(diskID=RDID-0f6caf00-0177-4af0-b46a-282c36461f57) (vmID=4) (groupID=GID-4d485f26-c579-4b06-8d31-85b28e09c0f4): Failed to lock the file. retry open disk, passed retry time=40 seconds
    2016-08-31T03:42:41.052Z info hostd[73940B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 1378 : Sync started by VR Scheduler for virtual machine SILV-MENTUM-01 on host SILV-ESX-01.vmware.com in cluster SILV-ESX-01.vmware.com in ha-datacenter.

    2016-01T06:01:25.020Z info hostd[720C1B70] [Originator@6876 sub=Hbrsvc opID=cffd3655 user=System] HbrReconfigureInterceptor checking HBR-enabled config for VM 4 (SILV-MENTUM-01)
    2016-09-01T06:01:25.919Z info hostd[737C2B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/55342262-73e7cf76-efa3-000af7727378/SILV-MENTUM-01/SILV-MENTUM-01.vmx opID=cffd3655 user=System] State Transition (VM_STATE_RECONFIGURING -> VM_STATE_ON)
    2016-09-01T06:01:25.921Z info hostd[737C2B70] [Originator@6876 sub=Vimsvc.ha-eventmgr opID=cffd3655 user=System] Event 1397 : Reconfigured SILV-MENTUM-01 on SILV-ESX-01.vmware.com in ha-datacenter
    2016-09-01T06:01:25.921Z info hostd[737C2B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/55342262-73e7cf76-efa3-000af7727378/SILV-MENTUM-01/SILV-MENTUM-01.vmx opID=cffd3655 user=System] Send config update invoked
    2016-09-01T06:01:44.778Z error hostd[72BCDB70] [Originator@6876 sub=Hbrsvc opID=ae930a55-d57d-48c5-aeeb-4c8fc0b9722d-HMSINT-18137-29-87-368f user=vpxuser:com.vmware.vcHms] Failed to retrieve replication configuration for VM 4 (SILV-MENTUM-01): replication not enabled
    2016-09-19T06:35:58.155Z error hostd[FFA7AAE0] [Originator@6876 sub=Hbrsvc opID=7f6570ad-9af8-4957-9b46-e6f7e6e52da7-HMSINT-30-c2-96-d1aa user=vpxuser:com.vmware.vcHms] Failed to retrieve replication configuration for VM 4 (SILV-MENTUM-01): replication not enabled.


    Note: This log excerpt is an example. Date, time, and environmental variables may vary depending on your environment.
  • These log entries suggest that replication for the VM failed because the disk file could not be locked. However, reviewing the destination datastore shows that it has no free space left to write to.
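To check whether a host shows the same pattern, the hostd.log entries can be counted rather than read by eye. The following Python sketch is illustrative only: the script name, the helper `scan_hostd`, and the two match patterns are hypothetical and are copied from the example excerpt above, so messages in your environment may differ. On ESXi the log is typically at /var/log/hostd.log, but treat that path as an assumption.

```python
import re
import sys
from collections import Counter

# Hypothetical symptom patterns, copied from the hostd.log excerpt in this article.
PATTERNS = {
    "lock_failure": re.compile(
        r"ReplicatedDisk: DiskLib failed to open path .*Failed to lock the file"
    ),
    "replication_not_enabled": re.compile(
        r"Failed to retrieve replication configuration for VM .*replication not enabled"
    ),
}

# Leading "<timestamp> <level> hostd[" prefix of a hostd.log entry.
LINE_RE = re.compile(r"^(?P<ts>\S+)\s+(?P<level>\w+)\s+hostd\[")


def scan_hostd(lines):
    """Count symptom lines and keep the first and last timestamp seen for each.

    Assumes the input is in chronological order, as a log file normally is.
    """
    counts = Counter()
    spans = {}
    for raw in lines:
        line = raw.strip()
        head = LINE_RE.match(line)
        if head is None:
            continue  # not a hostd entry (blank line, continuation, etc.)
        ts = head.group("ts")
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                first = spans.get(name, (ts, ts))[0]
                spans[name] = (first, ts)
    return counts, spans


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (path is an assumption): python scan_hostd.py /var/log/hostd.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        counts, spans = scan_hostd(fh)
    for name, n in counts.items():
        print(f"{name}: {n} line(s), first {spans[name][0]}, last {spans[name][1]}")
```

Lock failures alone do not prove the datastore is full; the point of the scan is to find the time window, which you can then compare against the datastore's free-space history.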
     


Environment

VMware vSphere Replication 6.5.x
VMware vSphere Replication 6.1.x
VMware vSphere Replication 6.0.x

Cause

This issue occurs when replication is enabled for a VM to the same site. It has been observed most often when the source VM uses thin-provisioned disks.
 
How quickly the replication datastore runs out of space depends on the actual data on the source disk; filling the destination datastore can take anywhere from a few days to a few weeks.
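Because the time to fill depends on how fast the destination grows, a rough projection from two used-space readings can help judge how urgently to act. The Python sketch below is illustrative only: the function names and the sample figures (500 GB capacity, 170 GB then 200 GB used two days apart) are hypothetical, and it assumes linear growth, which UNMAP-driven inflation may not follow.

```python
def growth_rate(used_then_gb, used_now_gb, days_between):
    """Average growth in GB/day between two used-space readings."""
    if days_between <= 0:
        raise ValueError("days_between must be positive")
    return (used_now_gb - used_then_gb) / days_between


def days_until_full(capacity_gb, used_gb, growth_gb_per_day):
    """Linear projection of days until the datastore has no free space.

    Returns 0.0 if it is already full and None if usage is not growing.
    """
    free_gb = capacity_gb - used_gb
    if free_gb <= 0:
        return 0.0
    if growth_gb_per_day <= 0:
        return None
    return free_gb / growth_gb_per_day


# Hypothetical readings: 170 GB used two days ago, 200 GB used now, 500 GB capacity.
rate = growth_rate(170, 200, 2)            # 15.0 GB/day
print(days_until_full(500, 200, rate))     # 20.0
```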
 
Changing the datastore or the datastore type (for example, moving from a local datastore to iSCSI or Fibre Channel storage) does not change the result.
 
Even when the source base disk is only 20 or 30 GB, the hbr-disk on the destination keeps growing until the datastore is full. UNMAP commands were found to be causing the disks to fill completely.

Resolution

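To resolve this issue, disable UNMAP in the guest OS. In a Windows guest, set DisableDeleteNotify to 1 by running this command from an elevated command prompt:

fsutil behavior set DisableDeleteNotify 1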
 
Where:
  • 0 - indicates that the Trim and Unmap feature is on (enabled)
  • 1 - indicates the Trim and Unmap feature is off (disabled)
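To confirm the change inside a Windows guest, the current value can be read with `fsutil behavior query DisableDeleteNotify`. The Python sketch below parses that output; the helper names are hypothetical, and the two output shapes it accepts ("DisableDeleteNotify = 1" and the per-file-system "NTFS DisableDeleteNotify = 0  (Disabled)") are assumptions about the Windows build. On builds that print a "(Disabled)" suffix, that word describes DisableDeleteNotify itself, so a value of 0 still means Trim and Unmap are on.

```python
import re
import subprocess

# Matches "DisableDeleteNotify = 1" and "NTFS DisableDeleteNotify = 0  (Disabled)".
# The exact fsutil output format is an assumption and may vary by Windows build.
_FSUTIL_RE = re.compile(
    r"^\s*(?:(?P<fs>\w+)\s+)?DisableDeleteNotify\s*=\s*(?P<value>[01])\b",
    re.MULTILINE,
)

# Value meanings as listed in this article.
MEANING = {
    0: "Trim and Unmap feature is on (enabled)",
    1: "Trim and Unmap feature is off (disabled)",
}


def parse_disable_delete_notify(output):
    """Map each file system label ("default" when none is printed) to 0 or 1."""
    return {
        (m.group("fs") or "default"): int(m.group("value"))
        for m in _FSUTIL_RE.finditer(output)
    }


def unmap_disabled(values):
    """True only if at least one value was found and every value is 1."""
    return bool(values) and all(v == 1 for v in values.values())


def query_windows_guest():
    """Run inside the Windows guest with administrator rights; not called here."""
    result = subprocess.run(
        ["fsutil", "behavior", "query", "DisableDeleteNotify"],
        capture_output=True, text=True, check=True,
    )
    return parse_disable_delete_notify(result.stdout)
```

If `unmap_disabled(query_windows_guest())` returns False, at least one file system still has Trim and Unmap enabled.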

To work around this issue, stop replication, make sure the replicated data on the remote site is deleted, and then reconfigure replication for the VM.
 


Additional Information

Impact/Risks:
The destination datastore runs out of space.