Virtual machine disks hang during vSphere Replication when the guest is running VMware Tools version 11.x.

Products

VMware vSphere ESXi

Issue/Introduction

This issue is observed during replication, whether with vSphere Replication or with a third-party replication solution such as Veeam.

Symptoms:

After upgrading VMware Tools to version 11.x, the VM disks become inaccessible during SRM replication.
When replication for the virtual machine is initiated, the data disk hangs and remains inaccessible until replication is stopped.
In the Windows event logs, PVSCSI errors can be seen: “Event ID 129 Reset to device, \Device\RaidPort0, was issued.”
The issue is not observed when the virtual machine is running VMware Tools version 10.2.x or 10.3.x.

vmware.log

YYYY-MM-DDTHH:MM:SSZ| vcpu-0| I125: HBACommon: First write on scsi1:0.fileName='/vmfs/volumes/Datastore_name/vm_folder/vm_name_1.vmdk'
YYYY-MM-DDTHH:MM:SSZ| vcpu-0| I125: DDB: "longContentID" = "################################" (was "#############################")
YYYY-MM-DDTHH:MM:SSZ| vcpu-0| I125: DISKLIB-CHAIN : DiskChainUpdateContentID: old=0x42a5825c, new=0x50124967 (###############################)
YYYY-MM-DDTHH:MM:SSZ| vcpu-0| I125: HBACommon: First write on scsi0:0.fileName='/vmfs/volumes/Datastore_name/vm_folder/vm_name2.vmdk'
YYYY-MM-DDTHH:MM:SSZ| vcpu-0| I125: DDB: "longContentID" = "############################" (was "################################")
YYYY-MM-DDTHH:MM:SSZ| vcpu-0| I125: DISKLIB-CHAIN : DiskChainUpdateContentID: old=0x3b36b977, new=0x554f92be (################################)
YYYY-MM-DDTHH:MM:SSZ| vcpu-1| I125: HBACommon: First write on scsi0:2.fileName='/vmfs/volumes/Datastore_name/vm_folder/vm_name_2_2.vmdk'
YYYY-MM-DDTHH:MM:SSZ| vcpu-1| I125: DDB: "longContentID" = "############################" (was "#############################")
YYYY-MM-DDTHH:MM:SSZ| vcpu-1| I125: DISKLIB-CHAIN : DiskChainUpdateContentID: old=0xe76#####, new=0x7fa#### (#############################)
YYYY-MM-DDTHH:MM:SSZ| vmx| I125: GuestRpcSendTimedOut: message to toolbox timed out.
YYYY-MM-DDTHH:MM:SSZ| vmx| I125: GuestRpcSendTimedOut: message to toolbox timed out.
YYYY-MM-DDTHH:MM:SSZ| vmx| I125: GuestRpcSendTimedOut: message to toolbox timed out.
YYYY-MM-DDTHH:MM:SSZ| vmx| I125: GuestRpc: app toolbox's second ping timeout; assuming app is down

vmkernel.log (on the host where the VM is running)

YYYY-MM-DDTHH:MM:SSZ cpu1:5350368)Hbr: 1198: File hbrtmp.1.6660 (groupID=GID-########-####-####-####-############) (offset=0) already exists on server and is identical
YYYY-MM-DDTHH:MM:SSZ cpu1:5350368)Hbr: 1198: File hbrtmp.2.41 (groupID=GID-########-####-####-####-############) (offset=0) already exists on server and is identical
YYYY-MM-DDTHH:MM:SSZ cpu7:2099708)Hbr: 2250: Prepared delta (diskID=RDID-########-####-####-####-############) (numExtentsToTransfer=323760)
YYYY-MM-DDTHH:MM:SSZ cpu7:2099708)Hbr: 2250: Prepared delta (diskID=RDID-########-####-####-####-############) (numExtentsToTransfer=7852153)
YYYY-MM-DDTHH:MM:SSZ cpu7:2099708)Hbr: 2250: Prepared delta (diskID=RDID-########-####-####-####-############) (numExtentsToTransfer=22991067)
YYYY-MM-DDTHH:MM:SSZ cpu24:5350431)J6: 2651: 'Prdsan05_56_PRDDB10_Log_SCUReport': Exiting async journal replay manager world

Enable debug logging for VMware Tools:

[YYYY-MM-DDTHH:MM:SSZ] [   debug] [vmsvc] VMTools_ConfigGetBoolean:Returning default value for '[guestinfo] diskinfo-report-uuid'=TRUE (Not founderr=4).
[YYYY-MM-DDTHH:MM:SSZ] [   debug] [guestinfo] GetVolumeUUID:'\\?\Volume{########-####-####-####-############}' is SCSI
[YYYY-MM-DDTHH:MM:SSZ] [   debug] [guestinfo] GetVolumeUUID:SerialNumberOffset(0) isn't valid
[YYYY-MM-DDTHH:MM:SSZ] [    info] [vmsvc] tools service recovered from a hang.
[YYYY-MM-DDTHH:MM:SSZ] [    info] [vmsvc] tools hang detector time sequence 1.00s, 1.02s, 1.01s, 1.02s, 1.00s.
[YYYY-MM-DDTHH:MM:SSZ] [    info] [vmsvc] tools service hung.
[YYYY-MM-DDTHH:MM:SSZ] [   debug] [guestinfo] GetVolumeUUID:'\\?\Volume{########-####-####-####-############}' is SCSI
[YYYY-MM-DDTHH:MM:SSZ] [   debug] [guestinfo] GetVolumeUUID:SerialNumberOffset(0) isn't valid
[YYYY-MM-DDTHH:MM:SSZ] [   debug] [guestinfo] GetVolumeUUID:'\\?\Volume{########-####-####-####-############}' is SCSI
[YYYY-MM-DDTHH:MM:SSZ] [   debug] [guestinfo] GetVolumeUUID:SerialNumberOffset(0) isn't valid
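The log signatures quoted above can be searched for with a short script rather than by eye. The following is a minimal sketch in Python, not a VMware-provided tool: the function name count_signatures and the signature labels are hypothetical, and the patterns are copied only from the excerpts in this article, so other builds may word these messages differently.

```python
import re
from collections import Counter

# Patterns copied from the vmware.log and VMware Tools debug log excerpts
# in this article. The dictionary keys are illustrative labels only.
SIGNATURES = {
    "guestrpc_timeout": re.compile(r"GuestRpcSendTimedOut: message to toolbox timed out"),
    "toolbox_down": re.compile(r"second ping timeout; assuming app is down"),
    "tools_hung": re.compile(r"tools service hung"),
    "uuid_lookup": re.compile(r"GetVolumeUUID:"),
}


def count_signatures(log_text: str) -> Counter:
    """Count how many lines of log_text match each known signature."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

To use it, read the contents of vmware.log or the VMware Tools debug log into a string and pass it to count_signatures. Non-zero counts for guestrpc_timeout or tools_hung alongside uuid_lookup entries are consistent with the symptoms described here, but they are not proof of this specific issue on their own.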

Environment

VMware vSphere ESXi 6.7
VMware Tools 11.x

Cause

The disk hangs while VMware Tools attempts to retrieve the disk UUIDs during the replication process, and the request times out after 60 seconds.

Resolution

This issue is fixed in VMware Tools version 11.1.5.

Workaround:

In tools.conf on the affected guest, add the following:

[guestinfo]
diskinfo-report-uuid=false

vmtoolsd picks up the change within 5 seconds; a restart is not needed.
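When the workaround has to be applied to many guests, the edit to tools.conf can be scripted. The following is a minimal sketch in Python under stated assumptions: set_option is a hypothetical helper, not part of VMware Tools, and it works on the file contents as text so that existing sections and comments are preserved. It treats tools.conf as a simple INI-style file with [section] headers and key=value lines.

```python
def set_option(conf_text: str, section: str, key: str, value: str) -> str:
    """Return conf_text with key=value set in [section], adding either if missing."""
    header = f"[{section}]"
    out = []
    in_section = False
    key_written = False
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Leaving the target section without finding the key: add it now.
            if in_section and not key_written:
                out.append(f"{key}={value}")
                key_written = True
            in_section = stripped == header
        elif in_section and not stripped.startswith("#") and "=" in stripped:
            if stripped.split("=", 1)[0].strip() == key:
                # Replace the first occurrence; drop any later duplicates.
                if not key_written:
                    out.append(f"{key}={value}")
                    key_written = True
                continue
        out.append(line)
    if not key_written:
        # Either the file ended inside the target section, or the section
        # does not exist yet and has to be created.
        if not in_section:
            out.append(header)
        out.append(f"{key}={value}")
    return "\n".join(out) + "\n"
```

A typical call would be set_option(text, "guestinfo", "diskinfo-report-uuid", "false"), where text is the current contents of tools.conf, with the result written back to the same file. The file location is not stated in this article; on Windows guests it is commonly C:\ProgramData\VMware\VMware Tools\tools.conf, but verify the path for your guest operating system and Tools version before writing.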