[INTERNAL] vMotion Fails for VMs running on VMFS-6 Datastore

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

vMotion fails at 20-21%.
The issue occurs on ESXi 6.7 hosts with VMFS-6 datastores.
The vmware.log file for the affected VM contains the following errors:

2019-10-17T13:35:58.202Z| vmx| I125: Received migrate 'to' request for mid id 2384187699948208900, src ip <xx.xxx.xxx.xx>, dst ip <xx.xxx.xxx.xx>(invalidate source config).
2019-10-17T13:35:58.203Z| vmx| A100: ConfigDB: Setting vmotion.checkpointSVGASize = "9961472"
2019-10-17T13:36:06.429Z| vmx| W115: FILE: FileIO_Lock on '/vmfs/volumes/XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXXXXX/VMNAME/VMNAME.vmx' failed: Lock timed out
2019-10-17T13:36:06.430Z| vmx| I125: Msg_Reset:
2019-10-17T13:36:06.430Z| vmx| I125: [msg.configdb.open] An error occurred while opening configuration file "/vmfs/volumes/XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXXXXX/VMNAME/VMNAME.vmx": Failed to lock the file.
2019-10-17T13:37:36.622Z| vmx| W115: FILE: FileIO_Lock on '/vmfs/volumes/XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXXXXX/VMNAME/VMNAME.vmx' failed: Lock timed out
2019-10-17T13:37:36.623Z| vmx| I125: Msg_Reset:
2019-10-17T13:37:36.623Z| vmx| I125: [msg.configdb.open] An error occurred while opening configuration file "/vmfs/volumes/XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXXXXX/VMNAME/VMNAME.vmx": Failed to lock the file.
2019-10-17T13:37:36.623Z| vmx| I125: ----------------------------------------
2019-10-17T13:37:36.623Z| vmx| W115: Migrate: Failed to write out config file.
2019-10-17T13:37:36.623Z| vmx| I125: Migrate: Caching migration error message list:
2019-10-17T13:37:36.623Z| vmx| I125: [msg.migrate.expired] Timed out waiting for migration start request.
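
To check whether a given vmware.log matches this signature, the file can be scanned for the three messages in the excerpt above. The following is a minimal Python sketch under stated assumptions, not a supported VMware tool: the function names and the signature labels are hypothetical, and the patterns come only from the excerpt in this article, so message wording on other builds may differ.

```python
import re
import sys

# Patterns copied from the vmware.log excerpt in this article.
# The label names are illustrative only, not VMware terminology.
SIGNATURES = {
    "lock_timeout": re.compile(
        r"FileIO_Lock on '(?P<path>[^']+\.vmx)' failed: Lock timed out"
    ),
    "config_write_failed": re.compile(r"Migrate: Failed to write out config file"),
    "migration_expired": re.compile(r"\[msg\.migrate\.expired\]"),
}


def find_vmotion_lock_signature(log_text):
    """Return a dict mapping each signature label to the log lines that match it."""
    hits = {label: [] for label in SIGNATURES}
    for line in log_text.splitlines():
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits[label].append(line)
    return hits


def matches_issue(hits):
    """Treat the log as a match only if every signature appears at least once."""
    return all(hits[label] for label in hits)


# Optional command-line use: python scan_vmware_log.py /path/to/vmware.log
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as handle:
        result = find_vmotion_lock_signature(handle.read())
    for label, lines in result.items():
        print(f"{label}: {len(lines)} line(s)")
    print("matches this article:", matches_issue(result))
```

Requiring all three messages is a deliberately conservative choice: a lock timeout on the .vmx file alone can have other causes, so a partial match should be read as inconclusive.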

The VMkernel log on the source host reports the following errors:

2019-10-17T13:36:26.478Z cpu0:2100106)DLX: 4949: vol 'DATASTORE_NAME', lock at 154738688: [Req mode: 1] Not free:
2019-10-17T13:36:26.478Z cpu0:2100106)[type 10c00001 offset 154738688 v 4701, hb offset 3641344
gen 4865, mode 1, owner XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXXXXX mtime 571270
num 0 gblnum 0 gblgen 0 gblbrk 0] alloc owner 4063232
2019-10-17T13:36:30.484Z cpu4:2100106)DLX: 4949: vol 'DATASTORE_NAME', lock at 154738688: [Req mode: 1] Not free:
2019-10-17T13:36:30.484Z cpu4:2100106)[type 10c00001 offset 154738688 v 4701, hb offset 3641344
gen 4865, mode 1, owner XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXXXXX mtime 571270
num 0 gblnum 0 gblgen 0 gblbrk 0] alloc owner 4063232

The VMkernel log on the destination host reports the following errors:

2019-10-17T13:35:45.767Z cpu19:2175769)HBX: 6416: 'DATASTORE_NAME': HB at offset 3551232 - Marking HB:
2019-10-17T13:35:45.767Z cpu19:2175769) [HB state abcdef04 offset 3551232 gen 7 stampUS 131655160 uuid XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXXXXX jrnl drv 24.82 lockImpl 4 ip xx.xxx.xxx.xx]
2019-10-17T13:36:01.768Z cpu18:2175769)HBX: 6433: 'DATASTORE_NAME': HB at offset 3551232 - Skipping replay as HB is being replayed by another live host:
2019-10-17T13:36:01.768Z cpu18:2175769) [HB state abcdef04 offset 3551232 gen 7 stampUS 131655160 uuid XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXXXXX jrnl drv 24.82 lockImpl 4 ip xx.xxx.xxx.xx]
2019-10-17T13:36:01.768Z cpu18:2175769)Res3: 2328: Rank violation threshold reached: cid 0xc1d00002, resType 1, cnum 5 vol DATASTORE_NAME
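
The VMkernel excerpts above can be summarized in the same way, which helps show that one lock offset is repeatedly reported as not free and which datastore is affected. This is a rough Python sketch, assuming the message formats match the excerpts in this article. The function name `summarize_vmkernel` and the result keys are hypothetical. The numbers after `DLX:`, `HBX:`, and `Res3:` are matched as arbitrary digits, on the assumption that they can differ between builds.

```python
import re
from collections import Counter

# Patterns taken from the VMkernel excerpts in this article.
DLX_NOT_FREE = re.compile(
    r"DLX: \d+: vol '(?P<vol>[^']+)', lock at (?P<offset>\d+): "
    r"\[Req mode: \d+\] Not free"
)
HB_SKIP_REPLAY = re.compile(
    r"HBX: \d+: '(?P<vol>[^']+)': HB at offset (?P<offset>\d+) - "
    r"Skipping replay as HB is being replayed by another live host"
)
RANK_VIOLATION = re.compile(
    r"Res3: \d+: Rank violation threshold reached: .* vol (?P<vol>\S+)"
)


def summarize_vmkernel(log_text):
    """Count occurrences of each signature, keyed by volume (and offset where present)."""
    stuck_locks = Counter()       # (volume, lock offset) -> number of "Not free" reports
    skipped_replays = Counter()   # (volume, heartbeat offset) -> number of skipped replays
    rank_violations = Counter()   # volume -> number of rank violation messages
    for line in log_text.splitlines():
        match = DLX_NOT_FREE.search(line)
        if match:
            stuck_locks[(match["vol"], int(match["offset"]))] += 1
            continue
        match = HB_SKIP_REPLAY.search(line)
        if match:
            skipped_replays[(match["vol"], int(match["offset"]))] += 1
            continue
        match = RANK_VIOLATION.search(line)
        if match:
            rank_violations[match["vol"]] += 1
    return {
        "stuck_locks": stuck_locks,
        "skipped_replays": skipped_replays,
        "rank_violations": rank_violations,
    }
```

A lock offset that appears many times under `stuck_locks` on the source host, together with a skipped heartbeat replay for the same volume on the destination host, is consistent with the symptoms described here. The sketch only counts messages and does not by itself confirm the cause.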

Environment

VMware vSphere ESXi 7.0.0
VMware vSphere ESXi 6.7

Cause

When a host loses storage connectivity, crashes, or reboots abruptly, other hosts in the cluster can replay its journal in parallel, which can cause VMFS metadata inconsistencies.
When a host that is replaying a journal itself loses storage connectivity, crashes, or reboots, the original host whose journal was being replayed can never reclaim its heartbeat, which leaves a hung lock.

Resolution

VMware Engineering is aware of this issue and is working on a fix.

Workaround:
Resolve the underlying storage connectivity problem to prevent the issue from occurring.
Use Storage vMotion to migrate the VMs to VMFS-5 datastores.