vMotion Fails for VMs running on VMFS-6 Datastore



Article ID: 338283


Products

VMware vSphere ESXi

Issue/Introduction

  • vMotion fails at 20-21%.
  • The issue occurs on VMFS-6 datastores.
  • The vmware.log file of the affected VM reports the following errors:

2019-10-17T13:35:58.202Z| vmx| I125: Received migrate 'to' request for mid id #################### , src ip <xx.xxx.xxx.xx>, dst ip <##.###.###.##>(invalidate source config).
2019-10-17T13:35:58.203Z| vmx| A100: ConfigDB: Setting vmotion.checkpointSVGASize = "9961472"
2019-10-17T13:36:06.429Z| vmx| W115: FILE: FileIO_Lock on '/vmfs/volumes/###########-###########-####-###########/VMNAME/VMNAME.vmx' failed: Lock timed out
2019-10-17T13:36:06.430Z| vmx| I125: Msg_Reset:
2019-10-17T13:36:06.430Z| vmx| I125: [msg.configdb.open] An error occurred while opening configuration file "/vmfs/volumes/###########-###########-####-###########/VMNAME/VMNAME.vmx": Failed to lock the file.
2019-10-17T13:37:36.622Z| vmx| W115: FILE: FileIO_Lock on '/vmfs/volumes/###########-###########-####-###########/VMNAME/VMNAME.vmx' failed: Lock timed out
2019-10-17T13:37:36.623Z| vmx| I125: Msg_Reset:
2019-10-17T13:37:36.623Z| vmx| I125: [msg.configdb.open] An error occurred while opening configuration file "/vmfs/volumes/###########-###########-####-###########/VMNAME/VMNAME.vmx": Failed to lock the file.
2019-10-17T13:37:36.623Z| vmx| I125: ----------------------------------------
2019-10-17T13:37:36.623Z| vmx| W115: Migrate: Failed to write out config file.
2019-10-17T13:37:36.623Z| vmx| I125: Migrate: Caching migration error message list:
2019-10-17T13:37:36.623Z| vmx| I125: [msg.migrate.expired] Timed out waiting for migration start request.

  • The VMkernel log on the source host reports the following errors:

2019-10-17T13:36:26.478Z cpu0:2100106)DLX: 4949: vol 'DATASTORE_NAME', lock at 154738688: [Req mode: 1] Not free:
2019-10-17T13:36:26.478Z cpu0:2100106)[type 10c00001 offset 154738688 v 4701, hb offset 3641344
gen 4865, mode 1, owner ###########-###########-####-########### mtime 571270
num 0 gblnum 0 gblgen 0 gblbrk 0] alloc owner 4063232
2019-10-17T13:36:30.484Z cpu4:2100106)DLX: 4949: vol 'DATASTORE_NAME', lock at 154738688: [Req mode: 1] Not free:
2019-10-17T13:36:30.484Z cpu4:2100106)[type 10c00001 offset 154738688 v 4701, hb offset 3641344
gen 4865, mode 1, owner ###########-###########-####-########### mtime 571270
num 0 gblnum 0 gblgen 0 gblbrk 0] alloc owner 4063232
 

  • The VMkernel log on the destination host reports the following errors:

2019-10-17T13:35:45.767Z cpu19:2175769)HBX: 6416: 'DATASTORE_NAME': HB at offset 3551232 - Marking HB:
2019-10-17T13:35:45.767Z cpu19:2175769)  [HB state abcdef04 offset 3551232 gen 7 stampUS 131655160 uuid ###########-###########-####-########### jrnl  drv 24.82 lockImpl 4 ip ##.###.###.##]
2019-10-17T13:36:01.768Z cpu18:2175769)HBX: 6433: 'DATASTORE_NAME': HB at offset 3551232 - Skipping replay as HB is being replayed by another live host:
2019-10-17T13:36:01.768Z cpu18:2175769)  [HB state abcdef04 offset 3551232 gen 7 stampUS 131655160 uuid ###########-###########-####-########### jrnl  drv 24.82 lockImpl 4 ip ##.###.###.##]
2019-10-17T13:36:01.768Z cpu18:2175769)Res3: 2328: Rank violation threshold reached: cid 0xc1d00002, resType 1, cnum 5 vol DATASTORE_NAME
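The symptoms above can be searched for mechanically. The following Python script is a minimal sketch, not a VMware tool: its regular expressions are derived only from the log excerpts in this article and may not match output from other ESXi builds, and the signature names (vmx_lock_timeout and so on) and the script name scan_vmfs_lock.py are hypothetical. A match suggests, but does not confirm, that a VM is affected by this issue.

```python
import re
import sys
from collections import Counter

# Hypothetical signature names; each pattern is taken from the log excerpts in this article.
SIGNATURES = {
    # vmware.log: the VMX cannot lock its own .vmx file during migration.
    "vmx_lock_timeout": re.compile(r"FileIO_Lock on '([^']+)' failed: Lock timed out"),
    # vmware.log: the migration gave up waiting for the start request.
    "vmx_migrate_expired": re.compile(r"\[msg\.migrate\.expired\]"),
    # Source vmkernel log: a VMFS lock requested in mode 1 is reported as not free.
    "dlx_lock_not_free": re.compile(
        r"DLX: \d+: vol '([^']+)', lock at (\d+): \[Req mode: \d+\] Not free"
    ),
    # Destination vmkernel log: replay skipped because another live host is replaying the HB.
    "hbx_replay_skipped": re.compile(
        r"HBX: \d+: '([^']+)': HB at offset (\d+) - Skipping replay as HB is being "
        r"replayed by another live host"
    ),
    # Destination vmkernel log: resource rank violation reported for the volume.
    "res3_rank_violation": re.compile(r"Rank violation threshold reached: .* vol (\S+)"),
}


def scan(lines):
    """Return (counts, details): match counts per signature and the captured groups."""
    counts = Counter()
    details = {name: [] for name in SIGNATURES}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            match = pattern.search(line)
            if match:
                counts[name] += 1
                details[name].append(match.groups())
    return counts, details


if __name__ == "__main__":
    # Usage: python scan_vmfs_lock.py vmware.log vmkernel.log
    for path in sys.argv[1:]:
        with open(path, errors="replace") as f:
            counts, details = scan(f)
        print(path)
        for name, n in counts.items():
            print(f"  {name}: {n} match(es), e.g. {details[name][0]}")
```

Run it against the affected VM's vmware.log and against the VMkernel log of the source and destination hosts (typically /var/log/vmkernel.log on ESXi). The captured datastore name and lock offset help correlate the source-side "Not free" lock with the destination-side heartbeat messages.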

Environment

VMware vSphere ESXi 7.x

Cause

  • When a host loses storage connectivity, crashes, or reboots abruptly, other hosts in the cluster can replay its journal in parallel, causing VMFS metadata inconsistencies.
  • When a host that is replaying another host's journal itself loses storage connectivity, crashes, or reboots, the original host whose journal was being replayed can never reclaim its heartbeat, leaving a hung lock.
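The second failure mode can be illustrated with a deliberately simplified toy model. This is not how VMFS implements heartbeats or journal replay: the Heartbeat and Cluster classes and their methods are hypothetical and encode only the behavior described above. A returning host cannot reclaim its heartbeat while a replay is pending, and a replay claimed by a host that then fails is never completed. The parallel-replay race from the first bullet is not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Heartbeat:
    """Toy heartbeat record (hypothetical, not the on-disk VMFS format)."""
    owner: str
    replayer: Optional[str] = None  # host currently replaying the owner's journal


@dataclass
class Cluster:
    live: Set[str] = field(default_factory=set)
    heartbeats: Dict[str, Heartbeat] = field(default_factory=dict)

    def host_fails(self, host: str) -> None:
        # Storage loss, crash, or abrupt reboot: the host stops participating.
        self.live.discard(host)

    def start_replay(self, replayer: str, failed: str) -> bool:
        hb = self.heartbeats[failed]
        if hb.replayer is not None:
            return False  # another host already claimed this replay
        hb.replayer = replayer
        return True

    def finish_replay(self, replayer: str, failed: str) -> None:
        hb = self.heartbeats[failed]
        # Only a replayer that is still live can complete the replay and release the claim.
        if hb.replayer == replayer and replayer in self.live:
            hb.replayer = None

    def reclaim(self, host: str) -> bool:
        # The original owner may reclaim its heartbeat only when no replay is pending.
        hb = self.heartbeats[host]
        if hb.replayer is None:
            self.live.add(host)
            return True
        return False


if __name__ == "__main__":
    cluster = Cluster(live={"A", "B"}, heartbeats={h: Heartbeat(h) for h in "AB"})
    cluster.host_fails("A")          # host A loses storage connectivity
    cluster.start_replay("B", "A")   # host B starts replaying A's journal
    cluster.host_fails("B")          # host B fails mid-replay
    cluster.finish_replay("B", "A")  # no effect: B is no longer live
    print(cluster.reclaim("A"))      # prints False: A's heartbeat stays claimed
```

If host B instead finishes the replay before failing, reclaim("A") returns True in this model; the hung state arises only when the replaying host disappears while its claim is outstanding.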

Resolution

Workaround:
  • Resolve the underlying storage connectivity issues to prevent the problem from recurring.
  • Use Storage vMotion to migrate the affected VMs to VMFS-5 datastores.