vMotion fails for virtual machines running on a VMFS-6 datastore

Article ID: 338283

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

  • vMotion of virtual machines fails at 20-21%.
  • The issue is specific to VMFS-6 datastores.
  • /vmfs/volumes/<Datastore name>/<VM name>/vmware.log for the affected VM shows the following errors:

yyyy-mm-ddThh:mm:ss.Z| vmx| I125: Received migrate 'to' request for mid id #################### , src ip <xx.xxx.xxx.xx>, dst ip <##.###.###.##>(invalidate source config).
yyyy-mm-ddThh:mm:ss.Z| vmx| A100: ConfigDB: Setting vmotion.checkpointSVGASize = "9961472"
yyyy-mm-ddThh:mm:ss.Z| vmx| W115: FILE: FileIO_Lock on '/vmfs/volumes/###########-###########-####-###########/VMNAME/VMNAME.vmx' failed: Lock timed out
yyyy-mm-ddThh:mm:ss.Z| vmx| I125: Msg_Reset:
yyyy-mm-ddThh:mm:ss.Z| vmx| I125: [msg.configdb.open] An error occurred while opening configuration file "/vmfs/volumes/###########-###########-####-###########/VMNAME/VMNAME.vmx": Failed to lock the file.
yyyy-mm-ddThh:mm:ss.Z| vmx| W115: FILE: FileIO_Lock on '/vmfs/volumes/###########-###########-####-###########/VMNAME/VMNAME.vmx' failed: Lock timed out
yyyy-mm-ddThh:mm:ss.Z| vmx| I125: Msg_Reset:
yyyy-mm-ddThh:mm:ss.Z| vmx| I125: [msg.configdb.open] An error occurred while opening configuration file "/vmfs/volumes/###########-###########-####-###########/VMNAME/VMNAME.vmx": Failed to lock the file.
yyyy-mm-ddThh:mm:ss.Z| vmx| I125: ----------------------------------------
yyyy-mm-ddThh:mm:ss.Z| vmx| W115: Migrate: Failed to write out config file.
yyyy-mm-ddThh:mm:ss.Z| vmx| I125: Migrate: Caching migration error message list:
yyyy-mm-ddThh:mm:ss.Z| vmx| I125: [msg.migrate.expired] Timed out waiting for migration start request.

  • /var/run/log/vmkernel.log on the source host reports the following errors:

yyyy-mm-ddThh:mm:ss.Z cpu0:2100106)DLX: 4949: vol 'DATASTORE_NAME', lock at 154738688: [Req mode: 1] Not free:
yyyy-mm-ddThh:mm:ss.Z cpu0:2100106)[type 10c00001 offset 154738688 v 4701, hb offset 3641344 gen 4865, mode 1, owner ###########-###########-####-########### mtime 571270 num 0 gblnum 0 gblgen 0 gblbrk 0] alloc owner 4063232
yyyy-mm-ddThh:mm:ss.Z cpu4:2100106)DLX: 4949: vol 'DATASTORE_NAME', lock at 154738688: [Req mode: 1] Not free:
yyyy-mm-ddThh:mm:ss.Z cpu4:2100106)[type 10c00001 offset 154738688 v 4701, hb offset 3641344 gen 4865, mode 1, owner ###########-###########-####-########### mtime 571270 num 0 gblnum 0 gblgen 0 gblbrk 0] alloc owner 4063232

  • /var/run/log/vmkernel.log on the destination host reports the following errors:

yyyy-mm-ddThh:mm:ss.Z cpu19:2175769)HBX: 6416: 'DATASTORE_NAME': HB at offset 3551232 - Marking HB:
yyyy-mm-ddThh:mm:ss.Z cpu19:2175769)  [HB state abcdef04 offset 3551232 gen 7 stampUS 131655160 uuid ###########-###########-####-########### jrnl  drv 24.82 lockImpl 4 ip ##.###.###.##]
yyyy-mm-ddThh:mm:ss.Z cpu18:2175769)HBX: 6433: 'DATASTORE_NAME': HB at offset 3551232 - Skipping replay as HB is being replayed by another live host:
yyyy-mm-ddThh:mm:ss.Z cpu18:2175769)  [HB state abcdef04 offset 3551232 gen 7 stampUS 131655160 uuid ###########-###########-####-########### jrnl  drv 24.82 lockImpl 4 ip ##.###.###.##]
yyyy-mm-ddThh:mm:ss.Z cpu18:2175769)Res3: 2328: Rank violation threshold reached: cid 0xc1d00002, resType 1, cnum 5 vol DATASTORE_NAME
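
The log signatures above can be checked for with a short script. The following is a minimal sketch, not a VMware tool: the regular expressions are derived only from the excerpts shown in this article, and the category names and the functions scan_lines and scan_file are hypothetical names chosen for illustration. Log wording may differ between ESXi builds, so treat a zero count as inconclusive.

```python
import re
from collections import Counter

# Patterns derived from the log excerpts in this article (assumption:
# the wording is the same on the ESXi build being examined).
SIGNATURES = {
    # vmware.log: the .vmx file could not be locked during migration
    "vmx_lock_timeout": re.compile(
        r"FileIO_Lock on '.*\.vmx' failed: Lock timed out"),
    # vmware.log: migration gave up writing the config file
    "migrate_config_write_failed": re.compile(
        r"Migrate: Failed to write out config file"),
    # vmkernel.log (source host): on-disk lock is held by another owner
    "dlx_lock_not_free": re.compile(
        r"DLX: \d+: vol '[^']*', lock at \d+: .*Not free"),
    # vmkernel.log (destination host): heartbeat replay owned by another host
    "hb_replayed_by_other_host": re.compile(
        r"HBX: \d+: '[^']*': HB at offset \d+ - Skipping replay as HB "
        r"is being replayed by another live host"),
    # vmkernel.log (destination host): resource rank violation
    "rank_violation": re.compile(
        r"Res3: \d+: Rank violation threshold reached"),
}


def scan_lines(lines):
    """Count how many lines match each known signature."""
    counts = Counter()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def scan_file(path):
    """Scan one log file; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as handle:
        return scan_lines(handle)
```

For example, scan_file could be run against a copy of the affected VM's vmware.log and against vmkernel.log from the source and destination hosts. Matches in all three logs are consistent with the symptoms described in this article.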

Environment

VMware vSphere ESXi 7.x

Cause

  • When a host loses storage connectivity, crashes, or reboots abruptly, other hosts in the cluster can replay its journal in parallel, causing VMFS metadata inconsistencies.
  • When a host that is replaying a journal itself loses storage connectivity, crashes, or reboots, the original host whose journal was being replayed can never reclaim its heartbeat, leaving a hung lock.

Resolution

  • Resolve the storage connectivity issues.
  • As a workaround, use Storage vMotion to migrate the VMs to VMFS-5 datastores while the storage issues are being addressed.
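
To pick a target for the workaround, it helps to know which mounted datastores are VMFS-5. The sketch below filters the CSV output of "esxcli --formatter=csv storage filesystem list". It assumes that output has a header row containing the columns Type, Mounted, and VolumeName; those column names are an assumption and should be verified against the actual output on the host. The function name vmfs5_datastores is hypothetical.

```python
import csv
import io


def vmfs5_datastores(csv_text):
    """Return the names of mounted VMFS-5 volumes from a CSV listing.

    Assumes header columns named Type, Mounted, and VolumeName
    (an assumption about the esxcli CSV output; verify on the host).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    names = []
    for row in reader:
        is_vmfs5 = (row.get("Type") or "") == "VMFS-5"
        is_mounted = (row.get("Mounted") or "").lower() == "true"
        if is_vmfs5 and is_mounted:
            names.append(row.get("VolumeName") or "")
    return names
```

The text passed to vmfs5_datastores would be the captured command output. Confirm that a candidate datastore has enough free space and is visible to the relevant hosts before starting the migration.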