Virtual machine goes to "Unknown" state during a vMotion task.
search cancel

Virtual machine goes to "Unknown" state during a vMotion task.

book

Article ID: 386823

calendar_today

Updated On:

Products

VMware vSphere ESXi 8.0

Issue/Introduction

  • A few migrations of a VM to another host halt at 19%. The VM is running on the source host and has status unknown on the destination host. 
  • From the vmware logs of the VM /vmfs/volumes/<Datastore-UUID>/<Virtual Machine Name>/vmware.log:

YYYY-MM-DD HH:MM:SS.468Z In(05) vmx - Migrate_SetFailureMsgList: Now in new log file.
YYYY-MM-DD HH:MM:SS.475Z In(05) vmx - Migrate: Caching migration error message list:
YYYY-MM-DD HH:MM:SS.475Z In(05) vmx - [msg.checkpoint.migration.maxSwitchoverTimeExceeded] The migration has exceeded the maximum switchover time of 100 second(s). ESX has preemptively failed the migration to allow the VM to continue running on the source.  To avoid this failure, either increase the maximum allowable switchover time or wait until the VM is performing a less intensive workload.

The VM attempted to continue to run on the source but failed due to file locks:
YYYY-MM-DD HH:MM:SS.475Z In(05) vmx - Migrate: Attempting to continue running on the source.
YYYY-MM-DD HH:MM:SS.476Z In(05) vmx - Checkpoint_Unstun: vm stopped for 100013264 us
YYYY-MM-DD HH:MM:SS
YYYY-MM-DD HH:MM:SS.477Z In(05) vcpu-0 - DISK: OPEN scsi0:0 '/vmfs/volumes/datastore-UUID/vm_name/vm_name.vmdk' persistent R[]
YYYY-MM-DD HH:MM:SS.482Z In(05) vcpu-0 - AIOGNRC: Failed to open '/vmfs/volumes/datastore-UUID/vm_name/I-vm_name-flat.vmdk' : Failed to lock the file (40003) (0x42003).

  • Source host logs:

/var/log/vmkernel.log:
YYYY-MM-DD HH:MM:SS.574Z In(182) vmkernel: cpu6:2130511 opID=8fae8402)World: 12321: VC opID lohfb8jc-283523-auto-62ro-h5:70081399-98-01-da-0ed5 maps to vmkernel opID 8fae8402
YYYY-MM-DD HH:MM:SS.574Z In(182) vmkernel: cpu6:2130511 opID=8fae8402)Config: 761: "SIOControlFlag2" = 1, Old Value: 0, (Status: 0x0)

/var/log/hostd.log:
YYYY-MM-DD HH:MM:SS.575Z In(166) Hostd[2130511]: [Originator@6876 sub=Vcsvc.VMotion opID=lohfb8jc-283523-auto-62ro-h5:70081399-98-01-da-0ed5 sid=52ad70f2 user=vpxuser:x] Completed scheduling PrepareSourceEx [8289784752783297602]
...
YYYY-MM-DD HH:MM:SS.581Z In(166) Hostd[2130279]: [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/datastore-UUID/vm_name/vm_name.vmx] VigorMigrateNotifyCb:: hostlog state changed from emigrating to failure
YYYY-MM-DD HH:MM:SS.581Z Db(167) Hostd[2130279]: [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/datastore-UUID/vm_name/vm_name.vmx] VigorMigrateNotifyCb:: VMotionStatusCb queued
YYYY-MM-DD HH:MM:SS.582Z Db(167) Hostd[2130279]: [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/datastore-UUID/vm_name/vm_name.vmx] VMotionStatusCb [8289784752783297602]: Failed with error [N3Vim5Fault20GenericVmConfigFaultE:0x0000003135354760]

  • Destination host logs:

Checking on the destination we can see that the migration was initiated on that side but hostd fails to connect to the VM due to 'Failed to establish transport connection':

/var/log/hostd.log:
YYYY-MM-DD HH:MM:SS.595Z In(166) Hostd[2104989]: [Originator@6876 sub=Vcsvc.VMotion opID=lohfb8jc-283523-auto-62ro-h5:70081399-98-01-98-7043 sid=52ac3759 user=vpxuser:x] PrepareDestinationEx [8289784752783297602]
...
YYYY-MM-DD HH:MM:SS.613Z In(166) Hostd[2104992]: [Originator@6876 sub=Libs opID=lohfb8jc-283523-auto-62ro-h5:70081399-98-01-82-7048 sid=52ac3759 user=vpxuser:x] VigorTransport_ClientConnectLocal: Connecting to local VM /vmfs/volumes/datastore-UUID/vm_name/vm_name.vmx
...
YYYY-MM-DD HH:MM:SS.617Z In(166) Hostd[2104992]: [Originator@6876 sub=Libs opID=lohfb8jc-283523-auto-62ro-h5:70081399-98-01-82-7048 sid=52ac3759 user=vpxuser:x] /vmfs/volumes/datastore-UUID/vm_name/vm_name.vmx: Failed to connect to path vmx-vigor-fd.
YYYY-MM-DD HH:MM:SS.617Z In(166) Hostd[2104992]: [Originator@6876 sub=Libs opID=lohfb8jc-283523-auto-62ro-h5:70081399-98-01-82-7048 sid=52ac3759 user=vpxuser:x] CnxUtil_ConnectLocal: Failed to find socket vmx-vigor-fd
YYYY-MM-DD HH:MM:SS.617Z In(166) Hostd[2104992]: [Originator@6876 sub=Libs opID=lohfb8jc-283523-auto-62ro-h5:70081399-98-01-82-7048 sid=52ac3759 user=vpxuser:x] VigorTransport_ClientConnectLocal: Failed to connect to VM /vmfs/volumes/datastore-UUID/vm_name/vm_name.vmx
YYYY-MM-DD HH:MM:SS.618Z Db(167) Hostd[2104992]: [Originator@6876 sub=Vigor.Vmsvc.vm:/vmfs/volumes/datastore-UUID/LC-VoO-Reserv/LC-VoO-Reserv.vmx opID=lohfb8jc-283523-auto-62ro-h5:70081399-98-01-82-7048 sid=52ac3759 user=vpxuser:KID\QRPM] Vigor client state callback OFFLINE reason Failed to open a local pipe to a local virtual machine '/vmfs/volumes/347d0fc0-1b8dcdb4/vm_name/vm_name.vmx'.
YYYY-MM-DD HH:MM:SS.618Z Db(167) Hostd[2104735]: --> Failed to establish transport connection.

Environment

VMware vSphere ESXi 8.0 U2.

Cause

Issue occurs due to a potential deadlock situation on the *.vmx (VM configuration). 

Resolution

The issue is fixed in ESXi 8.0U3b.

If the customer is not in a position to patch/upgrade the ESXi host, we have to kill the VM process on the source host, delete all .lck files in the VM's folder, and then restart vpxa and hostd on both hosts. Then note down the location of the .vmx file in the datastore, remove the VM from inventory, and add it back.

Additional Information