HA does not fail over a virtual machine when Storage vMotion of the virtual machine is in progress

Products

VMware vCenter Server

Issue/Introduction

This article describes a specific issue. If you experience all of the symptoms listed below, consult the sections that follow. If you experience some but not all of these symptoms, your issue is not related to this article. Search the Knowledge Base for your symptoms, or open a support request.

Symptoms:

High Availability (HA) does not fail over a virtual machine if Storage vMotion is in progress for that virtual machine
In the vpxd logs of vCenter Server, you see entries similar to:

vpxd-327.log:2012-01-24T07:17:20.884Z [02652 error 'VmProv' opID=AC12180D-0000134C-d7-45] error while tracking VMotion progress (vmodl.fault.HostCommunication)
vpxd-327.log:2012-01-24T07:17:22.243Z [02652 error 'VmProv' opID=AC12180D-0000134C-d7-45] [VMOTION_RECOVER] Failed to initiate VMotion on hosts: vmodl.fault.HostCommunication
vpxd-327.log:2012-01-24T07:17:22.243Z [02652 verbose 'vmprovvpxdTxnManager' opID=AC12180D-0000134C-d7-45] [VMOTION_RECOVER] VpxdTxnManager received request to mark VM ID vm-464 with state 2, src URL ds:///vmfs/volumes/4c51f5f2-6251c5b5-9b55-002219348b1a/SwingBench_Client_XP_1/SwingBench_Client_XP.vmx, dest URL ds:///vmfs/volumes/4c84767c-43c38bdc-a705-0024e850729a/SwingBench_Client_XP_2/SwingBench_Client_XP.vmx
vpxd-327.log:2012-01-24T07:17:23.102Z [02652 info 'VmProv' opID=AC12180D-0000134C-d7-45] [WorkflowXAction] Starting workflow rollback
.
.
vpxd-330.log:2012-01-24T07:22:15.554Z [02652 error 'VmProv' opID=AC12180D-0000134C-d7-45] [VMOTION_RECOVER] Failed to complete vmotion on hosts: vmodl.fault.HostCommunication
vpxd-331.log:2012-01-24T07:22:58.724Z [02652 info 'VmProv' opID=AC12180D-0000134C-d7-45] [WorkflowXAction] Caught exception vmodl.fault.HostCommunication while undoing action vpx.vmprov.CreateDestinationVm
vpxd-331.log:2012-01-24T07:23:19.957Z [02652 error 'DAS' opID=AC12180D-0000134C-d7-45] [VpxdDas::PostMigrateCallback] MarkSVmotionDone failed on host [vim.HostSystem:host-102,w1-fi040.eng.vmware.com] for VM /vmfs/volumes/4c51f5f2-6251c5b5-9b55-002219348b1a/SwingBench_Client_XP_1/SwingBench_Client_XP.vmx : class Vmomi::Fault::HostCommunication::Exception(vmodl.fault.HostCommunication). Skip
vpxd-331.log:2012-01-24T07:23:21.785Z [02652 info 'commonvpxLro' opID=AC12180D-0000134C-d7-45] [VpxLRO] -- FINISH task-internal-810605 -- -- VmprovWorkflow --
vpxd-331.log:2012-01-24T07:23:21.785Z [02652 info 'Default' opID=AC12180D-0000134C-d7-45] [VpxLRO] -- ERROR task-internal-810605 -- -- VmprovWorkflow: vmodl.fault.HostCommunication:
In the /var/log/fdm.log file of the cluster Fault Domain Manager (FDM) master, you see entries similar to:

2012-01-24T07:21:14.911Z [7935DB90 verbose 'Cluster' opID=host-30:3-53-SWI-8b9851f8] [ClusterDatastore::CheckIfPowerOffFileExistsWork] Checking for /vmfs/volumes/4c51f5f2-6251c5b5-9b55-002219348b1a/SwingBench_Client_XP_1/SwingBench_Client_XP.vmx
2012-01-24T07:21:14.916Z [7935DB90 verbose 'Cluster' opID=host-30:3-53-SWI-8b9851f8] [ClusterDatastore::CheckIfPowerOffFileExistsWork] /vmfs/volumes/4c51f5f2-6251c5b5-9b55-002219348b1a/SwingBench_Client_XP_1/SwingBench_Client_XP.vmx not found
2012-01-24T07:21:14.917Z [7935DB90 info 'Execution' opID=host-30:3-53-SWI-8b9851f8] [FailoverAction::PowerOffFileCheckCompletion] Aborting failover. vm /vmfs/volumes/4c51f5f2-6251c5b5-9b55-002219348b1a/SwingBench_Client_XP_1/SwingBench_Client_XP.vmx cleanly powered off and no power-off file. Assuming user power off
The virtual machine is in a disconnected state and does not power on correctly after the HA event
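
The log symptoms above can be checked with a short script instead of by eye. The following is a minimal Python sketch, not a VMware tool: it assumes the logs have been copied to a workstation as plain text, and the function names and signature strings are illustrative choices taken only from the sample entries in this article. Other builds may word these messages differently.

```python
import re

# Substrings copied from the sample vpxd entries in this article.
# Assumption: these three messages are enough to flag the failed
# Storage vMotion recovery; they are not an official signature list.
VPXD_SIGNATURES = (
    "[VMOTION_RECOVER] Failed to initiate VMotion on hosts",
    "[VMOTION_RECOVER] Failed to complete vmotion on hosts",
    "MarkSVmotionDone failed on host",
)

# Pattern based on the sample fdm.log entry where FDM aborts the failover
# because it assumes the virtual machine was powered off by a user.
FDM_ABORT_PATTERN = re.compile(
    r"Aborting failover\. vm (?P<vmx>.+?\.vmx) cleanly powered off "
    r"and no power-off file\. Assuming user power off"
)


def find_vpxd_matches(lines):
    """Return the vpxd log lines that contain any signature substring."""
    return [
        line.rstrip("\r\n")
        for line in lines
        if any(signature in line for signature in VPXD_SIGNATURES)
    ]


def find_aborted_failovers(lines):
    """Return the .vmx paths for which FDM logged an aborted failover."""
    paths = []
    for line in lines:
        match = FDM_ABORT_PATTERN.search(line)
        if match:
            paths.append(match.group("vmx"))
    return paths
```

Both functions accept any iterable of strings, so an open file object can be passed directly, for example `find_aborted_failovers(open("fdm.log"))` for a local copy of the log. A match in both logs for the same .vmx path is consistent with the symptoms described here, but it does not replace checking the state of the virtual machine in the inventory.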

Environment

VMware vCenter Server 5.0.x

Cause

There is a short window between the completion of the Storage vMotion and the point at which HA stops protecting the virtual machine's old path and starts protecting its new path. If the host fails within this window, HA does not restart the virtual machine, even after the host is up and running again.

Resolution

This is a known issue in vCenter Server 5.0.

To work around this issue:

Click Home and then click the Datastores and Datastore Clusters view.
Click the source datastore in the left pane.
Click the Virtual Machines tab.
Right-click the virtual machine that is in a disconnected state and click Remove from Inventory.
Right-click the destination datastore in the left pane and click Browse Datastore.
Navigate to the virtual machine location.
Right-click the .vmx file (configuration file) of the virtual machine and click Add to Inventory.
Power on the virtual machine.
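
To find the source and destination locations used in the steps above, the VpxdTxnManager entry in the vpxd log can be parsed for its src URL and dest URL fields. The following is a minimal Python sketch based only on the sample entry shown in the Symptoms section. The function names are hypothetical, the line format is assumed to match that sample, and the meaning of the state value is not described in this article, so it is returned unchanged.

```python
import re

# Pattern derived from the sample "VpxdTxnManager received request" entry.
# Assumption: other vpxd builds log the same field order and wording.
TXN_PATTERN = re.compile(
    r"mark VM ID (?P<vm_id>\S+) with state (?P<state>\d+), "
    r"src URL (?P<src>ds://.+?\.vmx), dest URL (?P<dest>ds://.+?\.vmx)"
)

# Splits a ds:///vmfs/volumes/<volume-id>/<path> URL into its two parts.
VOLUME_PATTERN = re.compile(r"^ds:///vmfs/volumes/(?P<volume>[^/]+)/(?P<path>.+)$")


def parse_txn_line(line):
    """Return vm_id, state, src and dest from a VpxdTxnManager line, or None."""
    match = TXN_PATTERN.search(line)
    return match.groupdict() if match else None


def split_datastore_url(url):
    """Return (volume id, path inside the volume) for a ds:/// URL, or None."""
    match = VOLUME_PATTERN.match(url)
    if match is None:
        return None
    return match.group("volume"), match.group("path")
```

For the sample entry, `parse_txn_line` yields the VM ID vm-464 together with both .vmx URLs, and `split_datastore_url` separates the destination URL into the volume identifier and the folder path to browse to. Mapping the volume identifier to the datastore name shown in the vSphere Client is outside the scope of this sketch.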

If the issue persists, file a support request with VMware Support and note this Knowledge Base article ID in the problem description. For more information, see Filing a Support Request in Customer Connect (2006985) or How to Submit a Support Request.
