DRS migrates all VMs to a single ESXi host after reverting vCenter Server to a snapshot

Article ID: 318199

Updated On:

Products

VMware vCenter Server

Issue/Introduction

Symptoms:
Shortly after reverting to a snapshot on a vCenter Server, DRS begins migrating all virtual machines to a single host in one or more clusters.

Environment

VMware vCenter Server 7.0.x

Cause

This issue occurs when a cluster's Fault Domain ID changes after the snapshot is taken. The ID can change when a cluster is reconfigured for HA or if inconsistencies are detected in the Fault Domain Manager (FDM) metadata. When this happens, DRS considers the primary FDM host to be the only healthy host in the cluster and begins a series of mandatory migrations to it from the other hosts in the environment.

This behavior has also been observed with other HA configuration-related failures, such as failed FDM VIB installations after an upgrade.

The issue can also be triggered by placing a majority of the hosts in a cluster into maintenance mode, even if the maintenance mode tasks are cancelled after the migrations have started.

Resolution

In vCenter Server 7.0 U3L and 8.0 U2, a 10-minute migration delay has been introduced to help alleviate these issues when HA agents require extra time to reconfigure. A more permanent solution is being developed for situations where the agents never recover on their own.

Workaround:
To stop the automatic migrations, change the DRS Automation Level from Fully Automated to Manual or Partially Automated.
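
For environments managed programmatically, the following is a minimal sketch of the same workaround using the pyVmomi SDK; this is not part of the article's original guidance, and the vCenter hostname, credentials, and cluster name are placeholders you would need to replace:

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

def find_cluster(content, name):
    # Walk the inventory and return the first cluster matching the given name.
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    try:
        return next(c for c in view.view if c.name == name)
    finally:
        view.Destroy()

ctx = ssl._create_unverified_context()  # lab use only; validate certificates in production
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)
try:
    content = si.RetrieveContent()
    cluster = find_cluster(content, "Cluster-01")
    # Switch DRS from Fully Automated to Manual so no further automatic
    # migrations are carried out while HA recovers.
    drs_config = vim.cluster.DrsConfigInfo(
        enabled=True,
        defaultVmBehavior=vim.cluster.DrsConfigInfo.DrsBehavior.manual)
    spec = vim.cluster.ConfigSpecEx(drsConfig=drs_config)
    WaitForTask(cluster.ReconfigureComputeResource_Task(spec, modify=True))
finally:
    Disconnect(si)

The equivalent change can also be made in the vSphere Client by editing the cluster's vSphere DRS settings.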

If vCenter Server must be reverted to a snapshot that predates later HA configuration changes, disable and re-enable HA on the affected clusters immediately after reverting to the snapshot, as in the sketch below.
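
The following is a minimal sketch, again assuming pyVmomi and placeholder connection details, that disables and then re-enables HA on every cluster in the inventory:

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

def set_ha(cluster, enabled):
    # Reconfigure only the cluster's vSphere HA (DAS) enabled flag.
    das_config = vim.cluster.DasConfigInfo(enabled=enabled)
    spec = vim.cluster.ConfigSpecEx(dasConfig=das_config)
    WaitForTask(cluster.ReconfigureComputeResource_Task(spec, modify=True))

ctx = ssl._create_unverified_context()  # lab use only
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    for cluster in view.view:
        # Disable HA, then re-enable it so the FDM agents reconfigure cleanly.
        set_ha(cluster, False)
        set_ha(cluster, True)
    view.Destroy()
finally:
    Disconnect(si)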

To avoid this issue, do not configure or reconfigure an HA-enabled cluster after taking a snapshot of vCenter Server.

Additional Information

vSphere HA actions

Due to the disruption in communication between the HA agents running on the ESXi hosts, the HA agents on the healthy hosts (the one or two remaining hosts) view the unhealthy hosts as Dead (failed). As a result, they react to that failure and attempt to fail over the VMs from the unhealthy hosts to the healthy hosts. However, because the VMs are still running on the unhealthy hosts, these failover attempts fail: the healthy host cannot power on a VM whose VMX files are still locked. The failure error reported for each VM running on the set of unhealthy hosts is:

vSphere HA virtual machine failover failed

While these errors are not harmful to the VMs, there is currently no way to stop these failover attempts.
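
To confirm which VMs are generating these errors, a minimal sketch along the following lines (assuming pyVmomi and placeholder credentials) lists recent events whose message contains the failover-failed text:

import ssl
from datetime import datetime, timedelta, timezone
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab use only
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)
try:
    # Look back one hour; QueryEvents returns at most 1000 matching events.
    begin = datetime.now(timezone.utc) - timedelta(hours=1)
    spec = vim.event.EventFilterSpec(
        time=vim.event.EventFilterSpec.ByTime(beginTime=begin))
    for event in si.content.eventManager.QueryEvents(spec):
        if "failover failed" in (event.fullFormattedMessage or ""):
            print(event.createdTime, event.fullFormattedMessage)
finally:
    Disconnect(si)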