Symptoms:
Shortly after reverting to a snapshot on a vCenter Server, DRS begins migrating all virtual machines to a single host in one or more clusters.
This issue occurs when a cluster's Fault Domain ID changes after the snapshot is taken. The ID can change when a cluster is reconfigured for HA or when inconsistencies are detected in the FDM metadata. When this happens, DRS considers the primary FDM host to be the only healthy host in the cluster and begins a series of mandatory migrations to it from the other hosts in the environment.
This behavior has also been observed with other HA configuration-related failures, such as failed FDM VIB installations after an upgrade.
It can also be caused by placing a majority of the hosts in a cluster into maintenance mode, and it persists even if the maintenance mode tasks for those hosts are cancelled after the migrations have started.
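To confirm whether a Fault Domain ID change is involved, the IDs recorded before the snapshot revert can be compared with the current ones. The following is a minimal sketch only: it assumes the IDs have already been collected by some other means into two plain dictionaries keyed by cluster name. The function name and input shape are hypothetical and do not correspond to any vSphere API.

```python
def changed_fault_domains(before, after):
    """Return the names of clusters whose Fault Domain ID differs.

    `before` and `after` are hypothetical dicts mapping a cluster name
    to its Fault Domain ID string, captured at two points in time.
    Clusters missing from `after` are ignored rather than reported.
    """
    changed = []
    for cluster, old_id in before.items():
        new_id = after.get(cluster)
        if new_id is not None and new_id != old_id:
            changed.append(cluster)
    return sorted(changed)


if __name__ == "__main__":
    # Example values are made up for illustration.
    snapshot_ids = {"Cluster-A": "id-1", "Cluster-B": "id-2"}
    current_ids = {"Cluster-A": "id-1", "Cluster-B": "id-9"}
    print(changed_fault_domains(snapshot_ids, current_ids))
```

Any cluster returned by this comparison is a candidate for the migration behavior described above.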
vSphere HA actions
Because communication between the HA agents running on the ESX hosts is disrupted, the HA agents on the good hosts (the one or two hosts) view the unhealthy hosts as Dead (failed). As a result, they react to that failure and attempt to fail over the VMs from the unhealthy hosts to the good hosts. These failover attempts fail because the VMs are still running on the unhealthy hosts: the VMX files are locked, so a good host cannot power on the VM. The following error is reported for each VM running on the unhealthy hosts:
vSphere HA virtual machine failover failed
While these errors are not harmful to the VMs, there is currently no way to stop these failover attempts.
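To gauge how many VMs are affected, the failover errors can be counted per VM from an exported event list. The sketch below assumes a hypothetical export format in which each line holds a VM name and the event message separated by a tab. Only the error string itself comes from this article; the format and function name are illustrative.

```python
from collections import Counter

# Error text as reported by vSphere HA for each affected VM.
FAILOVER_FAILED = "vSphere HA virtual machine failover failed"


def count_failover_failures(lines):
    """Count failover-failed events per VM.

    Each line is assumed (hypothetically) to look like
    "<vm name>\t<event message>". Lines without a tab are skipped.
    """
    counts = Counter()
    for line in lines:
        vm, sep, message = line.rstrip("\n").partition("\t")
        if sep and FAILOVER_FAILED in message:
            counts[vm] += 1
    return counts


if __name__ == "__main__":
    # Sample lines are made up for illustration.
    sample = [
        "vm-01\tvSphere HA virtual machine failover failed",
        "vm-02\tVirtual machine powered on",
        "vm-01\tvSphere HA virtual machine failover failed",
    ]
    for vm, n in count_failover_failures(sample).items():
        print(vm, n)
```

A VM that appears repeatedly in the result is still running on an unhealthy host and is being retried by the good hosts.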