When many virtual machines (VMs) that are all registered to a single host are powered on in bulk (a "power-on storm"), DRS Initial Placement may fail to distribute the VMs across the cluster at power-on.
As a result, a disproportionately large number of VMs may keep running on the source host for some time. DRS eventually rebalances the cluster after the storm subsides, but this self-recovery can take anywhere from a few minutes to tens of minutes, depending on the severity of the imbalance.
VMware vCenter Server 7.0
VMware vCenter Server 8.0
VMware vCenter 9.0
This is a known issue. During a rapid sequence of power-on requests originating from the same host, the internal DRS metrics used to evaluate the cost-benefit of relocating a VM before power-on can temporarily be stale.
As a result, DRS determines that migrating the VM before power-on is too "costly" and powers on the VM locally on the source host. Once the power-on storm stops, these metrics recover on their own over time, and DRS eventually initiates load-balancing migrations to resolve the skew.
Broadcom is aware of this issue and is working on a resolution.
To ensure that DRS correctly distributes VMs during bulk operations and to avoid a temporary cluster imbalance, implement the following measures:
Introduce a Delay: Add a delay between each VM power-on request. A delay of 15 to 30 seconds between power-ons on the same host lets the DRS metrics stay up to date, so that DRS can properly distribute the VMs across the cluster.
Manual Rebalancing: If immediate balance is required after a power-on storm, administrators can manually migrate the "stuck" VMs to other hosts with vMotion. Otherwise, the cluster rebalances itself once the metrics recover.
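The delay workaround above could be scripted along the following lines. This is a minimal sketch, not a supported tool: `staggered_power_on` is a hypothetical helper, and `power_on` is a placeholder for whatever call your automation already uses to power on a single VM by name. The 20-second default is simply a value inside the 15 to 30 second range given above.

```python
import time
from typing import Callable, Iterable, List


def staggered_power_on(
    vm_names: Iterable[str],
    power_on: Callable[[str], None],
    delay_seconds: float = 20.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Issue power-on requests one at a time, pausing between them.

    power_on is a caller-supplied function (an assumption of this sketch)
    that sends the power-on request for a single VM.
    """
    started: List[str] = []
    for index, name in enumerate(vm_names):
        # Pause before every request except the first, so that consecutive
        # power-ons on the same host are spaced delay_seconds apart.
        if index > 0:
            sleep(delay_seconds)
        power_on(name)
        started.append(name)
    return started
```

For example, `staggered_power_on(["vm-01", "vm-02"], power_on=my_power_on)` would send the request for `vm-01`, wait 20 seconds, and then send the request for `vm-02`, where `my_power_on` stands in for your own function. The `sleep` parameter exists only so that the pause can be replaced in a dry run.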