DRS Initial Placement fails to distribute virtual machines during a power-on storm

Article ID: 436750

Updated On:

Products

VMware vCenter Server

Issue/Introduction

When many virtual machines (VMs) that are all registered to a single host are powered on in bulk (a "power-on storm"), DRS Initial Placement may fail to distribute the VMs across the cluster at the time of power-on.

As a result, a disproportionately large number of VMs may remain running on the source host for a period of time. While DRS will eventually rebalance the cluster after the storm subsides, this self-recovery process can take anywhere from a few minutes to several dozen minutes, depending on the severity of the imbalance.

Environment

VMware vCenter Server 7.0

VMware vCenter Server 8.0

VMware vCenter 9.0

Cause

This is a known issue. During a rapid sequence of power-on requests originating from the same host, the internal DRS metrics used to evaluate the cost-benefit of relocating a VM before power-on can temporarily become stale.

As a result, DRS determines that migrating the VM before power-on is too "costly" and powers on the VM locally on the source host. Once the power-on storm stops, these metrics recover on their own over time, and DRS then initiates load-balancing migrations to resolve the skew.

Resolution

Broadcom is aware of this issue and is working on a resolution.

To help DRS distribute VMs correctly during bulk operations and to limit temporary cluster imbalance, implement the following measures:

  1. Introduce a Delay: Add a delay between each VM power-on request. A delay of 15 to 30 seconds between power-ons on the same host allows the DRS metrics to remain stable, so that DRS can properly distribute the VMs across the cluster.

  2. Manual Rebalancing: If immediate balance is required after a power-on storm, administrators can manually migrate the "stuck" VMs to other hosts with vMotion. Otherwise, DRS will eventually rebalance the cluster on its own after the metrics recover.
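
The delay described in measure 1 can be sketched in a few lines of Python. This is only an illustrative outline, not a Broadcom-provided script: the function name `staggered_power_on` and the `power_on` callback are hypothetical. The callback stands in for whatever call your existing automation already uses to power on a single VM through vCenter. The 15-second default comes from the lower bound of the range recommended above.

```python
import time
from typing import Callable, Iterable, List


def staggered_power_on(
    vm_names: Iterable[str],
    power_on: Callable[[str], None],
    delay_seconds: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Issue power-on requests one at a time with a pause between them.

    `power_on` is a hypothetical callback supplied by the caller; it should
    submit the power-on request for one VM. `sleep` is injectable so the
    pacing can be checked without actually waiting.
    """
    if delay_seconds < 0:
        raise ValueError("delay_seconds must be non-negative")

    started: List[str] = []
    for index, name in enumerate(vm_names):
        # Pause before every request except the first one.
        if index > 0:
            sleep(delay_seconds)
        power_on(name)
        started.append(name)
    return started
```

For example, `staggered_power_on(["vm-01", "vm-02", "vm-03"], power_on=my_power_on, delay_seconds=20)` would submit three requests with two 20-second pauses in between, where `my_power_on` is a placeholder for your own function. The delay applies to VMs registered to the same host, so a batch that spans several hosts could be grouped by host first.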