DRS is failing to migrate Virtual Machines from a host experiencing high CPU utilization.

Article ID: 441179

Products

VMware vCenter Server

Issue/Introduction

  • Outgoing Distributed Resource Scheduler (DRS) migrations fail to initiate from a specific ESXi host within a cluster, leaving that host with sustained high CPU or memory utilization.
  • In vpxd.log, the load-balancing iteration shows the migration being aborted because of a negative expected gain (GainSec) combined with an abnormally high estimated vMotion time (vMotionSec).
    Log location: /var/log/vmware/vpxd/vpxd.log

YYYY-MM-DD verbose vpxd[07163] [Originator@6876 sub=cdrsPlmt opID=CdrsLoadBalancer-750944ac] Recommend the best host for [vim.VirtualMachine:]
YYYY-MM-DD verbose vpxd[07163] [Originator@6876 sub=cdrsPlmt opID=CdrsLoadBalancer-750944ac] Expected gain on : Rate: 30, GainSec: -29389, vMotionRate: 1, PlacementType: 1, vMotionSec: 29970

  • One of the NICs assigned to the vMotion switch reports a Link Status of Down.

    To check, run esxcli network nic list or esxcfg-nics -l on the ESXi host:

    Name    PCI Device    Driver   Admin Status  Link Status  Speed  Duplex  MAC Address        MTU   Description
    ------  ------------  -------  ------------  -----------  -----  ------  -----------------  ----  -----------
    vmnic0  #:#:#:#:#:#    ntg3     Up            Up            1000  Full    #:#:#:#:#:#        1500  Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
    vmnic1  #:#:#:#:#:#    ntg3     Up            Up            1000  Full    #:#:#:#:#:#        1500  Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
    vmnic2  #:#:#:#:#:#    bnxtnet  Up            Up           25000  Full    #:#:#:#:#:#        1500  Broadcom NetXtreme E-Series Advanced Dual-port 25Gb SFP28 Ethernet OCP 3.0 Adapter
    vmnic3  #:#:#:#:#:#    bnxtnet  Up            Down             0  Half    #:#:#:#:#:#        1500  Broadcom NetXtreme E-Series Advanced Dual-port 25Gb SFP28 Ethernet OCP 3.0 Adapter
    vmnic4  #:#:#:#:#:#    bnxtnet  Up            Up           25000  Full    #:#:#:#:#:#        1500  Broadcom BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller
    vmnic5  #:#:#:#:#:#    bnxtnet  Up            Up           25000  Full    #:#:#:#:#:#        1500  Broadcom BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller
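
The log symptom can also be confirmed without reading the file by hand. The following is a minimal sketch that extracts GainSec and vMotionSec from cdrsPlmt "Expected gain" entries and prints only those with a negative GainSec. It assumes the entry format shown in the vpxd.log excerpt above; the function name negative_gain_entries is hypothetical and is not part of any VMware tooling.

```shell
# Print GainSec and vMotionSec for every cdrsPlmt "Expected gain" entry
# whose GainSec is negative. Log text is read from stdin.
# Assumption: entries follow the format shown in the vpxd.log excerpt.
negative_gain_entries() {
  sed -n '/sub=cdrsPlmt.*Expected gain/s/.*GainSec: \(-[0-9][0-9]*\),.*vMotionSec: \([0-9][0-9]*\).*/GainSec=\1 vMotionSec=\2/p'
}

# Example (run on the vCenter Server appliance):
#   negative_gain_entries < /var/log/vmware/vpxd/vpxd.log
```

Entries with a zero or positive GainSec produce no output, so an empty result suggests that this particular symptom is not present in the log being checked.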

Environment

VMware vSphere 

Cause

This issue occurs when a physical network adapter (vmnic) assigned to the virtual switch that carries vMotion traffic has a physical link failure (Link Status Down). The internal DRS calculation treats any offline NIC as having zero bandwidth, which incorrectly zeroes out the aggregate vMotion bandwidth for the entire switch even though the other uplink is up and working. As a result, the estimated time required to perform a migration (vMotionSec) is drastically inflated. The DRS cost-benefit analysis then determines that the cost of moving the virtual machine is too high (a negative GainSec), and the automated migration is aborted.

Resolution

To restore normal DRS migration functionality, perform the following steps:

  1. Identify the affected vmnic reporting a "Down" link status via the vSphere Client or ESXi command line.

  2. Remove the problematic vmnic from the active uplinks of the associated Distributed Virtual Switch (DVS) or Standard Virtual Switch (VSS) on the affected ESXi host.

  3. Once the offline adapter is removed from the configuration, DRS will recalculate the aggregate bandwidth using only the healthy, active uplinks. The GainSec calculations will return to positive values, and migrations will proceed normally.

  4. Concurrently, inspect the physical infrastructure (cabling, upstream physical switch ports, and host network hardware) to determine and rectify the root cause of the physical link failure.
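
Steps 1 and 2 can be sketched from the ESXi command line as follows. This is an illustration under stated assumptions, not a vendor-supplied script: the helper names down_vmnics and remove_standard_uplink are hypothetical, the vSwitch name must be supplied by the operator, and the removal command applies only to a Standard Virtual Switch. For a Distributed Virtual Switch, adjust the uplink assignment in the vSphere Client instead.

```shell
# Step 1: list vmnics whose Admin Status is Up but Link Status is Down.
# Reads `esxcli network nic list` output on stdin
# (columns: Name, PCI Device, Driver, Admin Status, Link Status, ...).
down_vmnics() {
  awk '$1 ~ /^vmnic[0-9]+$/ && $4 == "Up" && $5 == "Down" { print $1 }'
}

# Step 2 (standard vSwitch only): print the removal command for review.
# The command is executed only when APPLY=1 is set in the environment.
remove_standard_uplink() {
  uplink=$1    # for example, the vmnic reported by down_vmnics
  vswitch=$2   # operator-supplied vSwitch name (hypothetical: vSwitch1)
  cmd="esxcli network vswitch standard uplink remove --uplink-name=$uplink --vswitch-name=$vswitch"
  if [ "${APPLY:-0}" = "1" ]; then
    $cmd
  else
    echo "$cmd"
  fi
}

# Example (run on the affected ESXi host):
#   esxcli network nic list | down_vmnics
#   remove_standard_uplink vmnic3 vSwitch1           # prints the command only
#   APPLY=1 remove_standard_uplink vmnic3 vSwitch1   # runs it
```

Printing the command by default leaves room to confirm that the adapter really belongs to the vMotion switch, and that at least one healthy uplink remains, before any change is made.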

Additional Information

For more details on vmnic link status, refer to the article "Setting the admin and link state up or down for a vmnic interface on ESXi".