Third-party I/O Filter Upgrade Task Triggers Simultaneous Maintenance Mode on All ESXi Hosts in a Cluster
Article ID: 433837

Updated On:

Products

VMware vCenter Server

Issue/Introduction

When performing a cluster-level remediation to upgrade third-party I/O filter VIBs (e.g., EMC bootbank) on ESXi hosts, VMware ESX Agent Manager (EAM) may place all hosts in the cluster into maintenance mode simultaneously instead of following a rolling upgrade sequence.
This can result in a cluster-wide outage or significant workload disruption.


In /var/log/vmware/eam/eam.log, the completed API call reports an issue stating that the VIB operation requires the host to be put in maintenance mode:

YYYY-MM-DDT##:##:##.###Z | INFO | vlsi | LocalizationFilter.java | ### | API COMPLETE: HostVMAgency(ID:########-####-####-####-############).getRuntime[opId=#########, sessionId=########]. Result:
eam.EamObject.RuntimeInfo {
   entity = 'Agency:########-####-####-####-############:null',
   status = 'yellow',
   issue = (eam.issue.Issue) [
      (eam.issue.VibRequiresHostInMaintenanceMode) {
         time = YYYY-MM-DD ##:##:##,###,
         key = ##,
         description = 'VIB operation requires the host to be put in maintenance mode',
         agency = 'Agency:########-####-####-####-############:null',
         agencyName = 'IoFilter-partner_bootbank_Partner Appliance',
         solutionId = 'VirtualCenter',
         solutionName = 'VirtualCenter',
         host = 'HostSystem:host-#######:########-####-####-####-############',
         hostName = 'hostname.####.com',
         agent = 'Agent:########-####-####-####-############:null',
         agentName = '########-####-####-####-############'


DRS then recommends to EAM which hosts to put in maintenance mode.

/var/log/vmware/vpxd/vpxd.log shows multiple vim.ClusterComputeResource.enterMaintenanceMode tasks initiated for different hosts under the same opID.

YYYY-MM-DDTHH:MM:SS.####+##:## info vpxd[#####] [Originator@#### sub=vpxLro opID=#######] [VpxLRO] -- BEGIN task-####### -- domain-ID -- vim.ClusterComputeResource.enterMaintenanceMode -- uuid
YYYY-MM-DDTHH:MM:SS.####+##:## info vpxd[#####] [Originator@#### sub=cdrsPlmt opID=#######] XlbRunStatus: Off
YYYY-MM-DDTHH:MM:SS.####+##:## info vpxd[#####] [Originator@#### sub=drmLogger opID=#######] Host: [vim.HostSystem:host-#####,hostname.####.com], powered-off VMs evacTime: ####, powered-on VMs evacTime: ####, Mgmt VMs: ####
YYYY-MM-DDTHH:MM:SS.####+##:## info vpxd[#####] [Originator@#### sub=drmLogger opID=#######] Host: [vim.HostSystem:host-#####,hostname.####.com], powered-off VMs evacTime: ####, powered-on VMs evacTime: ####, Mgmt VMs: ####
YYYY-MM-DDTHH:MM:SS.####+##:## info vpxd[#####] [Originator@#### sub=drmLogger opID=#######] Host: [vim.HostSystem:host-#####,hostname.####.com], powered-off VMs evacTime: ####, powered-on VMs evacTime: ####, Mgmt VMs: ####
YYYY-MM-DDTHH:MM:SS.####+##:## info vpxd[#####] [Originator@#### sub=drmLogger opID=#######] Host: [vim.HostSystem:host-#####,hostname.####.com], powered-off VMs evacTime: ####, powered-on VMs evacTime: ####, Mgmt VMs: ####
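To confirm that several hosts were evaluated under one opID, the drmLogger lines can be grouped by opID. The following is a minimal sketch that assumes the line layout shown in the excerpt above; the function name, the sample values, and the regular expression are illustrative and not part of any VMware tooling.

```python
import re
from collections import defaultdict

# Matches drmLogger "Host:" lines in the layout shown in the vpxd.log excerpt.
# The pattern is an assumption derived from that excerpt only.
DRM_HOST = re.compile(
    r"sub=drmLogger opID=(?P<opid>[^\]]+)\] "
    r"Host: \[vim\.HostSystem:(?P<host>host-[^,\]]+),"
)


def hosts_by_opid(lines):
    """Group the host IDs seen in drmLogger lines by their opID."""
    grouped = defaultdict(set)
    for line in lines:
        match = DRM_HOST.search(line)
        if match:
            grouped[match.group("opid")].add(match.group("host"))
    return dict(grouped)


# Hypothetical sample lines; timestamps, IDs, and hostnames are made up.
SAMPLE = [
    "2024-01-01T00:00:00.000+00:00 info vpxd[12345] [Originator@6876 sub=drmLogger opID=abc-1] "
    "Host: [vim.HostSystem:host-101,esx01.example.com], powered-off VMs evacTime: 0, "
    "powered-on VMs evacTime: 10, Mgmt VMs: 0",
    "2024-01-01T00:00:00.100+00:00 info vpxd[12345] [Originator@6876 sub=drmLogger opID=abc-1] "
    "Host: [vim.HostSystem:host-102,esx02.example.com], powered-off VMs evacTime: 0, "
    "powered-on VMs evacTime: 12, Mgmt VMs: 0",
    "2024-01-01T00:05:00.000+00:00 info vpxd[12345] [Originator@6876 sub=drmLogger opID=def-2] "
    "Host: [vim.HostSystem:host-103,esx03.example.com], powered-off VMs evacTime: 0, "
    "powered-on VMs evacTime: 8, Mgmt VMs: 0",
]

if __name__ == "__main__":
    for opid, hosts in hosts_by_opid(SAMPLE).items():
        if len(hosts) > 1:
            print(f"opID {opid}: {len(hosts)} hosts evaluated: {sorted(hosts)}")
```

To run it against a real log, pass the open file object for /var/log/vmware/vpxd/vpxd.log to hosts_by_opid instead of SAMPLE. An opID that maps to many hosts is consistent with the behavior described in this article.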

DRS recommends many hosts at once, causing vMotions and DRS faults, which force EAM to cancel the enter maintenance mode (EMM) tasks:


YYYY-MM-DDTHH:MM:SS |  WARN | host-ID | ChangeHostMaintenanceModeJob.java | 80 | Host: VcHostSystem(ID: host-#####) failed to enter maintenance mode.com.vmware.eam.async.JobCancelledException
........
YYYY-MM-DDTHH:MM:SS.488Z |  INFO | host-ID | AuditedJob.java | 106 | JOB CANCELED: [#####] ChangeHostMaintenanceModeJob(ManagedObjectReference: type = HostSystem, value = host-#####, serverGuid = ###################, true)

EAM eventually fails the tasks with timeout faults, raising the following issue:


YYYY-MM-DDTHH:MM:SS |  INFO | host-###### | IssueHandlerBase.java | 116 | Updating issues:
New issues:
 [
eam.issue.VibCannotPutHostInMaintenanceMode {
   time = YYYY-MM-DDTHH:MM:SS,
   key = 85,
   description = <unset>,
   agency = 'Agency:##########################:null',
   agencyName = 'IoFilter-EMC_bootbank_emcsplitter',
   solutionId = 'VirtualCenter',
   solutionName = 'VirtualCenter',
   host = 'HostSystem:host-ID#############',
   hostName = '#########',
   agent = 'Agent:###################:null',
   agentName = '#######################',
}
]
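A quick way to check whether an eam.log file shows this symptom is to count the two issue types quoted above. This is a rough sketch that only looks for the issue class names as they appear in the excerpts; the function name and the sample lines are illustrative.

```python
import re
from collections import Counter

# Issue class names exactly as they appear in the eam.log excerpts above.
ISSUE_TYPES = (
    "VibRequiresHostInMaintenanceMode",
    "VibCannotPutHostInMaintenanceMode",
)
ISSUE = re.compile(r"eam\.issue\.(" + "|".join(ISSUE_TYPES) + r")\b")


def count_issues(lines):
    """Count occurrences of the two maintenance-mode issue types."""
    counts = Counter()
    for line in lines:
        for name in ISSUE.findall(line):
            counts[name] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical fragment in the style of the excerpts above.
    sample = [
        "   issue = (eam.issue.Issue) [",
        "      (eam.issue.VibRequiresHostInMaintenanceMode) {",
        "eam.issue.VibCannotPutHostInMaintenanceMode {",
    ]
    for name, count in count_issues(sample).items():
        print(f"{name}: {count}")
```

Many VibRequiresHostInMaintenanceMode entries followed by VibCannotPutHostInMaintenanceMode entries for the same agency match the sequence described in this article.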

Environment

vCenter Server 7.x
vCenter Server 8.x

Cause

EAM consults Distributed Resource Scheduler (DRS) to recommend which hosts to place in maintenance mode. By default, DRS attempts to maximize the number of hosts that can be in maintenance mode simultaneously while allowing cluster utilization of up to 100%. This overcommits the remaining active hosts in the cluster, leading to maintenance mode failures, task cancellations, and potential service disruption.
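The effect of the utilization target can be illustrated with a deliberately simplified model. The sketch below is not the DRS algorithm: it assumes identical hosts, a single resource dimension, and that the target is a cap on demand divided by the capacity of the hosts that stay active. All names and numbers are hypothetical.

```python
def max_hosts_in_maintenance(num_hosts, host_capacity, total_demand, ratio_percent=100):
    """Largest number of hosts that can leave the cluster in this toy model.

    The remaining hosts must satisfy:
        total_demand <= (ratio_percent / 100) * remaining_hosts * host_capacity
    Integer arithmetic is used to avoid floating-point rounding.
    """
    usable_per_host = host_capacity * ratio_percent  # scaled by 100
    # Ceiling division: hosts that must stay active to carry the demand.
    hosts_needed = -(-(total_demand * 100) // usable_per_host)
    return max(0, num_hosts - hosts_needed)


if __name__ == "__main__":
    # Hypothetical cluster: 8 hosts of 100 units each, 400 units of demand.
    print(max_hosts_in_maintenance(8, 100, 400, ratio_percent=100))  # 4
    print(max_hosts_in_maintenance(8, 100, 400, ratio_percent=80))   # 3
```

In this model, a 100% target lets half the cluster leave at once and leaves the remaining hosts with no headroom, while an 80% target keeps one more host active.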

 

Resolution

You can control cluster utilization during this maintenance event either by putting one host at a time into maintenance mode or by configuring DRS not to overcommit hosts in the cluster, which is done by choosing a conservative value for demandCapacityRatio.

To work around this issue:

Option 1: (Safer Option)

Access the EAM MOB by navigating to: https://<VC_IP>/eam/mob?vmodl=1
Note: It is critical to include the query parameter vmodl=1 to access internal API methods.

  1. Log in using an administrator account or a user with the required privileges: EAM.View and EAM.Modify. Typically, only administrator-level users have these permissions.
  2. Locate the method named: setMaintenanceModePolicy.
  3. Click Invoke Method and provide one of the following inputs:
     "singleHost" to enable single-host maintenance mode
     "multipleHosts" to revert to the default behavior
  4. Verify the configuration by checking the log file located at /var/log/vmware/eam/eam.log and confirming that it contains the expected entry indicating the policy change.

Option 2: (More Optimized Option)

Setting drs.demandCapacityRatio to 80 provides a more optimized balance of cluster resources. However, with this setting multiple hosts may still enter Maintenance Mode (MM) simultaneously, with a correspondingly higher number of vMotion operations.
To modify the configuration, set the following property in /etc/vmware-eam/eam-vim.properties:

  drs.demandCapacityRatio=80
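The property change for Option 2 can be scripted so that it is repeatable. This is a minimal sketch, assuming the file uses plain key=value lines; the helper names are hypothetical, and the article does not state whether any further step is needed for the change to take effect.

```python
import shutil
from pathlib import Path


def set_property(text, key, value):
    """Return properties text with key set to value (replace or append)."""
    lines = text.splitlines()
    new_line = f"{key}={value}"
    replaced = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Leave comment lines untouched.
        if stripped.startswith(("#", "!")) or "=" not in stripped:
            continue
        if stripped.split("=", 1)[0].strip() == key:
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    return "\n".join(lines) + "\n"


def update_file(path, key, value):
    """Back up the file next to itself, then write the updated property."""
    path = Path(path)
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    path.write_text(set_property(path.read_text(), key, value))


# Example call on the vCenter Server appliance (not executed here):
# update_file("/etc/vmware-eam/eam-vim.properties", "drs.demandCapacityRatio", 80)
```

Keeping the text transformation separate from the file write makes it possible to review the result before touching the appliance, and the .bak copy allows the original file to be restored.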

Option 2 is the more optimized option. Option 1, by contrast, limits Maintenance Mode to a single host at a time, which makes the process safer and more controlled, although it may take longer to complete.