"ENTER MAINTENANCE DRYRUN CHECK" Precheck fails with "Error: Error during enter MAINTENANCE check due to InsufficientResourcesFault"

Products

VMware Cloud Foundation

Issue/Introduction

Symptoms:

The "ENTER MAINTENANCE DRYRUN CHECK" precheck fails with "Error: Error during enter MAINTENANCE check due to InsufficientResourcesFault".
In lcm.log, you see entries similar to the following:

2020-07-07T00:03:20.472+0000 ERROR [0000000000000000,0000,precheckId=180134f2-9c3f-414c-b536-4d3540700f70,resourceType=NSX,resourceId=af69be13-9695-4235-8259-19ae819dee24] [c.v.e.s.l.c.v.vsphere.VsphereUtils,pool-2-thread-449] Error during enter MAINTENANCE check due to InsufficientResourcesFault { "_msg": "Insufficient resources.", "_faultMsg": [ { "key": "com.vmware.cdrs.maintenancemode.clusterLoadViolated", "arg": [ { "key": "threshold", "value": 80 }, { "key": "clustload", "value": 92 }, { "key": "resource", "value": "memory" } ], "message": "Host cannot enter maintenance mode since the resulting cluster memory load (92%) exceeds the tolerence threshold (80%)." } ], "stackTrace": [], "suppressedExceptions": [] }

The precheck result may also show a message similar to the following:

Error description: the virtual machine is pinned to a host: Insufficient resources.
Impact: High: Do not perform upgrade without addressing this issue.
Remediation: This category is for errors that were not due to general exceptions. Check for errors in error object, check LCM logs –InsuffcientResourceFault: Check the HA settings in the VC UI. See if the error is due to HA configuration KB: https://kb.vmware.com/s/article/2005073 Or This issue occurs when the CPU resource reservations of the virtual machine exceed the available capacity on the target host. KB :https://kb.vmware.com/s/article/1031306

In lcm.log, you see messages similar to the following:

Error during enter MAINTENANCE check due to InsufficientResourcesFault
Host cannot enter maintenance mode since the resulting cluster memory load (81%) exceeds the tolerence threshold (80%)
Host cannot enter maintenance mode since the resulting cluster cpu load (89%) exceeds the tolerence threshold

2020-09-15T20:30:29.355+0000 ERROR [0000000000000000,0000,precheckId=2722bf5e-9a63-4459-8560-9599cf7c2145,resourceType=ESX,resourceId=c71c0021-9fb8-11e8-a910-1dd898ec6280] [c.v.e.s.l.c.v.vsphere.VsphereUtils,pool-2-thread-64] Error during enter MAINTENANCE check due to InsufficientResourcesFault { "_msg": "Insufficient resources.", "_faultMsg": [ { "key": "com.vmware.cdrs.maintenancemode.clusterLoadViolated", "arg": [ { "key": "threshold", "value": 80 }, { "key": "clustload", "value": 81 }, { "key": "resource", "value": "memory" } ], "message": "Host cannot enter maintenance mode since the resulting cluster memory load (81%) exceeds the tolerence threshold (80%)." } ], "stackTrace": [], "suppressedExceptions": [] }
2020-09-15T20:30:29.355+0000 WARN [0000000000000000,0000,precheckId=2722bf5e-9a63-4459-8560-9599cf7c2145,resourceType=ESX,resourceId=c71c0021-9fb8-11e8-a910-1dd898ec6280] [c.v.v.v.c.h.i.HttpConfigurationCompilerBase$ConnectionMonitorThreadBase,pool-2-thread-64] Shutting down the connection monitor.
2020-09-15T20:30:29.355+0000 WARN [0000000000000000,0000] [c.v.v.v.c.h.i.HttpConfigurationCompilerBase$ConnectionMonitorThreadBase,VLSI-client-connection-monitor-254] Interrupted, no more connection pool cleanups will be performed.

Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
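
When lcm.log contains many of these entries, it can help to pull out just the resource, the projected cluster load, and the threshold. The following is a minimal Python sketch that does this for a single log line. It assumes the JSON payload follows the marker text exactly as in the excerpts above; the function name parse_emm_fault and the abbreviated sample line are illustrative only and are not part of any VMware tool.

```python
import json

# Marker text that precedes the JSON payload in the example log entries.
MARKER = "Error during enter MAINTENANCE check due to InsufficientResourcesFault"


def parse_emm_fault(line):
    """Return (resource, cluster load %, threshold %) tuples found in one log line."""
    _, found, payload = line.partition(MARKER)
    start = payload.find("{")
    if not found or start == -1:
        return []
    try:
        # raw_decode tolerates any text that follows the JSON object.
        fault, _ = json.JSONDecoder().raw_decode(payload[start:])
    except json.JSONDecodeError:
        return []
    results = []
    for msg in fault.get("_faultMsg", []):
        args = {a["key"]: a["value"] for a in msg.get("arg", [])}
        if "resource" in args:
            results.append((args["resource"], args.get("clustload"), args.get("threshold")))
    return results


# Sample based on the first excerpt above; the bracketed context is abbreviated.
sample = (
    '2020-07-07T00:03:20.472+0000 ERROR [...] '
    'Error during enter MAINTENANCE check due to InsufficientResourcesFault '
    '{ "_msg": "Insufficient resources.", "_faultMsg": [ { "key": '
    '"com.vmware.cdrs.maintenancemode.clusterLoadViolated", "arg": [ '
    '{ "key": "threshold", "value": 80 }, { "key": "clustload", "value": 92 }, '
    '{ "key": "resource", "value": "memory" } ], "message": "Host cannot enter '
    'maintenance mode since the resulting cluster memory load (92%) exceeds the '
    'tolerence threshold (80%)." } ], "stackTrace": [], "suppressedExceptions": [] }'
)

for resource, load, threshold in parse_emm_fault(sample):
    print(f"{resource}: projected load {load}% exceeds threshold {threshold}%")
```

Lines that do not contain the marker, or whose payload is not valid JSON, return an empty list, so the function can be applied to every line of the log.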

Environment

VMware Cloud Foundation 3.10.x
VMware Cloud Foundation 4.0.x
VMware Cloud Foundation 4.1
VMware Cloud Foundation 4.2.x

Resolution

Starting with VCF 4.3, flags are available to suppress the dry-run enter maintenance mode (EMM) check failures so that the upgrade can proceed despite this error.

1. Add the following lines to the end of /opt/vmware/vcf/lcm/lcm-app/conf/application-prod.properties:

lcm.nsxt.suppress.dry.run.emm.check=true
lcm.esx.suppress.dry.run.emm.check.failures=true

2. Restart the LCM service using the following command:
systemctl restart lcm
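
If the property change needs to be repeatable, for example when the same fix is applied to several environments, the first step can be scripted so that the flags are not appended twice. The following is a minimal Python sketch under stated assumptions: a Python 3 interpreter is available where it runs, the file has been backed up first, and the helper names missing_flag_lines and append_flags are hypothetical. Only the two property names and the file path come from this article.

```python
from pathlib import Path

# Property file and flags as listed in the Resolution steps.
PROPS_FILE = "/opt/vmware/vcf/lcm/lcm-app/conf/application-prod.properties"
FLAGS = {
    "lcm.nsxt.suppress.dry.run.emm.check": "true",
    "lcm.esx.suppress.dry.run.emm.check.failures": "true",
}


def missing_flag_lines(existing_text, flags=FLAGS):
    """Return 'key=value' lines for flags whose key is not already set in the text."""
    present = set()
    for raw in existing_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines ('#' or '!').
        if line and not line.startswith(("#", "!")) and "=" in line:
            present.add(line.split("=", 1)[0].strip())
    return [f"{key}={value}" for key, value in flags.items() if key not in present]


def append_flags(path=PROPS_FILE, flags=FLAGS):
    """Append any missing flags to the end of the file and return the lines added."""
    target = Path(path)
    text = target.read_text()
    lines = missing_flag_lines(text, flags)
    if lines:
        # Start on a fresh line if the file does not end with a newline.
        prefix = "" if (not text or text.endswith("\n")) else "\n"
        with target.open("a") as handle:
            handle.write(prefix + "\n".join(lines) + "\n")
    return lines
```

Calling append_flags() and then restarting the LCM service corresponds to the two steps above. A key that is already present with a different value is left untouched by this sketch, so check the file manually in that case.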

Workaround:
Cluster utilization can be verified in vCenter Server (VC) as follows:

Log in to VC
Navigate to the cluster under "Hosts and Clusters"
Under "Monitor" tab, look for "vSphere DRS"
Verify "CPU Utilization"
Verify "Memory Utilization".

Then perform either of the following steps:
1) Shut down VMs that are not critical for operation during the upgrade. This reduces the memory and CPU demand on the cluster.
2) Disable HA and HA Admission Control. This removes the memory constraints.

Then retry the precheck.
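
To estimate how much load must be freed before retrying, the following Python sketch works through the arithmetic. It rests on a simplifying assumption that is not stated in this article: that the projected cluster load is the current usage divided by the capacity remaining after one host is removed. The actual calculation may include other factors, so treat the result as a rough guide. The function names and the example figures are hypothetical.

```python
def projected_load_pct(used, total_capacity, host_capacity):
    """Cluster load (%) if one host's capacity is removed, under the stated assumption."""
    remaining = total_capacity - host_capacity
    if remaining <= 0:
        raise ValueError("no capacity would remain in the cluster")
    return 100.0 * used / remaining


def amount_to_free(used, total_capacity, host_capacity, threshold_pct=80):
    """Usage that must be freed so the projected load does not exceed the threshold."""
    allowed = (total_capacity - host_capacity) * threshold_pct / 100.0
    return max(0.0, used - allowed)


# Hypothetical cluster: 4 hosts with 512 GB memory each, 1400 GB in use.
load = projected_load_pct(used=1400, total_capacity=2048, host_capacity=512)
free = amount_to_free(used=1400, total_capacity=2048, host_capacity=512)
print(f"Projected memory load: {load:.1f}%, free at least {free:.1f} GB")
```

In this hypothetical case the cluster is below 80% with all four hosts in service, but exceeds the threshold once one host is taken out, which matches the pattern reported in the log messages. The same functions apply to CPU if the values are given in a consistent unit such as MHz.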

Additional Information

Impact/Risks:
None