Symptoms:
Upgrading hosts that run VMs with large configurations (more than 256 vCPUs and 6 TB of VM memory) using VUM/vLCM may time out if the host cannot enter maintenance mode within the default time limit set by VUM/vLCM. As a result, the VUM/vLCM remediation operation fails.
The following test scenario upgrades a host running a VM with a large configuration (480 vCPUs and 10 TB of VM memory) in a two-host cluster (host1 and host2):
- On host1, a VM with guest OS SLES15SP0, configured with 480 vCPUs and 10 TB of memory, was powered on.
- A memory/CPU stress workload was run to consume 80% of the resources.
- The upgrade was started using VUM remediation. Host1 began entering maintenance mode, but the maintenance mode task failed with “Operation timed out” after 30 minutes.
- The remediation process failed with “Cannot enter maintenance mode” after 30 minutes.
- Host1 entered maintenance mode after its VM migrated to host2. The VM migration took more than 2 hours to complete.
- When the remediation was retried after the migration completed, it succeeded and the upgrade finished.
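
The arithmetic behind this failure can be shown with a minimal Python sketch. It is illustrative only and does not call any VUM/vLCM or vSphere API. The thresholds (256 vCPUs, 6 TB) and the 30-minute timeout come from the description above. The names VmConfig, is_large_vm and remediation_would_time_out are hypothetical helpers introduced for this example.

```python
from dataclasses import dataclass

# Thresholds from the symptom description (hypothetical constant names).
LARGE_VCPU_THRESHOLD = 256
LARGE_MEMORY_TB_THRESHOLD = 6
# Maintenance mode timeout observed in the test scenario, in minutes.
MAINTENANCE_MODE_TIMEOUT_MIN = 30


@dataclass
class VmConfig:
    """Minimal description of a VM for this illustration."""
    name: str
    vcpus: int
    memory_tb: float


def is_large_vm(vm: VmConfig) -> bool:
    """True if the VM exceeds both thresholds named in the symptom."""
    return (vm.vcpus > LARGE_VCPU_THRESHOLD
            and vm.memory_tb > LARGE_MEMORY_TB_THRESHOLD)


def remediation_would_time_out(
    migration_minutes: float,
    timeout_minutes: float = MAINTENANCE_MODE_TIMEOUT_MIN,
) -> bool:
    """True if evacuating the host takes longer than the timeout."""
    return migration_minutes > timeout_minutes


# Values from the test scenario: 480 vCPUs, 10 TB, migration of 2+ hours.
test_vm = VmConfig(name="sles15sp0-vm", vcpus=480, memory_tb=10.0)
print(is_large_vm(test_vm))
print(remediation_would_time_out(migration_minutes=120))
```

In the test scenario the migration needed more than 120 minutes against a 30-minute limit, so the first remediation attempt could not succeed. It passed only on retry, once the host had already been evacuated.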