Once a host enters maintenance mode, the upgrade workflow creates a stagebootbank ramdisk on the host for temporary storage. This requires at least approximately 300 MB of free capacity in the root resource pool of the host. If the host does not have this free capacity because most of it is reserved by a child resource pool, creation of the stagebootbank ramdisk fails with an insufficient-memory error.
In general, this problem occurs when the sum of the virtual machine reservations within the resource pool was close to the effective capacity of the host before the virtual machines were evacuated from the host during maintenance mode.
To identify this issue, look for the following log snippet in vobd.log:
[GenericCorrelator] 6391139888760us: [vob.user.esximage.install.error] Could not install image profile: Failed to create ramdisk stagebootbank: Errors: No space left on device
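When many hosts are affected, the log check can be scripted. The following is a minimal sketch (the helper name is illustrative) that scans vobd.log lines for this failure signature:

```python
# Minimal sketch: detect the stagebootbank failure signature in vobd.log lines.
SIGNATURE = "Failed to create ramdisk stagebootbank"

def has_stagebootbank_failure(lines):
    """Return the first matching log line, or None if the signature is absent."""
    for line in lines:
        if SIGNATURE in line:
            return line
    return None

# Example using the log line from this article:
sample = ("[GenericCorrelator] 6391139888760us: [vob.user.esximage.install.error] "
          "Could not install image profile: Failed to create ramdisk stagebootbank: "
          "Errors: No space left on device")
print(has_stagebootbank_failure([sample]) is not None)  # True
```

In practice, the lines would come from reading /var/log/vobd.log on the affected host.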
Next, verify whether a resource pool with a reservation still exists on the ESXi host on which the upgrade workflow failed to create the stagebootbank ramdisk, by running the memstats command first against the host system (group 0) and then against the user group (group 4).
In this example, we see that the user group contains "pool0", which still reserves 461516 MB even after the host has entered maintenance mode.
[root@sc2-rdops-vm05-dhcp-165-200:~] memstats -r group-stats -g 0 -s name:parGid:min:max:memSizePeak:eMinPeak:rMinPeak -l 2 -u MB
------------------------------------------------------------------------------------------------------
gid name parGid min max memSizePeak eMinPeak rMinPeak
------------------------------------------------------------------------------------------------------
0 host -1 523759 523759 510312 523759 523944
1 system 0 48467 -1 48128 49079 49260
2 vim 0 0 -1 2640 13747 13747
3 iofilters 0 0 -1 13 25 25
4 user 0 0 -1 460086 461531 462374
------------------------------------------------------------------------------------------------------
[root@sc2-rdops-vm05-dhcp-165-200:~] memstats -r group-stats -g 4 -s name:parGid:min:max:memSizePeak:eMinPeak:rMinPeak -l 2 -u MB
------------------------------------------------------------------------------------------------------
gid name parGid min max memSizePeak eMinPeak rMinPeak
------------------------------------------------------------------------------------------------------
4 user 0 0 -1 460086 461531 462374
61538 pool0 4 461516 -1 460086 461531 462374
------------------------------------------------------------------------------------------------------
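This check can also be automated. The following is a minimal sketch that parses the memstats group-stats output and flags child pools of the "user" group that still hold a reservation; the column positions are an assumption based on the sample output above:

```python
# Minimal sketch: parse `memstats -r group-stats` output and flag child pools
# of the "user" group (gid 4) that still hold a reservation (min > 0).
# Field positions are assumed from the memstats invocation shown above.
SAMPLE = """\
gid name parGid min max memSizePeak eMinPeak rMinPeak
4 user 0 0 -1 460086 461531 462374
61538 pool0 4 461516 -1 460086 461531 462374
"""

def reserved_child_pools(text, user_gid=4):
    """Return (name, min_MB) for child groups of user_gid with a nonzero reservation."""
    pools = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or not fields[0].isdigit():
            continue  # skip header and separator rows
        name, par_gid, min_mb = fields[1], int(fields[2]), int(fields[3])
        if par_gid == user_gid and min_mb > 0:
            pools.append((name, min_mb))
    return pools

print(reserved_child_pools(SAMPLE))  # [('pool0', 461516)]
```

An empty result means no child pool is holding a reservation and the host is not affected by this issue.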
The reason behind this problem is that the DRS resource-settings workflow excludes any host that is entering or has entered maintenance mode. Because of this, DRS does not get a chance to clear the child resource pool from the source host, so the host does not have enough free capacity for the upgrade workflow to create the stagebootbank ramdisk. Once the host exits maintenance mode, the child resource pool is cleaned up.
This is a known issue affecting vCenter Server 7.0 Update 3 and earlier. It will be fixed in a future release.
To work around the issue:
To free up the unreleased resource pool capacity, manually remove the child resource pools from the host after the host has entered maintenance mode.
To remove a child resource pool from the host, run the following commands in the shell of the ESXi host.
[root@sc2-rdops-vm05-dhcp-165-200:~] esxcli system maintenanceMode get
Enabled
[root@sc2-rdops-vm05-dhcp-165-200:~] memstats -r group-stats -g 4 -s name:parGid:min:max:memSizePeak:eMinPeak:rMinPeak -l 2 -u MB
------------------------------------------------------------------------------------------------------
gid name parGid min max memSizePeak eMinPeak rMinPeak
------------------------------------------------------------------------------------------------------
4 user 0 0 -1 460086 461531 462374
61538 pool0 4 461516 -1 460086 461531 462374
------------------------------------------------------------------------------------------------------
# Delete all resource pools from the user group one by one. The following example deletes pool0:
[root@sc2-rdops-vm05-dhcp-165-200:~] localcli --plugin-dir /usr/lib/vmware/esxcli/int sched group delete -g host/user/pool0
# Verify that no pool remains. The output of the following command shows that there is no resource pool left on the host:
[root@sc2-rdops-vm05-dhcp-165-200:~] memstats -r group-stats -g 4 -s name:parGid:min:max:memSizePeak:eMinPeak:rMinPeak -l 2 -u MB
------------------------------------------------------------------------------------------------------
gid name parGid min max memSizePeak eMinPeak rMinPeak
------------------------------------------------------------------------------------------------------
4 user 0 0 -1 460086 461531 462374
------------------------------------------------------------------------------------------------------
Note that the commands above must be executed while the host is in maintenance mode, and that the name of the child resource pool can differ from "pool0". All child resource pools must be removed from the "host/user" group.
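When several child pools exist, the per-pool delete commands can be generated from the memstats output. The following minimal sketch reuses the localcli invocation shown earlier in this article; the output parsing makes the same column-position assumptions as the sample above:

```python
# Minimal sketch: emit one `localcli ... sched group delete` command (the
# invocation shown earlier in this article) per child pool of the "user"
# group (gid 4) found in `memstats -r group-stats -g 4` output.
SAMPLE = """\
gid name parGid min max memSizePeak eMinPeak rMinPeak
4 user 0 0 -1 460086 461531 462374
61538 pool0 4 461516 -1 460086 461531 462374
"""

DELETE_CMD = ("localcli --plugin-dir /usr/lib/vmware/esxcli/int "
              "sched group delete -g host/user/{name}")

def delete_commands(text, user_gid=4):
    """Build the delete command for every child group of the user group."""
    cmds = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # skip header and separator rows
        if int(fields[2]) == user_gid:  # parGid column
            cmds.append(DELETE_CMD.format(name=fields[1]))
    return cmds

for cmd in delete_commands(SAMPLE):
    print(cmd)
```

Review the generated commands before running them on the host, and re-run memstats afterwards to confirm that no pool remains under the user group.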