Once a host enters maintenance mode, the upgrade workflow creates a stagebootbank ramdisk on the host for temporary storage. This requires at least approximately 300 MB of free capacity in the root resource pool of the host. If the host does not have this free capacity because most of it is reserved by a child resource pool, creation of the stagebootbank ramdisk fails with an insufficient-memory error.
In general, this problem occurs when the sum of the virtual machine reservations within the resource pool was close to the effective capacity of the host before the virtual machines were evacuated from the host during maintenance mode.
To identify this issue, look for the following log snippet in vobd.log:
[GenericCorrelator] 6391139888760us: [vob.user.esximage.install.error] Could not install image profile: Failed to create ramdisk stagebootbank: Errors: No space left on device
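When many hosts are affected, the log check can be scripted. The following is a minimal sketch (the helper name is illustrative) that scans vobd.log lines for this failure signature:

```python
# Minimal sketch: detect the stagebootbank failure signature in vobd.log lines.
SIGNATURE = "Failed to create ramdisk stagebootbank"

def has_stagebootbank_failure(lines):
    """Return the first matching log line, or None if the signature is absent."""
    for line in lines:
        if SIGNATURE in line:
            return line
    return None

# Example using the log line from this article:
sample = ("[GenericCorrelator] 6391139888760us: [vob.user.esximage.install.error] "
          "Could not install image profile: Failed to create ramdisk stagebootbank: "
          "Errors: No space left on device")
print(has_stagebootbank_failure([sample]) is not None)  # True
```

In practice, the lines would come from reading /var/log/vobd.log on the affected host.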
Next, verify whether a resource pool with a reservation still exists on the ESXi host on which the upgrade workflow failed to create the stagebootbank ramdisk, by running the memstats command first against the host system (group 0) and then against the user group (group 4).
In this example, we see that the user group contains "pool0", which still reserves 461516 MB even after the host has entered maintenance mode.
[root@sc2-rdops-vm05-dhcp-165-200:~] memstats -r group-stats -g 0 -s name:parGid:min:max:memSizePeak:eMinPeak:rMinPeak -l 2 -u MB
------------------------------------------------------------------------------------------------------
gid name parGid min max memSizePeak eMinPeak rMinPeak
------------------------------------------------------------------------------------------------------
0 host -1 523759 523759 510312 523759 523944
1 system 0 48467 -1 48128 49079 49260
2 vim 0 0 -1 2640 13747 13747
3 iofilters 0 0 -1 13 25 25
4 user 0 0 -1 460086 461531 462374
------------------------------------------------------------------------------------------------------
[root@sc2-rdops-vm05-dhcp-165-200:~] memstats -r group-stats -g 4 -s name:parGid:min:max:memSizePeak:eMinPeak:rMinPeak -l 2 -u MB
------------------------------------------------------------------------------------------------------
gid name parGid min max memSizePeak eMinPeak rMinPeak
------------------------------------------------------------------------------------------------------
4 user 0 0 -1 460086 461531 462374
61538 pool0 4 461516 -1 460086 461531 462374
------------------------------------------------------------------------------------------------------
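This check can also be automated. The following is a minimal sketch that parses the memstats group-stats output and flags child pools of the "user" group that still hold a reservation; the column positions are an assumption based on the sample output above:

```python
# Minimal sketch: parse `memstats -r group-stats` output and flag child pools
# of the "user" group (gid 4) that still hold a reservation (min > 0).
# Field positions are assumed from the memstats invocation shown above.
SAMPLE = """\
gid name parGid min max memSizePeak eMinPeak rMinPeak
4 user 0 0 -1 460086 461531 462374
61538 pool0 4 461516 -1 460086 461531 462374
"""

def reserved_child_pools(text, user_gid=4):
    """Return (name, min_MB) for child groups of user_gid with a nonzero reservation."""
    pools = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or not fields[0].isdigit():
            continue  # skip header and separator rows
        name, par_gid, min_mb = fields[1], int(fields[2]), int(fields[3])
        if par_gid == user_gid and min_mb > 0:
            pools.append((name, min_mb))
    return pools

print(reserved_child_pools(SAMPLE))  # [('pool0', 461516)]
```

An empty result means no child pool is holding a reservation and the host is not affected by this issue.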
The reason behind this problem is that the DRS resource-settings workflow excludes any host that is entering or has entered maintenance mode. Because of this, DRS does not get a chance to clear the child resource pool from the source host, so the host does not have enough free capacity for the upgrade workflow to create the stagebootbank ramdisk. Once the host exits maintenance mode, the child resource pool is cleaned up.
This is a known issue affecting vCenter Server 7.0 Update 3 and earlier. It will be fixed in a future release.
To work around the issue:
To free up the unreleased resource pool capacity, manually remove the child resource pools from the host after the host has entered maintenance mode.
To remove a child resource pool from the host, run the following commands in the shell of the ESXi host.
[root@sc2-rdops-vm05-dhcp-165-200:~] esxcli system maintenanceMode get
Enabled
[root@sc2-rdops-vm05-dhcp-165-200:~] memstats -r group-stats -g 4 -s name:parGid:min:max:memSizePeak:eMinPeak:rMinPeak -l 2 -u MB
------------------------------------------------------------------------------------------------------
gid name parGid min max memSizePeak eMinPeak rMinPeak
------------------------------------------------------------------------------------------------------
4 user 0 0 -1 460086 461531 462374
61538 pool0 4 461516 -1 460086 461531 462374
------------------------------------------------------------------------------------------------------
# Delete all resource pools from the user group one by one. The following example deletes pool0:
[root@sc2-rdops-vm05-dhcp-165-200:~] localcli --plugin-dir /usr/lib/vmware/esxcli/int sched group delete -g host/user/pool0
# Verify that no pool remains. The output of the following command shows that there is no resource pool left on the host:
[root@sc2-rdops-vm05-dhcp-165-200:~] memstats -r group-stats -g 4 -s name:parGid:min:max:memSizePeak:eMinPeak:rMinPeak -l 2 -u MB
------------------------------------------------------------------------------------------------------
gid name parGid min max memSizePeak eMinPeak rMinPeak
------------------------------------------------------------------------------------------------------
4 user 0 0 -1 460086 461531 462374
------------------------------------------------------------------------------------------------------
Note that the commands above must be executed while the host is in maintenance mode, and that the name of the child resource pool can differ from "pool0". All child resource pools must be removed from the "host/user" group.
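When several child pools exist, the per-pool delete commands can be generated from the memstats output. The following minimal sketch reuses the localcli invocation shown earlier in this article; the output parsing makes the same column-position assumptions as the sample above:

```python
# Minimal sketch: emit one `localcli ... sched group delete` command (the
# invocation shown earlier in this article) per child pool of the "user"
# group (gid 4) found in `memstats -r group-stats -g 4` output.
SAMPLE = """\
gid name parGid min max memSizePeak eMinPeak rMinPeak
4 user 0 0 -1 460086 461531 462374
61538 pool0 4 461516 -1 460086 461531 462374
"""

DELETE_CMD = ("localcli --plugin-dir /usr/lib/vmware/esxcli/int "
              "sched group delete -g host/user/{name}")

def delete_commands(text, user_gid=4):
    """Build the delete command for every child group of the user group."""
    cmds = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # skip header and separator rows
        if int(fields[2]) == user_gid:  # parGid column
            cmds.append(DELETE_CMD.format(name=fields[1]))
    return cmds

for cmd in delete_commands(SAMPLE):
    print(cmd)
```

Review the generated commands before running them on the host, and re-run memstats afterwards to confirm that no pool remains under the user group.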