Unexpected VM restart during snapshot consolidation due to scsiCmdSlab buffer exhaustion in ESXi

Products

VMware vSphere ESXi

Issue/Introduction

A virtual machine is unexpectedly restarted during an attempt to consolidate its existing snapshot(s)
The snapshot consolidation task fails

The current log for the virtual machine (/vmfs/volumes/<datastore>/<vm_name>/vmware.log) has entries indicating a memory exhaustion issue, similar to the following example:

####-##-##T##:##:##.###Z In(05) vcpu-0 - Msg_Post: Error
####-##-##T##:##:##.###Z In(05) vcpu-0 - [vob.fssvec.Lookup.file.failed] File system specific implementation of Lookup[file] failed
####-##-##T##:##:##.###Z In(05) vcpu-0 - [vob.fssvec.Lookup.file.failed] File system specific implementation of Lookup[file] failed
####-##-##T##:##:##.###Z In(05) vcpu-0 - [vob.fssvec.Lookup.file.failed] File system specific implementation of Lookup[file] failed
####-##-##T##:##:##.###Z In(05) vcpu-0 - [msg.literal] Cannot allocate memory
####-##-##T##:##:##.###Z In(05) vcpu-0 - [msg.disk.noBackEnd] Cannot open the disk '<snapshot>.vmdk' or one of the snapshot disks it depends on.
####-##-##T##:##:##.###Z In(05) vcpu-0 - [msg.checkpoint.continuesync.error] An operation required the virtual machine to quiesce and the virtual machine was unable to continue running.
####-##-##T##:##:##.###Z In(05) vcpu-0 - ----------------------------------------
####-##-##T##:##:##.###Z In(05) vcpu-0 - MsgIsAnswered: Using builtin default 'OK' as the answer for 'msg.checkpoint.continuesync.error'
####-##-##T##:##:##.###Z In(05) vcpu-0 - SnapshotVMX_ConsolidateCancel: Requesting snapshot consolidate cancel.
####-##-##T##:##:##.###Z In(05) vcpu-0 - Msg_Post: Error
####-##-##T##:##:##.###Z In(05) vcpu-0 - [msg.poweroff.commitOn] Performing disk cleanup. Cannot power off.
####-##-##T##:##:##.###Z In(05) vcpu-0 - ----------------------------------------

Around the same time, /var/run/log/vmkernel.log reports that scsiCmdSlab ran out of memory:

####-##-##T##:##:##.###Z Wa(180) vmkwarning: cpu43:2447718)WARNING: scsiCmdSlab out of memory
####-##-##T##:##:##.###Z Wa(180) vmkwarning: cpu49:2447717)WARNING: scsiCmdSlab out of memory
####-##-##T##:##:##.###Z In(182) vmkernel: cpu43:2447718)ScsiFds: 767: Allocate command from childToken failed:Out of memory resID:2447718, originSN:0, originHandle:0x0
####-##-##T##:##:##.###Z In(182) vmkernel: cpu49:2447717)ScsiFds: 767: Allocate command from childToken failed:Out of memory resID:2447717, originSN:0, originHandle:0x0
####-##-##T##:##:##.###Z Wa(180) vmkwarning: cpu49:2447717)WARNING: ScsiDeviceIO: 233: Out of Memory... Trying from emergency heap
####-##-##T##:##:##.###Z Wa(180) vmkwarning: cpu39:2447721)WARNING: ScsiDeviceIO: 233: Out of Memory... Trying from emergency heap
####-##-##T##:##:##.###Z Wa(180) vmkwarning: cpu49:2447717)WARNING: ScsiDeviceIO: 6519: Failed to allocate memory for I/O to device naa.###
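
To check whether a given host and virtual machine show both sets of signatures, the two logs can be searched for the messages quoted above. The following is a minimal sketch in Python, not an official tool. The patterns are copied from the excerpts in this article, while the function names `matching_lines` and `scan_file` are hypothetical and chosen only for illustration.

```python
import re

# Signatures copied from the vmkernel.log excerpt in this article.
VMKERNEL_PATTERNS = [
    re.compile(r"WARNING: scsiCmdSlab out of memory"),
    re.compile(r"ScsiFds: \d+: Allocate command from childToken failed:Out of memory"),
    re.compile(r"ScsiDeviceIO: \d+: Failed to allocate memory for I/O to device"),
]

# Signatures copied from the vmware.log excerpt in this article.
VMWARE_LOG_PATTERNS = [
    re.compile(r"\[msg\.literal\] Cannot allocate memory"),
    re.compile(r"\[msg\.checkpoint\.continuesync\.error\]"),
    re.compile(r"SnapshotVMX_ConsolidateCancel"),
]


def matching_lines(lines, patterns):
    """Return every line (without trailing newline) that matches any pattern."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(pattern.search(line) for pattern in patterns)
    ]


def scan_file(path, patterns):
    """Scan a log file on disk; undecodable bytes are replaced, not fatal."""
    with open(path, "r", errors="replace") as handle:
        return matching_lines(handle, patterns)
```

For example, `scan_file("/var/run/log/vmkernel.log", VMKERNEL_PATTERNS)` returns the matching vmkernel entries, which can then be compared by timestamp with the matches from the virtual machine's vmware.log.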

Environment

VMware vSphere ESXi 8.0.x

Cause

During snapshot consolidation or removal, ESXi sends automatic unmap commands to the datastore. By default, these commands are sent at a rate of 100 MB/s.
If the datastore connection cannot keep up with these commands, they queue up in scsiCmdSlab, and the amount of memory it allocates grows.
However, scsiCmdSlab has a limit on how much memory it can use. If too many commands are queued, scsiCmdSlab reports memory exhaustion, which leads to the symptoms described above.
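
The effect of the send rate on a bounded queue can be illustrated with a simple toy model. This is only a sketch of the general reasoning, not a description of ESXi internals: the function name, the units, and all numbers used with it are hypothetical and do not reflect the actual scsiCmdSlab size or real unmap command rates.

```python
def seconds_until_exhaustion(arrival_per_s, drain_per_s, capacity):
    """Toy model of a bounded queue.

    arrival_per_s: commands queued per second (hypothetical unit)
    drain_per_s:   commands the datastore completes per second
    capacity:      number of commands the queue can hold before it is full

    Returns the time in seconds until the queue is full, or None if the
    datastore keeps up and the queue never fills.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    backlog_growth = arrival_per_s - drain_per_s
    if backlog_growth <= 0:
        return None  # the queue drains at least as fast as it fills
    return capacity / backlog_growth
```

In this model, a queue fills only while commands arrive faster than they complete, so lowering the arrival rate below the drain rate prevents exhaustion altogether rather than merely delaying it.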

Resolution

To prevent this issue, reduce the number of unmap commands sent to the datastore by lowering the automatic space reclamation rate to 10 MB/s.
To do this, follow the steps outlined in How to throttle the unmap requests on Datastore ( Space Reclamation ).
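
When the change has to be prepared for several datastores, the command lines can be generated rather than typed by hand. The sketch below only builds the command strings and does not run anything. It assumes the `esxcli storage vmfs reclaim config` namespace with the `--volume-label`, `--reclaim-method` and `--reclaim-bandwidth` options; verify the exact syntax against the referenced article for your ESXi build before use. The function name `build_reclaim_commands` and the datastore label in the usage note are hypothetical.

```python
import shlex


def build_reclaim_commands(datastore_label, bandwidth_mb_s=10):
    """Build (show, set) esxcli argument lists for one datastore.

    Assumption: a fixed reclaim method with a bandwidth in MB/s, as
    described in the referenced throttling article. Nothing is executed.
    """
    if bandwidth_mb_s <= 0:
        raise ValueError("bandwidth_mb_s must be positive")
    base = ["esxcli", "storage", "vmfs", "reclaim", "config"]
    label = f"--volume-label={datastore_label}"
    show_cmd = base + ["get", label]
    set_cmd = base + [
        "set",
        label,
        "--reclaim-method=fixed",
        f"--reclaim-bandwidth={bandwidth_mb_s}",
    ]
    return show_cmd, set_cmd


def as_shell(command):
    """Render an argument list as a safely quoted shell command line."""
    return shlex.join(command)
```

For example, `as_shell(build_reclaim_commands("datastore1")[1])` yields a single command line that can be reviewed and then run in an ESXi shell session. Displaying the current configuration with the `get` variant first makes it easy to record the previous setting.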

Additional Information

For information on how to monitor the automatic unmap I/O, see Monitor automatic unmap I/O issued by ESXi.