Estimate the time required to consolidate virtual machine snapshots

Products

VMware vSphere ESXi

Issue/Introduction

This article provides information for estimating the time required to consolidate snapshots during snapshot removal on ESXi.

This is useful when snapshot consolidation is taking a long time.

When a virtual machine is running without snapshots, the virtual machine runs from and changes are written to the base virtual disk (flat.vmdk). When a snapshot is taken, any changes are written to the snapshot delta/sesparse file (delta.vmdk/sesparse.vmdk). If another snapshot is taken, a second delta/sesparse file is created, and so on.

For more information, see Overview of virtual machine snapshots in vSphere.

When you delete/consolidate a snapshot, the blocks contained in the delta file are written to the parent disk. When you trigger a Delete All snapshots operation or use the consolidate feature introduced in vSphere 5.0, the process writes all the blocks from the entire chain back to the base disk.

For more information on snapshot deletion or consolidation, see:

Notes:

If there is more than one snapshot in the chain and you delete only one snapshot, the process consolidates one delta file into its parent disk, but the same concept applies.
If there is only one snapshot in the chain, clicking Delete or Delete All performs the same operation.

Environment

VMware vSphere ESXi 7.x
VMware vSphere ESXi 8.x

Resolution

Contributing factors

The time required to commit snapshots is directly related to:

The aggregated size of the delta files.
The virtual disk snapshot chain depth (the number of delta files for a given virtual disk).
The total overhead size of the snapshot delta disks. This is directly related to the aggregated size of the deltas. A large number of blocks to analyze significantly increases the number of reads needed to analyze the metadata, which increases the overall consolidation time. For more information, see Creating a snapshot for an ESX/ESXi virtual machine fails with the error: File is larger than maximum file size supported.
The storage array performance including, but not limited to, storage processor performance, infrastructure contention/bottlenecks, hardware acceleration, the number of physical disks, the number of spindles, disk speed, and RAID configuration.
The type of data contained in each block (zeros vs. random data).
The load on the host, which is responsible for resource management and prioritizing tasks.
The disk I/O activity of a powered-on virtual machine, which directly affects how fast the current delta files grow. For example, a database or email server virtual machine may be extremely I/O intensive.

Notes:

The number of blocks that need to be read/written cannot be determined if there is more than one delta file to consolidate. This is because duplicate copies of the same block are possible.
If there is only one snapshot, all blocks of data in the delta file are written to the base disk. When there are multiple delta files, in the worst case all blocks in all snapshots are unique and must all be written to the base disk. The number of reads in this scenario is equal to the number of writes, in addition to the metadata read operations.
If disk consolidation is started while the virtual machine is powered on, an additional delta file is created to track the modified blocks. This file is written to the base disk at the end of the consolidation. However, no additional delta file is required when deleting only one snapshot that is not the current one.
Virtual disk snapshot consolidation is a very I/O intensive task and may require extended periods of heavy reads and writes. Consolidating snapshots during production hours may impact other virtual machines running on the same datastore.

Calculating the time required to consolidate

Use the method in the following article to estimate the amount of time required to consolidate the snapshots:

How to calculate Virtual Machine snapshot consolidation

Starting consolidation and monitoring the throughput on the datastore

Warning: Once the consolidation process has started, it can no longer be stopped.

In this method, you can monitor the throughput (reads and writes in MB/sec) of the LUN that the virtual disk resides on using esxtop. You can then estimate the time based on the aggregated size of the delta files.

1. Obtain the size of the delta files using the datastore browser or by running the command:

   ls -lh /vmfs/volumes/DATASTORE_NAME/VM_NAME | grep -E "delta|sparse"
2. Calculate and make a note of the combined size of the delta files.
3. Identify the device on which the datastore resides. For more information, see Identifying disks when working with VMware ESXi.

   Note: For NFS datastores, this step is not relevant.
4. Monitor LUN I/O throughput using esxtop:
   a. Start esxtop by running the command:

      esxtop
   b. Press u to switch to the disk device view.
   c. To view the entire device name, press Shift + L (uppercase "L") and enter 36.
5. Find the device containing the datastore (identified in step 3) in the list and monitor the MBREAD/s and MBWRTN/s columns.
6. Using the total size from step 2 and the read and write rates from step 5, estimate the total time required.
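
The arithmetic behind this estimate can be sketched as two small shell functions. This is only an illustration, not a VMware-provided tool. The function names sum_delta_mb and estimate_consolidation_minutes are hypothetical. The sketch assumes that the size in bytes is the fifth column of ls -l output (note -l rather than -lh), that the observed rate stays constant, and that every block in the delta files is transferred once, which is the worst case described above.

```shell
# Hypothetical helper: read `ls -l` output on stdin and print the combined
# size, in MB, of the delta/sesparse files. Assumes field 5 is the size in bytes.
sum_delta_mb() {
    awk '/delta|sparse/ { bytes += $5 } END { printf "%.0f\n", bytes / 1048576 }'
}

# Hypothetical helper: print the estimated minutes needed to move SIZE_MB
# at RATE_MB_S (for example, the MBWRTN/s value observed in esxtop).
# Assumes a constant rate and that each delta block is written once.
estimate_consolidation_minutes() {
    size_mb=$1
    rate_mb_s=$2
    awk -v s="$size_mb" -v r="$rate_mb_s" 'BEGIN {
        if (r <= 0) { print "rate must be greater than 0" > "/dev/stderr"; exit 1 }
        printf "%.1f\n", s / r / 60
    }'
}
```

For example, running ls -l /vmfs/volumes/DATASTORE_NAME/VM_NAME | sum_delta_mb prints the combined size, and estimate_consolidation_minutes 51200 40 prints the minutes needed for 50 GB at 40 MB/s. Treat the result as a rough lower bound, because metadata reads and delta growth on a powered-on virtual machine are not included.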

Note: Because the disk consolidation is generally I/O intensive, you may sort by MBREAD/s (press R) or MBWRTN/s (press T) to see the device at the top of the screen.

Alternatively, you can use the performance charts in the vSphere Client while connected to vCenter Server or directly to an ESXi/ESX host to monitor the read/write rate for a given datastore.

Notes:

This measurement is accurate only when the virtual machine being consolidated is the only virtual machine on the datastore that is running on that host. If other virtual machines located on the same device are running on this host, you can only get the aggregate throughput for all virtual machines. To get an accurate value, ensure that the virtual machine is the only one running at the time. Migration using vMotion can help achieve this.
If the delta files are located on a different datastore than the base disk (and thus, on a different device), the reads and writes are issued to/from different devices.
If you are using datastore extents, consolidation performance may be impacted and run time calculation is far more complex because you must also consider the aggregate throughput for all extents. In this case, it is easier to read the throughput at the datastore level in vCenter Server.
You may notice a high read rate with a low write rate at the beginning of the process because the process analyses the metadata for the entire snapshot chain first.

Monitoring throughput for only the snapshot consolidation process

This method provides a better estimation because you can monitor the throughput of the process itself, not the aggregated throughput on the volume.

Limitations:

This method cannot be used with NFS because esxtop does not allow expansion on NFS.
This method cannot be used if the VM is powered off because the process is handled by hostd, not by the VM monitor. Theoretically, it is still possible to find which hostd thread is doing the work, but it is more complex.

To monitor throughput for only the snapshot consolidation process:

1. Start esxtop by running the command:

   esxtop
2. Press Shift + V (uppercase "V") to see only running virtual machines.
3. Find the virtual machine running the consolidation.
4. Press e to expand.
5. Enter the Group World ID (the value in the GID column) and press Enter.
6. Make a note of the World ID (the value in the ID column) of the snapshot consolidation process. The process is called vmx-SnapshotVMX.
7. Press u to display the disk device statistics.
8. Press e to expand and enter the device that the snapshot consolidation process is writing to. For example, the naa.xxx value.

   Note: For a normal VMDK file, the device is the datastore where the flat file is located. For an RDM, the device is the RDM device itself. For a flat VMDK file, you can identify the datastore device ID by running the esxcfg-scsidevs -m command. For an RDM, running the vmkfstools -q command against the pointer file reveals the VML ID, which must be correlated with the output of the ls -l /vmfs/devices/disks/ command to obtain the device ID. For more information, see Identifying disks when working with VMware ESXi.
9. Locate the World ID noted in step 6.

   Note: You may need to sort by MBREAD/s (press R) or MBWRTN/s (press T) to see the process at the top of the screen.
10. Monitor the MBWRTN/s column.
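
Because the MBWRTN/s value fluctuates, it can help to note several readings and average them before estimating. The following is a minimal sketch under stated assumptions: the function name estimate_from_samples is hypothetical, the samples are typed in by hand from the esxtop screen (one value per line), and the estimate assumes that the average rate holds for the whole operation and that the full delta size given as the argument still has to be written.

```shell
# Hypothetical helper: read MBWRTN/s samples (one number per line) on stdin,
# average them, and estimate the minutes needed to write SIZE_MB at that rate.
estimate_from_samples() {
    size_mb=$1
    awk -v s="$size_mb" '
        NF { sum += $1; n++ }            # count every non-blank line as one sample
        END {
            if (n == 0 || sum <= 0) {
                print "no usable samples" > "/dev/stderr"
                exit 1
            }
            avg = sum / n
            printf "avg=%.1f MB/s eta=%.0f min\n", avg, s / avg / 60
        }'
}
```

For example, printf '20\n40\n60\n' | estimate_from_samples 72000 averages three readings and reports the time for roughly 70 GB of delta data. Sampling over a longer period gives a steadier average, since the read-heavy metadata phase at the start can show a low write rate.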

Estimating consolidation time using a test virtual machine

Trying to estimate before actually running the process is complex because it is difficult to recreate an identical context; more specifically, the virtual machine activity and the type of data contained within the deltas.

To estimate the disk consolidation time:

1. Calculate the size of the delta files that are to be consolidated using steps 1 and 2 in the Starting consolidation and monitoring the throughput on the datastore section of this article.
2. Create a test virtual machine (or use a non-critical virtual machine) on the same host and datastore.
3. Take a snapshot and generate random data within the guest (not zeros) using a file copy, for example. Alternatively, you may use a random file generator tool. In Linux, use the dd command with the parameter if=/dev/urandom.

   Note: Do not use if=/dev/zero or if=/dev/random.
4. Check the size of the delta file that has been created (using the method in step 1).
5. Run a consolidation and time the operation.
6. Extrapolate to the size of the deltas to be consolidated, as calculated in step 1.
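
The extrapolation in the last step is a simple proportion, which can be sketched as follows. This is an illustration only: the function name extrapolate_seconds is hypothetical, and the sketch assumes that consolidation time scales linearly with delta size, which is an approximation given the contributing factors listed earlier.

```shell
# Hypothetical helper: scale the time measured on the test virtual machine
# to the size of the real deltas, assuming time grows linearly with size.
# Arguments: TEST_SIZE_MB TEST_SECONDS TARGET_SIZE_MB
extrapolate_seconds() {
    test_size_mb=$1
    test_seconds=$2
    target_size_mb=$3
    awk -v ts="$test_size_mb" -v tt="$test_seconds" -v s="$target_size_mb" 'BEGIN {
        if (ts <= 0) { print "test size must be greater than 0" > "/dev/stderr"; exit 1 }
        printf "%.0f\n", tt * s / ts
    }'
}
```

For example, if a 2048 MB test delta took 90 seconds to consolidate, extrapolate_seconds 2048 90 102400 gives the projected number of seconds for 100 GB of deltas.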

Notes:

This method does not include the delta growth that may occur for the virtual machine where the snapshot consolidation occurs.
This method does not recreate the same type of data contained in the delta files.