Diagnostics for VMware Cloud Foundation: Snapshot Dashboard

Products

VCF Operations

Issue/Introduction

The Snapshot Dashboard reports in-progress snapshot deletion or consolidation tasks that have been running for more than four hours. For each such task, a drill-down page lists the datastores on which the VM has disks and provides a “check latency” dialog for each datastore. Each “check latency” dialog provides three performance charts for its datastore. This information helps you assess whether networking, storage, or system contention may be increasing the time taken to complete the operation. Involve your infrastructure team or vendor if networking or storage issues are suspected. Troubleshooting steps are described in the Resolution section below.

A snapshot deletion or consolidation operation that runs while a VM is under load can affect use of the virtual machine, for example if the operation continues into business hours. If the virtual machine is registered to an ESX 9.0 or later host, vCenter reports the estimated time to completion and the progress percentage. This estimate is based on the time taken so far and the progress percentage. To view it, navigate to the vCenter UI and locate the consolidate/delete snapshot task in the recent task list. Use this information to determine whether the operation is progressing as fast as expected and whether it might run long enough to become a business problem. You can use the metrics VCF Operations maintains to evaluate whether the VM’s virtual disk IOPS, throughput, and latency are in line with what has been observed previously. For example, navigate to the VM’s metric page and view the Virtual Disk Aggregate of all Instances|Total IOPS metric. If the operation is interrupting the business function, review Determining Progress of Long running Snapshot Operations and Cancelling Them for options for canceling an in-progress snapshot delete or consolidation operation.
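As a rough illustration of how an estimate of this kind can be derived from elapsed time and progress percentage, the sketch below extrapolates linearly and checks the result against the start of business hours. The linear model and the function names are assumptions made for illustration only; they are not the formula or API that vCenter uses.

```python
from datetime import datetime, timedelta


def estimate_remaining(elapsed: timedelta, progress_pct: float) -> timedelta:
    """Linearly extrapolate remaining time (assumes a constant rate of progress)."""
    if not 0 < progress_pct <= 100:
        raise ValueError("progress_pct must be in the range (0, 100]")
    return elapsed * ((100 - progress_pct) / progress_pct)


def finishes_before(start: datetime, now: datetime, progress_pct: float,
                    deadline: datetime) -> bool:
    """Return True if the extrapolated completion time is at or before the deadline."""
    return now + estimate_remaining(now - start, progress_pct) <= deadline


if __name__ == "__main__":
    start = datetime(2025, 1, 6, 22, 0)     # hypothetical task start, 22:00
    now = datetime(2025, 1, 7, 3, 0)        # five hours later
    deadline = datetime(2025, 1, 7, 8, 0)   # hypothetical start of business hours
    print(estimate_remaining(now - start, 40.0))
    print(finishes_before(start, now, 40.0, deadline))
```

In this example, 40 percent progress after five hours extrapolates to a further seven and a half hours, so the operation would be expected to run past 08:00. Actual progress is rarely linear, so treat the result as a rough guide.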

The Snapshot Dashboard also reports virtual machines that require a disk consolidation. This condition is triggered after a snapshot deletion fails. To determine more details of the corresponding snapshot-deletion failure, review the snapshot failure information provided in the dashboard as well as in vCenter.

Note: Long-running snapshot delete and consolidate operations are not expected if the underlying datastore uses native snapshots. Native snapshots are offered by vVol, NFS with VAAI, and vSAN ESA. Deleting a native snapshot does not involve moving data.

Environment

Operations for VMware Cloud Foundation 9.0

Resolution

Problems with the Snapshots Dashboard

Problem: the dashboard is not reporting any snapshots or is missing operations from certain vCenters.

Resolution: VCF Operations collects the data for this feature from each vCenter instance. This data collection is implemented by vCenter adapters. Verify that there is an active integration for each vCenter. If any vCenter adapters are in the stopped state, start collection. If any adapters are reporting a collection warning or an error, address the problem.

Problem: the dashboard snapshot-operations table does not list all successfully completed snapshot delete and consolidation operations.

Resolution: This behavior is intentional. The table only includes successful snapshot delete and consolidation operations that took more than four hours to complete.

Problem: the information reported in the dashboard about a failure is not sufficient to understand why the failure occurred.

Resolution: for some failures, vCenter reports additional information. In the vSphere Client, navigate to the Task Console, expand the row for the task you wish to investigate, and review the related events and error stack (if reported). If this information is insufficient, or if you want a second source, use the Infrastructure Operations > Analyze > Logs page to search for the log statements issued by the operation. Follow the instructions in Troubleshooting vCenter Tasks Using VCF Operations Log Queries.

Problem: A snapshot delete operation failed and the VM needs consolidation. The snapshot-delete failure is not reported in the list of snapshot failures.

Resolution: the list of snapshot failures does not include snapshot delete operations that fail during the second step of the operation. In the first step, snapshots are unlinked; in the second, the required data consolidations are performed. If a snapshot delete fails in the second step, vCenter intentionally reports the task as successful and issues events to warn about the failure. In 9.0, diagnostics does not report this type of failure. To learn more about the failure, log in to the corresponding vCenter and view the task history for the VM.

Troubleshoot a Long Running Snapshot Consolidation or Deletion on vSAN OSA

The time taken for a snapshot consolidation or deletion to complete depends on multiple factors, including the amount of data that must be reconciled. If the virtual machine is registered to an ESX 9.0 or later host, vCenter reports the estimated time to completion and the progress percentage. This estimate is based on the time taken so far and the number of bytes processed already. To view it, navigate to the vCenter UI and locate the consolidate/delete snapshot task in the recent task list. If the VM is not registered to an ESX 9.0 or later host, you can follow KB Estimate the time required to consolidate virtual machine snapshots to roughly estimate the amount of data that must be processed and the time required to process it.

If the VM is powered on, evaluate whether the write IOPS rate to its virtual disks is high. A high write IOPS rate can cause contention between these writes and the read/write operations issued to consolidate snapshots. Navigate to the VM summary page for the VM in the VCF Operations UI and view its virtual disk metrics. If you have enabled instance metrics, VCF Operations reports the write IOPS for each virtual disk. Determine whether the VM has a high write IOPS rate to any of its disks. If the rate is high, try to reduce the number of writes being issued by the VM.
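One way to judge whether a write IOPS rate is high is to compare it with values observed previously for the same disk. The following is a minimal sketch, assuming the per-disk samples have already been exported from VCF Operations into plain Python lists. The data layout, the disk names, and the 1.5x factor are illustrative assumptions, not product defaults.

```python
from statistics import mean


def disks_above_baseline(baseline: dict[str, list[float]],
                         current: dict[str, list[float]],
                         factor: float = 1.5) -> list[str]:
    """Return disks whose current mean write IOPS exceeds factor x the baseline mean."""
    flagged = []
    for disk, samples in current.items():
        history = baseline.get(disk)
        if not samples or not history:
            continue  # skip disks that have no data to compare against
        if mean(samples) > factor * mean(history):
            flagged.append(disk)
    return flagged


if __name__ == "__main__":
    # Hypothetical write IOPS samples keyed by virtual disk name.
    baseline = {"scsi0:0": [200, 220, 180], "scsi0:1": [50, 60, 40]}
    current = {"scsi0:0": [210, 190, 230], "scsi0:1": [400, 380, 420]}
    print(disks_above_baseline(baseline, current))
```

Here only the second disk is flagged, because its current average is well above 1.5 times its historical average. Choose the factor to suit the workload.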

Next, check for some common causes of poor vSAN OSA snapshot-consolidation performance by running the vSAN performance diagnostics troubleshooting workflow offered in VCF Operations and Management. Navigate to Infrastructure Operations -> Storage Operations, and click on view/run diagnostics. Select the troubleshooting goal. This troubleshooting workflow will check for high physical device latency, ongoing resyncs, and high DOM-owner latency. If these conditions are present, follow the vSAN diagnostic guidance.

To identify additional causes, navigate to the “vSAN OSA Performance” dashboard within the “performance -> provider” dashboard group. Select the vSAN cluster containing the VM. If there is significant congestion or latency, consider eliminating some of the IO load on the cluster by quiescing, suspending, or relocating other VMs that are generating a lot of IO traffic.

Finally, to gain insight into specific conditions impacting the VM, open the vCenter UI, navigate to the vSAN I/O Trip Analyzer, and use it to determine whether other bottlenecks are impacting the consolidation.

Troubleshoot a Long Running Snapshot Consolidation or Deletion on Other Storage Types

The “host and VM metrics” dialog presents three performance charts that can help you understand the time taken to complete a snapshot deletion or consolidation operation for non-native snapshots. It is on

The first two charts show the average total latency and the average total throughput for the datastore on which the selected virtual disk is stored. The third reports the write operations issued if the virtual machine is powered on.

If the VM is powered on, evaluate whether the write IOPS rate to its virtual disks is high. A high write IOPS rate can cause contention between these writes and the read/write operations issued to consolidate snapshots. Review each of the “check latency” dialogs for the VM to determine whether the VM has a high write IOPS rate to any of its datastores. If the rate is high, try to reduce the number of writes being issued by the VM. To determine which virtual disks are receiving the writes, navigate to the metrics tab for the VM in the VCF Operations and Management UI and view the virtual disk instance metrics within the virtual disk metrics group.

To accelerate snapshot consolidation, maximize resource availability for the VM by reducing contention on its registered host and/or the datastores receiving the majority of writes. These optimizations can help both powered on and powered off VMs.

First, review the memory and CPU allocation on the host. If the VM is powered on, determine whether the VM is running within a resource pool that is excessively limiting the CPU and memory allocation to the VM. If so, change the resource pool configuration to allocate more resources. In addition, review the overall CPU and memory utilization of the host to which the VM is registered, because snapshot consolidation is performed by that host. If the host is overcommitted, migrate powered-on VMs to other hosts. A second side effect of CPU or memory overcommitment is that a host may drop packets it is receiving, leading to retransmissions, lower overall throughput, and increased datastore latency. To check for such network errors, review the metric Network | Total Received Packets Dropped.

Second, review the datastore throughput and latency for the host on which the VM is running, for each datastore on which the VM has virtual disks. The “check latency” dialogs provide this data for each datastore used by the VM. The charts report the latency and throughput as observed by the host on which the VM is registered. Hosts that mount the same datastore can observe different latencies and throughputs due to network differences and load conditions on the individual hosts. Total latency (throughput) is the average of read and write latency (throughput). Use this information to determine whether the host overall is issuing many IOs to the datastore or is experiencing elevated latencies to the datastore.

If the host-level IOPS to one or more datastores is high, identify other VMs running on the same host that are issuing IOs to those datastores, and migrate them to other hosts. If DRS is in fully automated mode, consider temporarily switching it to partially automated so that DRS does not move the VMs back to the same host. If, on the other hand, the latency is high, use the VCF Operations and Management topology tab for the datastore to determine which VMs have virtual disks on the datastore, and consider migrating some of the virtual disks to other datastores. Use the metrics tab for individual VMs to determine the IOPS a given VM is issuing to the datastore. Note: latency values can be misleading if relatively few IOs are being issued to the datastore.
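The decision flow described above can be sketched roughly as follows. The record layout, the function names, and every threshold are placeholder assumptions chosen for illustration; none of them are VCF Operations defaults, and suitable values depend on the storage in use.

```python
from dataclasses import dataclass


@dataclass
class DatastoreSample:
    """Host-level observation for one datastore (all fields are illustrative)."""
    read_latency_ms: float
    write_latency_ms: float
    iops: float


def total_latency_ms(s: DatastoreSample) -> float:
    """Total latency as the plain average of read and write latency."""
    return (s.read_latency_ms + s.write_latency_ms) / 2


def suggest_action(s: DatastoreSample, high_iops: float = 5000,
                   high_latency_ms: float = 20, min_iops: float = 50) -> str:
    """Map one sample to a next step; the thresholds are placeholder assumptions."""
    if s.iops >= high_iops:
        return "migrate other VMs issuing IO from this host"
    if s.iops < min_iops:
        return "too few IOs; treat latency as unreliable"
    if total_latency_ms(s) >= high_latency_ms:
        return "consider migrating virtual disks to other datastores"
    return "no datastore contention indicated"


if __name__ == "__main__":
    print(suggest_action(DatastoreSample(10, 30, 800)))
    print(suggest_action(DatastoreSample(40, 60, 5)))
```

The low-IO check runs before the latency check so that a high latency computed from only a handful of IOs is not acted on.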

For Problems with Snapshots Creation or Deletion

To Estimate snapshot deletion or consolidation durations

General snapshot issues