VM Disk Consolidation Takes Extended Time (5+ Hours) After Power Event



Article ID: 420095



Products

VMware vSphere ESXi

Issue/Introduction

Symptoms: 

  • A lock (.lck) file is created on the VM's VMDK.
  • A virtual machine (VM) disk consolidation, which can be triggered after a power outage or a failed snapshot cleanup, takes an excessively long time (5 hours or more) to complete.
  • The VM cannot be powered on or is reported as inaccessible.

Environment

VMware ESX

Cause

The extended consolidation time is typically caused by a long chain of snapshot delta disks (VMDK files) that must be merged back into the base disk. The merge is resource-intensive and requires significant I/O, so it can take many hours, especially for large-capacity disks (e.g., 5 TB) with numerous delta files.
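To see why the merge can run for hours, a rough estimate divides the data held in the delta disks by an assumed effective merge throughput. This is a back-of-the-envelope sketch only: the function name and any throughput figure you pass in are hypothetical, and real consolidation time also depends on chain length, datastore latency, and competing I/O.

```python
"""Rough estimate of snapshot consolidation time (illustrative only).

Assumption (not from this article): consolidation time is dominated by
copying the data held in the delta disks into the base disk at a roughly
constant effective throughput.
"""


def estimate_consolidation_hours(delta_sizes_gb, throughput_mb_per_s):
    """Return the estimated hours needed to merge the given delta disks.

    delta_sizes_gb: size of each delta VMDK in the chain, in GB (1 GB = 1024 MB).
    throughput_mb_per_s: assumed sustained merge throughput, in MB/s.
    """
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    total_mb = sum(delta_sizes_gb) * 1024
    seconds = total_mb / throughput_mb_per_s
    return seconds / 3600
```

For example, `estimate_consolidation_hours([1024, 512, 512], 100)` returns about 5.83: roughly 2 TB of delta data at an assumed 100 MB/s already lands in the "5+ hours" range described above.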

Resolution

The following steps and recommendations focus on immediate actions to optimize the recovery process and long-term actions to prevent recurrence.

Immediate Recovery Steps to Reduce Downtime

If a VM is inaccessible because of a suspected lock or a pending consolidation, recovery can be sped up by clearing the lock manually rather than rebooting the ESXi host to release it.

 

  1. Identify and Clear the VM Lock Manually:
    • Before resorting to a host reboot, check for and manually clear any locks held on the VM's files, such as the .vmx file. This often allows the VM to power on without the delay of a host reboot.
      See KB 314365 for detailed steps.

  2. Once the lock has been cleared and disk consolidation has started, apply the following recommendations to help improve consolidation performance:
      1. Move VM to Faster Storage: If possible, use Storage vMotion to migrate the VM to another datastore backed by faster storage (e.g., an all-flash/NVMe array) to speed up the I/O-intensive merge.
      2. Reduce Host Load: Use vMotion to move non-critical VMs off the ESXi host, freeing I/O, CPU, and memory resources for the consolidation task.
      3. Limit Competing Activity: Ensure that no large file copies, backups, or other I/O-heavy tasks are running on the datastore while consolidation is active.
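When identifying a lock (step 1), running `vmkfstools -D` against a locked file reports the lock's owner as a UUID. Per VMware's documentation on investigating file locks, the last 12 hex digits of a non-zero owner usually correspond to the MAC address of the host holding the lock, which can then be matched against each host's NICs (for example with `esxcli network nic list`). The helper below is a hypothetical sketch that assumes this owner format; it only parses text and does not query ESXi.

```python
import re

# Assumed owner format in `vmkfstools -D` lock output:
#   "owner XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXXXXX"
# where the last 12 hex digits are taken to be the lock holder's MAC address.
OWNER_RE = re.compile(
    r"owner\s+([0-9a-f]{8})-([0-9a-f]{8})-([0-9a-f]{4})-([0-9a-f]{12})",
    re.IGNORECASE,
)


def lock_owner_mac(lock_text):
    """Return the lock holder's MAC as "aa:bb:cc:dd:ee:ff", or None.

    None is returned when no owner field is found, or when the owner is all
    zeros (assumed here to mean no host currently holds the lock).
    """
    match = OWNER_RE.search(lock_text)
    if match is None:
        return None
    if set("".join(match.groups())) == {"0"}:
        return None
    mac_hex = match.group(4).lower()
    # Split the 12 hex digits into six colon-separated octets.
    return ":".join(mac_hex[i:i + 2] for i in range(0, 12, 2))
```

For example, an owner field of `owner 4dc02d10-8ff9ea75-0cb8-001a64f1e9bc` yields `00:1a:64:f1:e9:bc`. If that MAC belongs to a host other than the one attempting the power-on, the lock is likely held there; KB 314365 remains the reference for releasing it.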

Additional Information

The best way to reduce consolidation downtime is to prevent the long chain of snapshots from accumulating in the first place.

Preventative Measures

  • Backup Software Verification: Ensure your backup software is correctly configured and deletes the snapshots it creates once each backup job completes. Interruptions such as power outages can leave this cleanup incomplete.
  • Implement Snapshot Monitoring Alarms:
    • Configure monitoring to alert you when a VM snapshot is older than a set threshold (e.g., 24-48 hours).
    • Set up alarms that trigger when a VM's snapshot chain exceeds a set length or when vSphere reports that consolidation is required for the VM.
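The monitoring checks above can be prototyped before they are turned into alarms. The sketch below uses hypothetical `Snapshot` and `VmState` classes loosely modeled on the vSphere snapshot tree and the VM's consolidation-needed flag; the 48-hour and depth-3 thresholds are example values, not VMware defaults. Collecting the data from vCenter is assumed and not shown.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timedelta


# Hypothetical data model, loosely mirroring a vSphere snapshot tree.
@dataclass
class Snapshot:
    name: str
    created: datetime
    children: list[Snapshot] = field(default_factory=list)


@dataclass
class VmState:
    name: str
    root_snapshots: list[Snapshot]
    consolidation_needed: bool = False


def iter_snapshots(snapshots):
    """Yield every snapshot in the tree, depth-first."""
    for snap in snapshots:
        yield snap
        yield from iter_snapshots(snap.children)


def max_chain_depth(snapshots):
    """Return the length of the longest parent-to-child snapshot chain."""
    if not snapshots:
        return 0
    return 1 + max(max_chain_depth(s.children) for s in snapshots)


def snapshot_alerts(vm, now, max_age=timedelta(hours=48), max_depth=3):
    """Return alert messages for old snapshots, long chains, or pending consolidation.

    max_age and max_depth are example thresholds (assumptions), not VMware defaults.
    """
    alerts = []
    for snap in iter_snapshots(vm.root_snapshots):
        age = now - snap.created
        if age > max_age:
            hours = age.total_seconds() / 3600
            alerts.append(f"{vm.name}: snapshot '{snap.name}' is {hours:.0f} h old")
    depth = max_chain_depth(vm.root_snapshots)
    if depth > max_depth:
        alerts.append(f"{vm.name}: snapshot chain depth {depth} exceeds {max_depth}")
    if vm.consolidation_needed:
        alerts.append(f"{vm.name}: disk consolidation is needed")
    return alerts
```

Computing the chain depth separately from snapshot age keeps the two alarm conditions independent, so a VM with many fresh snapshots is still flagged even if none of them has crossed the age threshold.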