VMs taking 9 minutes to boot during automation pipeline



Article ID: 429866


Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

During automated deployment pipelines (such as nightly QA or CI/CD deployments), virtual machines (VMs) may intermittently take far longer than expected (for example, up to 9 minutes) to complete the power-on or deployment phase.

In the vCenter Task Console, the "Power On virtual machine" or "Relocating" tasks may remain in a "Running 0%" or "Running 99%" state for several minutes before finally completing. No error is reported; the operations simply take much longer than normal.

Environment

VMware vCenter Server 7.0.x and 8.0.x

VMware vSphere ESXi 7.0.x and 8.0.x

Cause

This issue is caused by a combination of Network File Copy (NFC) stream saturation and stale file locks handled by the ESXi host's management agent (hostd).

When an automation framework executes multiple concurrent high-overhead storage tasks (such as "Full Clones" or "Revert to Snapshot" operations), the host's NFC limits are quickly saturated. If a "Power On" task is triggered while the previous clone or revert operation still holds an exclusive lock on the VM's .vmdk or .vmx files (or while the storage array is slow to clear that lock), vCenter places the task in a "Running 0%" state. The task remains queued in that state until the lock is released.

Additionally, residual configurations (such as unsupported VMware Workstation lines in the .vmx file) or disconnected/broken NFS mounts attached to the VM's CD-ROM can cause the ESXi hostd agent to hang while attempting to validate resource paths during the boot sequence.

Resolution

 

  • Identify the ESXi hosts where the stuck VMs reside.

  • If tasks are permanently hung and VMs become orphaned or inaccessible, gracefully migrate healthy VMs off the host, then reboot the affected ESXi host to clear the hung hostd processes and stale file handles.

  • Remove the orphaned VMs from the vCenter inventory and re-register them.

  • Audit the VM configurations: ensure there are no broken NFS mounts mapped to virtual CD-ROM drives, and strip any non-standard/unsupported lines from the .vmx files.
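The configuration audit in the last step can be partly scripted. The following is a minimal sketch, not a supported tool: it assumes the .vmx file has already been copied off the datastore and read as text, that CD-ROM ISO backings appear as `<device>.deviceType = "cdrom-image"` with a matching `<device>.fileName` entry, and that the list of key prefixes to flag (`suspect_prefixes`) is supplied by the administrator. The function names are hypothetical. It only reports candidates for manual review; it does not test whether an NFS path is actually reachable.

```python
import re

# Matches simple .vmx entries of the form: key = "value"
_VMX_LINE = re.compile(r'^\s*([^#\s=]+)\s*=\s*"(.*)"\s*$')


def parse_vmx(text):
    """Return a dict of key -> value parsed from .vmx file text."""
    entries = {}
    for line in text.splitlines():
        match = _VMX_LINE.match(line)
        if match:
            entries[match.group(1)] = match.group(2)
    return entries


def audit_vmx(text, suspect_prefixes=()):
    """List entries worth reviewing by hand.

    suspect_prefixes is a hypothetical, administrator-supplied list of
    key prefixes considered non-standard in this environment.
    """
    entries = parse_vmx(text)
    findings = []
    for key, value in entries.items():
        device, _, attr = key.rpartition(".")
        # Assumption: ISO-backed CD-ROM drives use deviceType "cdrom-image".
        if attr.lower() == "devicetype" and value.lower() == "cdrom-image":
            backing = entries.get(device + ".fileName", "")
            findings.append(
                f"{device}: CD-ROM backed by ISO '{backing}'; "
                "verify the datastore path is reachable"
            )
        if any(key.lower().startswith(p.lower()) for p in suspect_prefixes):
            findings.append(f"{key}: matches a prefix flagged for review")
    return findings
```

A typical use would be to run `audit_vmx(open(path).read(), suspect_prefixes=[...])` over each template or golden-image .vmx before the nightly pipeline starts, and to fail the run early if any findings are returned.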

 

Additional Information

Preventative Automation Adjustments:

  1. Change Provisioning Type: Switch automation scripts from "Full Clones" to "Linked Clones" to substantially reduce I/O overhead and avoid NFC stream saturation.

  2. Increase Stagger Delays: If the pipeline relies on a Revert to Snapshot -> Power On workflow, increase the programmatic delay between these two steps (e.g., from 30 seconds to 120 seconds) to ensure the storage array has ample time to release the file locks.

  3. Throttle Concurrency: Limit the maximum number of simultaneous deployments within the automation tool to align with the host's supported concurrent task limits.
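Adjustments 2 and 3 can be sketched in the automation layer as follows. This is an illustrative outline under stated assumptions, not a reference implementation: `revert` and `power_on` are hypothetical callables that wrap whatever API the pipeline already uses and block until the corresponding task completes, `stagger_s` defaults to the 120-second example above, and `max_concurrent=4` is a placeholder that should be replaced with a value appropriate to the host's supported concurrent task limits.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def revert_then_power_on(vm_name, revert, power_on, stagger_s=120.0):
    """Revert a VM, wait a fixed stagger delay, then power it on.

    revert and power_on are hypothetical callables supplied by the
    pipeline; each takes the VM name and blocks until its task finishes.
    """
    revert(vm_name)
    # Give the storage array time to release file locks before power-on.
    time.sleep(stagger_s)
    power_on(vm_name)
    return vm_name


def run_throttled(vm_names, deploy_one, max_concurrent=4):
    """Run deploy_one(name) for every VM, at most max_concurrent at a time.

    max_concurrent=4 is an arbitrary placeholder, not a documented limit.
    Results are returned in the same order as vm_names.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(deploy_one, vm_names))
```

In a pipeline, `deploy_one` would typically be a small wrapper such as `lambda name: revert_then_power_on(name, revert, power_on)`. Capping the worker pool bounds the number of storage-heavy tasks in flight, while the stagger delay separates each revert from its power-on.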
