Troubleshooting a virtual machine that has stopped responding (VM hang/freeze)

Article ID: 326252

Updated On:

Products

VMware vCenter Server
VMware vSphere ESXi

Issue/Introduction

This article provides steps to isolate possible causes of a vSphere virtual machine becoming unresponsive.

An unresponsive virtual machine does not respond to any connection attempts and may be unable to respond to any attempts to power cycle it. There are a variety of reasons a virtual machine can end up in an unresponsive state. This article helps to identify and resolve these common causes, and, when resolved, return the virtual machine to an operational state.

It is possible to hard power off a virtual machine without troubleshooting the cause, but this will prevent collection and analysis of information which could assist with determining the root cause of the outage. For more information about shutting down the virtual machine, see Powering off an unresponsive virtual machine on an ESXi host.

This article assumes that the issue is currently occurring. If troubleshooting an issue that occurred in the past, some required information may be unavailable.

Virtual machines can become unresponsive, freeze, or hang in the same way as a physical server. Symptoms include:

Virtual machines are unresponsive to tasks or appear to be frozen/hang.
Tasks performed on the virtual machine fail, timeout, or do not start.
Virtual machine does not produce network or disk traffic.
Virtual machine does not allow access via RDP, vCenter Server virtual machine console screen, or other connection methods.
Virtual machines are unreachable over the network.
The VSCSI I/O warning counters in esxtop indicate inactivity.
Virtual machines may report an invalid state.
The screen is frozen, and no actions are possible.
The virtual machine console is black and does not refresh.

One or more of the following errors might be seen:

/init: /init: 151: Syntax error: 0xforce=panic
Kernel panic - not syncing: Attempted to kill init!
PAGE_FAULT_IN_NONPAGED_AREA
Event IDs 6005 and 6008 in the Windows Event Viewer.

Environment

VMware vSphere

Cause

These situations can be caused by:

Virtual machine backups and scheduled tasks may generate a heavy I/O load, leading to unresponsiveness.
The virtual machine may become unresponsive when its hard disk runs out of space.
The guest operating system on the virtual machine may cause the VM to appear unresponsive.
The virtual machine disk controllers are not configured according to best practices.
A network firewall is present between the ESXi host and the vSphere Client.

Resolution

The services a virtual machine (VM) provides may become unresponsive or unreachable due to several causes. These include problems with the applications or guest OS, issues with the Virtual Machine Monitor (VMM) or virtual devices, resource contention on the host, or underlying storage and networking infrastructure issues.

If the guest OS is producing any activity, it is successfully running. In this case, unresponsiveness is likely due to a connectivity problem, resource contention, or a higher-level component (such as an application or service) running within the guest OS.

Phase 1: Validate the Scope

It is important to document accurate symptoms and understand the scope of the problem. Work through the following checks:

Confirm VM unresponsiveness: A VM may stop responding via one interface but function correctly on others. (See: Confirming whether a virtual machine is unresponsive).
- Note: If the VM is responsive but performing poorly, refer to Troubleshooting ESX/ESXi virtual machine performance issues.
Verify the power state: Ensure the VM is actually powered on. If it powered off unexpectedly, power it back on and investigate the shutdown cause. (See: Powering on an ESX/ESXi host's virtual machine and Determining why a virtual machine was powered off or restarted).
- Note: If the VM cannot be powered on, refer to Troubleshooting a virtual machine that is unable to power on.
Determine the number of affected VMs: Are multiple VMs affected or just one? If multiple, look for shared infrastructure dependencies (e.g., specific datastores or hosts).
Test console interaction: Check if the guest OS responds to interaction at the VM console. If it does, the guest OS is running, and the issue is likely a network connectivity problem or is isolated to the guest OS or its applications. (See: Troubleshooting virtual machine network connection issues).
Check for critical errors: Determine if the guest OS has reported critical errors to the console and is in a halted state. (See: Identifying critical Guest OS failures within virtual machines).
Check host responsiveness: Determine if the ESX/ESXi host is also unresponsive. If it is, the scope is larger than a single VM. (See: Determining why an ESX/ESXi host does not respond to user interaction at the console).
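
When several VMs are affected, the search for shared infrastructure dependencies can be sketched as a simple intersection. The following is a minimal illustration, not a VMware tool: the `affected_vms` structure, the VM names, and the host and datastore names are all hypothetical and would be filled in by hand or from your own inventory export.

```python
def shared_dependencies(affected_vms):
    """Return the hosts and datastores common to every affected VM.

    `affected_vms` maps a VM name to a dict with a 'host' string and a
    'datastores' list (a hypothetical structure, not a vSphere API object).
    """
    if not affected_vms:
        return {"hosts": set(), "datastores": set()}
    records = list(affected_vms.values())
    # A host is a shared dependency only if every affected VM runs on it.
    hosts = {record["host"] for record in records}
    common_hosts = hosts if len(hosts) == 1 else set()
    # Intersect the datastore lists to find storage used by all affected VMs.
    common_datastores = set(records[0]["datastores"])
    for record in records[1:]:
        common_datastores &= set(record["datastores"])
    return {"hosts": common_hosts, "datastores": common_datastores}


# Hypothetical example data.
affected = {
    "app01": {"host": "esx-a", "datastores": ["ds-gold", "ds-iso"]},
    "db02": {"host": "esx-b", "datastores": ["ds-gold"]},
}
print(shared_dependencies(affected))
```

In this example the two VMs run on different hosts but share one datastore, which would point the investigation toward storage rather than a single host.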

Phase 2: Identify the Cause

Once you have established that the VM is unresponsive both at the virtual console and over the network while the host remains responsive, investigate the underlying cause:

Review recent operations: Did a specific task trigger the issue? For example, snapshot and vMotion operations briefly "stun" a VM while memory state is copied to disk or across the network. (See: Taking a snapshot with virtual machine memory stuns the virtual machine while the memory is written to disk).
Verify configurations: Review VM and host configurations for common errors that cause unresponsiveness, such as waiting for an unavailable resource.
Validate backing infrastructure: VMs depend on functional storage and networking. If the backing infrastructure fails, the virtual hardware presented to the guest OS is impacted. (See: ESX Server virtual machines stop responding due to shared storage connectivity issues, Verifying that ESX/ESXi virtual machine storage is accessible, Troubleshooting virtual machine network connection issues).
Check resource availability: A problem with CPU/Memory availability or scheduling can cause unresponsiveness. Check if the VM is blocking on unavailable resources or spinning at 100% vCPU utilization. (See: Troubleshooting a virtual machine that has stopped responding: VMM and Guest CPU usage comparison).
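
For the resource availability check, it can help to average a CPU counter over several samples rather than reading a single value. The sketch below assumes the performance data has already been exported as CSV text (for example, from esxtop running in batch mode). The column headers in the sample are simplified placeholders, not the real export headers, so check the header row of your own file and adjust the fragment you search for.

```python
import csv
import io


def average_counter(csv_text, column_fragment):
    """Average the numeric samples in the first column whose header
    contains `column_fragment`. Returns None if nothing matches."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        return None
    index = next(
        (i for i, name in enumerate(header) if column_fragment in name), None
    )
    if index is None:
        return None
    samples = []
    for row in reader:
        try:
            samples.append(float(row[index]))
        except (IndexError, ValueError):
            continue  # skip short rows and non-numeric cells
    return sum(samples) / len(samples) if samples else None


# Placeholder headers and values, for illustration only.
sample = (
    '"time","vm-demo % Run","vm-demo % VmWait"\n'
    '"t1","98.0","0.5"\n'
    '"t2","100.0","1.5"\n'
)
print(average_counter(sample, "% Run"))     # 99.0
print(average_counter(sample, "% VmWait"))  # 1.0
```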

Phase 3: Action Plan

At this stage, you have verified that the host is responsive, there are no shared storage or network outages, and the guest OS has not failed with a critical error, yet the VM remains unresponsive.

Choose the appropriate action plan below based on the suspected architectural layer.
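
The choice between the scenarios in this phase can be summarized as a small decision function. This is one reading of the article's logic, not an official decision tree: the function name and parameters are hypothetical, Scenario A is used as the default starting point, and a failed forced crash is treated as Scenario C on the assumption that it is the last VM-level option.

```python
def choose_scenario(vmwait_high=False, suspend_failed=False, crash_failed=False):
    """Return 'A', 'B', or 'C' for the action plans in this phase.

    A: issue isolated to the guest OS (or %RUN is high); default here.
    B: issue isolated to the VMM (or %VMWAIT is high), or suspend failed.
    C: the VM can be neither suspended nor crashed (VMkernel suspected).
    """
    if crash_failed:
        # Assumption: a failed forced crash leaves only the host-level plan.
        return "C"
    if vmwait_high or suspend_failed:
        return "B"
    return "A"


print(choose_scenario())                                        # A
print(choose_scenario(vmwait_high=True))                        # B
print(choose_scenario(suspend_failed=True, crash_failed=True))  # C
```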

Scenario A: Issue is isolated to the Guest OS (or `%RUN` is high)

If the VMM is functioning correctly, the guest OS may be hanging just as it would on physical hardware. (See: Troubleshooting unresponsive guest operating system issues).

1. Collect performance data while the problem is actively occurring.
2. Attempt to manually induce a kernel panic inside the guest OS to collect internal state information.
- Windows: Microsoft article 927069: How to generate a complete crash dump file or a kernel crash dump by using an NMI on a Windows-based system
- Linux: Linux Documentation Project article: Magic SysRq key
- Note: If this produces useful diagnostic info, engage the guest OS vendor. If not, proceed to step 3.
3. Suspend the virtual machine to collect its internal state. For more information, see Suspending a virtual machine on ESX/ESXi to collect diagnostic information.
- Hardware version 10 and older: Collect the .vmss file.
- Hardware version 11 and newer: Collect both the .vmss and .vmem files.
- Note: If a management task prevents suspension, see Restarting the Management agents on an ESX or ESXi Server. If suspension still fails, skip to Scenario B.
4. Collect diagnostic logs from the host running the VM. For more information, see Collecting diagnostic information for VMware products.
5. Power cycle the virtual machine (Power on, then Reset).
6. Engage VMware Support and provide the performance data, suspend state files, and host logs collected in steps 1, 3, and 4. For more information, see Creating and managing Broadcom support cases.
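
The hardware-version rule in the suspend step above can be expressed as a short filter over a listing of the VM's directory. This is a minimal sketch: the function name and the file names in the example are hypothetical, and it only selects names from a list you supply rather than touching a datastore.

```python
def suspend_files_to_collect(filenames, hardware_version):
    """Pick the suspend state files to send to support.

    Hardware version 10 and older: only the .vmss file.
    Hardware version 11 and newer: both the .vmss and .vmem files.
    """
    wanted = (".vmss",) if hardware_version <= 10 else (".vmss", ".vmem")
    return sorted(name for name in filenames if name.lower().endswith(wanted))


# Hypothetical directory listing.
listing = ["demo.vmx", "demo-1a2b3c4d.vmss", "demo-1a2b3c4d.vmem", "vmware.log"]
print(suspend_files_to_collect(listing, 10))
print(suspend_files_to_collect(listing, 13))
```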

Scenario B: Issue is isolated to the Virtual Machine Monitor (or `%VMWAIT` is high)

If attempts to suspend the VM fail but the host is stable, force the virtual machine to crash.

1. Collect performance data while the problem is actively occurring.
2. Forcefully crash the virtual machine to collect information about its internal state.
3. Engage VMware Support and provide the performance data and VM crash state information collected in steps 1 and 2. For more information, see Creating and managing Broadcom support cases.

Scenario C: Issue is isolated to the VMkernel (VMM issues, but suspend/crash fails)

If you cannot suspend or crash the VM, this indicates a problem with the VMkernel. You must intentionally crash the host to gather data.

1. Collect performance data while the problem is actively occurring.
2. Evacuate unaffected VMs off the host using vMotion. Place the host in Maintenance Mode to prevent new VMs from starting.
3. Configure the host to panic on receiving a non-maskable interrupt (NMI), then issue an NMI. (See: Using hardware NMI facilities to troubleshoot unresponsive hosts).
4. Capture the Purple Diagnostic Screen (PSOD). Take a screenshot or photograph of the console once the diagnostic dump completes, then restart the host.
5. Collect diagnostic information from the host upon reboot.
6. Engage VMware Support and provide the performance data, PSOD screenshot, and host logs collected in steps 1, 4, and 5. For more information, see Creating and managing Broadcom support cases.

Additional Information

Restarting the Management agents in ESXi
Powering on an ESX/ESXi host's virtual machine
Verifying that ESX/ESXi virtual machine storage is accessible
Troubleshooting virtual machine network connection issues
Identifying critical Guest OS failures within virtual machines
Collecting diagnostic information for VMware products
ESX Server virtual machines stop responding due to shared storage connectivity issues
Powering off an unresponsive virtual machine on an ESXi host
Confirming whether a virtual machine is unresponsive
Virtual machine becomes unresponsive or inactive when taking a snapshot
Using hardware NMI facilities to troubleshoot unresponsive hosts
Determining why an ESX/ESXi host does not respond to user interaction at the console
Troubleshooting a virtual machine that has stopped responding: VMM and Guest CPU usage comparison
Determining why a virtual machine was powered off or restarted
Troubleshooting ESX/ESXi virtual machine performance issues
Troubleshooting a virtual machine that is unable to power on
