Troubleshooting a Virtual Machine that has stopped responding (VM hang/freeze)

Products

VMware vCenter Server
VMware vSphere ESXi

Issue/Introduction

This article provides steps to isolate possible causes of a vSphere virtual machine becoming unresponsive.

An unresponsive virtual machine does not respond to any connection attempts and may be unable to respond to any attempts to power cycle it. There are a variety of reasons a virtual machine can end up in an unresponsive state. This article helps to identify and resolve these common causes, and, when resolved, return the virtual machine to an operational state.

It is possible to hard power off a virtual machine without troubleshooting the cause, but this will prevent collection and analysis of information which could assist with determining the root cause of the outage. For more information about shutting down the virtual machine, see Powering off an unresponsive virtual machine on an ESXi host.

This article assumes that the issue is currently occurring. If troubleshooting an issue that occurred in the past, some required information may be unavailable.

Symptoms:

Virtual machines can become unresponsive, freeze, or hang in the same way as a physical server. Symptoms include:
Virtual machines are unresponsive to tasks or appear to be frozen or hung.
Tasks performed on the virtual machine fail, time out, or do not start.
Virtual machine does not produce network or disk traffic.
Virtual machine does not allow access via RDP, vCenter Server virtual machine console screen, or other connection methods.
Virtual machines are unreachable over the network.
The VSCSI I/O warning counters in esxtop indicate inactivity.
Virtual machines may report an invalid state.
The screen is frozen, and no actions are possible.
The virtual machine console is black and does not refresh, or it shows an error.

One or more of the following errors might be seen:

/init: /init: 151: Syntax error: 0xforce=panic
Kernel panic - not syncing: Attempted to kill init!
PAGE_FAULT_IN_NONPAGED_AREA
Error codes 6005, 6008 in Windows Event Viewer.

Environment

VMware vSphere

Cause

These situations can be caused by:

Virtual machine backups and other scheduled tasks that generate heavy I/O load.
The virtual machine's hard disk running out of space.
A problem within the guest operating system that makes the virtual machine appear unresponsive.
Virtual machine disk controllers that are not configured according to best practices.
A network firewall between the ESXi host and the vSphere Client.

Resolution

The services a virtual machine provides may become unresponsive or unreachable due to a number of causes, including problems with the applications or guest OS within the virtual machine, problems with the virtual machine monitor or virtual devices, resource contention on the host, or issues with underlying storage or networking infrastructure.

If the guest OS is producing any activity, it is successfully running. In this case, unresponsiveness is likely due to a connectivity problem or resource contention or is specific to a higher-level component such as an application or service running within the guest OS.

Validate the scope

It is important to have accurate symptoms and an understanding of the scope of a problem. To confirm the scope of the problem, work through these checks:

Confirm that the virtual machine is actually unresponsive. It is possible that the virtual machine is not responding via one interface but is functioning correctly on others. For more information on testing whether a virtual machine is genuinely unresponsive, see Confirming whether a virtual machine is unresponsive.

If a virtual machine is responsive but performing poorly, see Troubleshooting ESX/ESXi virtual machine performance issues.
Verify that the virtual machine is powered on. If the virtual machine has been powered off unexpectedly, power it back on and then troubleshoot the cause of the unexpected shutdown. For more information, see Determining why a virtual machine was powered off or restarted.

Note: If a virtual machine is powered off and cannot be powered back on, see Troubleshooting a virtual machine that is unable to power on.

Determine whether this issue is affecting multiple virtual machines or just one. If multiple virtual machines are affected, consider the similarities between the affected virtual machines when attempting to narrow the potential scope. In particular, focus on shared infrastructure which the group of affected virtual machines depend on, and whether all virtual machines depending on that common infrastructure are affected.
Determine whether the guest OS is responsive to interaction at the virtual machine console. If an issue has been isolated to the guest OS or applications within the virtual machine, and the guest OS is responsive at the console, interact with the guest OS at the console to address the problem. For more information, see Troubleshooting virtual machine network connection issues.
Determine whether the guest OS has reported any critical errors to the console and is sitting in a halted state. For more information, see Identifying critical Guest OS failures within virtual machines.
Determine whether the ESX/ESXi host is unresponsive too. If the host is unresponsive as well, the scope is larger than initially assumed. For more information, see Determining why an ESX/ESXi host does not respond to user interaction at the console.
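
The first check above (a virtual machine may fail on one interface while still answering on others) can be sketched in Python. This is a minimal illustration using only the standard library, not a VMware tool: the function names `probe_ports` and `classify` are hypothetical, and the ports to probe (for example 3389 for RDP or 22 for SSH) are assumptions that depend on which services the guest actually runs.

```python
import socket


def probe_ports(host, ports, timeout=3.0):
    """Try a TCP connection to each port; return {port: True/False}."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or no route.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results


def classify(results):
    """Summarize probe results into a coarse reachability label."""
    if not results:
        return "unknown"
    reachable = sum(1 for ok in results.values() if ok)
    if reachable == 0:
        return "unreachable on all probed ports"
    if reachable == len(results):
        return "reachable on all probed ports"
    return "partially reachable"
```

For example, `classify(probe_ports("192.0.2.10", [22, 3389]))` (a documentation-only address) returning "partially reachable" would suggest a service-level or connectivity problem rather than a hung virtual machine. A result of "unreachable on all probed ports" does not by itself prove the guest is hung; the console checks above are still required.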

Identify the cause

At this point, you have established that one or more virtual machines are unresponsive at both the virtual console and via the network. The host itself is responsive. A problem may exist with resource accessibility or contention, or with underlying storage or networking infrastructure.

To identify the cause:

Determine whether the problem is triggered by an operation or task being performed on the virtual machine. For example, snapshot and vMotion operations both stun a virtual machine for brief periods of time while memory state is copied across the network or to disk. For more information, see Taking a snapshot with virtual machine memory stuns the virtual machine while the memory is written to disk.
Some common configuration errors can also lead to a virtual machine becoming unresponsive, for example while it waits for a resource. Review the virtual machine and host configurations.
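Virtual machines depend on functional backing infrastructure. If there is an issue with the backing storage or networking infrastructure which the virtual machine depends on, the virtual hardware which a virtual machine presents to the guest OS may be impacted. Address the underlying storage or networking issue. For more information, see Verifying that ESX/ESXi virtual machine storage is accessible, ESX Server virtual machines stop responding due to shared storage connectivity issues, and Troubleshooting virtual machine network connection issues.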
Virtual machines depend on available host resources (CPU, Memory), and the guest OS consumes those resources. A problem with resource availability or scheduling inside or outside the virtual machine may cause it to become unresponsive. The virtual machine may also be blocking on unavailable resources or spinning at 100% vCPU utilization. For more information, see Troubleshooting a virtual machine that has stopped responding: VMM and Guest CPU usage comparison (1017926).
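
When several virtual machines are affected, narrowing the cause to shared backing infrastructure amounts to intersecting their dependencies. The sketch below assumes you have already gathered, by whatever means, a mapping from each virtual machine to the components it depends on. The function name `shared_components` and all inventory names are hypothetical.

```python
def shared_components(affected, inventory):
    """Find components common to every affected VM.

    inventory maps VM name -> set of component names (datastore, host, network).
    Returns {component: [unaffected VMs that also depend on it]}.
    An empty list means every VM depending on that component is affected,
    which makes the component a stronger suspect.
    """
    affected = list(affected)
    if not affected:
        return {}
    common = set(inventory[affected[0]])
    for vm in affected[1:]:
        common &= set(inventory[vm])
    report = {}
    for component in sorted(common):
        report[component] = sorted(
            vm for vm, deps in inventory.items()
            if component in deps and vm not in affected
        )
    return report


# Hypothetical inventory, for illustration only.
inventory = {
    "vm-a": {"datastore1", "host1", "vlan10"},
    "vm-b": {"datastore1", "host2", "vlan10"},
    "vm-c": {"datastore2", "host1", "vlan10"},
}
suspects = shared_components(["vm-a", "vm-b"], inventory)
```

Here both affected virtual machines share `datastore1` and `vlan10`, but the unaffected `vm-c` also uses `vlan10`, so the datastore is the more likely common factor.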

Action Plan

At this point, you have established that the host running the virtual machine(s) is both responsive and not encountering any shared storage or networking infrastructure issues. The guest OS has not failed with a critical error, but remains unresponsive at the virtual machine console and via the network.

Take action to recover or collect information about the unresponsive virtual machine based on the architectural layer which is suspect:

If an issue has been isolated to the guest OS, or the %RUN is relatively high, but the virtual machine monitor is functioning correctly, move investigation to within the virtual machine's guest OS or applications. A guest OS can become unresponsive inside a virtual machine in the same way it can on physical hardware. For more information, see Troubleshooting unresponsive guest operating system issues.
1. Collect performance data while the problem is happening.
2. Attempt to manually induce a panic of the kernel inside the guest OS to collect additional information about its internal state. For more information, see:
  - Microsoft article 927069: How to generate a complete crash dump file or a kernel crash dump by using an NMI on a Windows-based system
  - Linux Documentation Project article: Magic SysRq key
    
    If useful diagnostic information is produced by the guest OS in response to one of these events, engage the guest OS vendor to investigate further.
3. If the above step does not produce useful information, suspend the virtual machine to collect information about its internal state and open a case with VMware Support. For more information, see below.
  
  Note: If the virtual machine cannot be suspended because another management task is in progress, see Restarting the Management agents on an ESX or ESXi Server. If attempts to suspend the virtual machine fail and no management task appears to be present, skip to the next section and attempt to crash the virtual machine.
  a. Suspend the virtual machine and collect the suspend state files.
    - For virtual machines with hardware versions up to and including 10, this is the file with the suffix .vmss.
    - For virtual machines with hardware version 11 or newer, there are two files, one with the suffix .vmss and one with the suffix .vmem. Collect both.
    - For more information, see Suspending a virtual machine on ESX/ESXi to collect diagnostic information.
  b. Collect logs from the host running the virtual machine. For more information, see Collecting diagnostic information for VMware products.
  c. Power the virtual machine back on, then reset it.
  d. Engage VMware Support, providing the information collected in steps 1, 3a and 3b. For more information, see Creating and managing Broadcom support cases.
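
The rule for which suspend state files to collect depends only on the virtual hardware version, so it can be expressed as a small helper. This is an illustrative sketch: the function names and the example file names are hypothetical, and the listing of the virtual machine directory is assumed to have been obtained separately.

```python
def suspend_state_suffixes(hardware_version):
    """Return the suspend state file suffixes for a virtual hardware version."""
    if hardware_version < 1:
        raise ValueError("hardware version must be a positive integer")
    if hardware_version <= 10:
        return [".vmss"]            # up to and including version 10
    return [".vmss", ".vmem"]       # version 11 or newer: collect both


def files_to_collect(file_names, hardware_version):
    """Filter a directory listing down to the suspend state files."""
    suffixes = tuple(suspend_state_suffixes(hardware_version))
    return sorted(n for n in file_names if n.lower().endswith(suffixes))


# Hypothetical directory listing, for illustration only.
listing = ["web01.vmx", "web01-1a2b3c4d.vmss", "web01-1a2b3c4d.vmem", "vmware.log"]
old_hw = files_to_collect(listing, 10)
new_hw = files_to_collect(listing, 11)
```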

If an issue has been isolated to the virtual machine monitor, or the %VMWAIT is relatively high, or attempts to suspend the virtual machine have failed, collect performance data and forcefully crash the virtual machine to collect additional information about its internal state.
1. Collect performance data while the problem is happening.
2. Crash the virtual machine to collect information about its internal state.
3. Engage VMware Support, providing the information collected in steps 1 and 2. For more information, see Creating and managing Broadcom support cases.

If an issue has been isolated to the virtual machine monitor but attempts to suspend or crash the virtual machine fail, this reflects a problem with the VMkernel. Collect a log bundle from the host, evacuate all unaffected virtual machines from the host, and use an NMI to intentionally generate a purple diagnostic screen.
1. Collect performance data while the problem is happening.
2. Move all unaffected virtual machines off of the host using vMotion. If possible, use Maintenance Mode to prevent additional virtual machines from being started on the host.
3. Configure the host to panic on receiving a non-maskable interrupt and then issue an NMI to trigger a panic. For more information, see Using hardware NMI facilities to troubleshoot unresponsive hosts.
4. After the host has generated a purple diagnostic screen and completed dump of diagnostic information, take a screenshot or photograph of the console and restart the host.
5. Collect diagnostic information from the host. For more information, see Collecting diagnostic information for VMware products.
6. Engage VMware Support, providing the information collected in steps 1, 4 and 5. For more information, see Creating and managing Broadcom support cases.
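
The choice between the three action plans above can be summarized as a decision function. This is a sketch of one reading of the article, not an official procedure. The article only says %RUN or %VMWAIT is "relatively high", so the 80% threshold is an assumption, as are the names `Observation` and `choose_plan`. Treating a failed crash attempt as the trigger for the VMkernel plan is also an interpretation, since a failed suspend alone leads to the crash plan.

```python
from dataclasses import dataclass

HIGH_PCT = 80.0  # assumption: the article gives no numeric threshold


@dataclass
class Observation:
    run_pct: float                  # %RUN observed for the VM
    vmwait_pct: float               # %VMWAIT observed for the VM
    isolated_to_guest: bool = False
    isolated_to_vmm: bool = False
    suspend_failed: bool = False
    crash_failed: bool = False


def choose_plan(obs, high_pct=HIGH_PCT):
    """Return which architectural layer to investigate."""
    if obs.crash_failed:
        # The VM could not even be crashed: suspect the VMkernel.
        return "vmkernel: evacuate host, trigger NMI, collect diagnostics"
    if obs.isolated_to_vmm or obs.vmwait_pct >= high_pct or obs.suspend_failed:
        return "vmm: collect performance data, crash the VM"
    if obs.isolated_to_guest or obs.run_pct >= high_pct:
        return "guest: induce guest panic, else suspend the VM"
    return "undetermined: continue validating scope and cause"
```

For instance, `choose_plan(Observation(run_pct=95.0, vmwait_pct=1.0))` selects the guest plan, while the same observation with `suspend_failed=True` selects the virtual machine monitor plan.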

Additional Information

Restarting the Management agents in ESXi
Powering on an ESX/ESXi host's virtual machine
Verifying that ESX/ESXi virtual machine storage is accessible
Troubleshooting virtual machine network connection issues
Identifying critical Guest OS failures within virtual machines
Collecting diagnostic information for VMware products
ESX Server virtual machines stop responding due to shared storage connectivity issues
Powering off an unresponsive virtual machine on an ESXi host
Confirming whether a virtual machine is unresponsive
Virtual machine becomes unresponsive or inactive when taking a snapshot
Using hardware NMI facilities to troubleshoot unresponsive hosts
Determining why an ESX/ESXi host does not respond to user interaction at the console
Troubleshooting a virtual machine that has stopped responding: VMM and Guest CPU usage comparison
Determining why a virtual machine was powered off or restarted
Troubleshooting ESX/ESXi virtual machine performance issues
Troubleshooting a virtual machine that is unable to power on