Troubleshooting a single virtual machine failure on an ESXi host

Products

VMware vSphere ESXi

Issue/Introduction

This article guides readers through troubleshooting a single virtual machine that is failing on an ESXi host. It is best suited to repeated failures whose cause is unknown. If the failure is reproducible, meaning it can be repeated by following a sequence of steps, follow the instructions at the end of this knowledge base article to gather the support script data and file a support request. If the failure is a one-off, gathering the support script data and filing a support request is recommended; include as much information as possible about the environment and what was happening at the time of the failure. Virtual machine failures may be caused by factors outside of VMware, and the cause is not always evident from the support script data.

The guest operating system has terminated unexpectedly
The virtual machine is not accessible
A blue screen with a Stop error code may be visible on the console
An error including the term kernel panic is visible on the console
Errors similar to:
- BAD_POOL_HEADER
- KMODE_EXCEPTION_NOT_HANDLED
- PAGE_FAULT_IN_NONPAGED_AREA
- STOP: 0x00000050 (0xFFFFFFF8,0x00000000,0xF9CF5C88,0x00000000)
- STOP: 0x00000019 (0x00000000,0xC00E0FF0,0xFFFFEFD4,0xC0000000)
- Unknown inaccessible
- SCSI: 4506: Cannot find a path to device vmhbax:x:x in a good state
- WARNING: LVM: 4844: vmhbaH:T:L:P detected as a snapshot device. Disallowing access to the LUN since signaturing is turned off.
- Date esx vmkernel: Time cpu3: 10340 SCSI: 5637: status SCSI LUN is in snapshot state, rstatus 0xc0de00 for vmhbax:x:x. residual R 999, CR 8-, ER3
- Date esx vmkernel: Time cpu3: world ID SCSI 6624: Device vmhbax:x:x. is a deactivated snapshot
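
When a large amount of console text or log output has to be reviewed, the symptom strings listed above can be searched for automatically. The following is a minimal sketch, not a VMware tool: the function name match_symptoms and the group labels are hypothetical, and the patterns cover only the strings shown in this article.

```python
import re

# Patterns drawn from the symptom list above; the labels are illustrative only.
SYMPTOM_PATTERNS = {
    "windows_stop_error": re.compile(
        r"BAD_POOL_HEADER|KMODE_EXCEPTION_NOT_HANDLED|"
        r"PAGE_FAULT_IN_NONPAGED_AREA|STOP: 0x[0-9A-Fa-f]{8}"
    ),
    "kernel_panic": re.compile(r"kernel panic", re.IGNORECASE),
    "storage_path": re.compile(r"Cannot find a path to device"),
    "snapshot_lun": re.compile(
        r"detected as a snapshot device|LUN is in snapshot state|"
        r"is a deactivated snapshot"
    ),
}


def match_symptoms(text):
    """Return the sorted labels of every symptom group found in text."""
    return sorted(
        label for label, pattern in SYMPTOM_PATTERNS.items() if pattern.search(text)
    )
```

A match only indicates that the text resembles one of the listed symptoms; it does not identify the cause of the failure.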

Resolution

Validate that each troubleshooting step below is true for the environment. Each step provides instructions or a link to a document to help eliminate possible causes and take corrective action as necessary. The steps are ordered in the most appropriate sequence to isolate the issue and identify the proper resolution. Do not skip a step.

Verify that the virtual machine is not in an unresponsive state.

During an unresponsive state, the operating system seems to be paralyzed, no error messages are displayed, and the screen freezes or the application does not respond to users' actions. Keyboard input or mouse clicking has no effect, regardless of where the cursor is placed, but the operating system is still running. Unlike a failure, sometimes an unresponsive system resolves itself, and the application resumes its normal execution without user involvement.

A failure is a situation where the operating system has terminated and is no longer running. There may be a diagnostic screen or error message visible in its place.

Note: There is a difference between a virtual machine failing and the guest operating system failing. If the virtual machine fails, it powers off, and vmware-core files may have been created in the virtual machine's directory on the host. In that case, the vmware.log file may contain entries similar to the following:

Sep 13 19:58:46: vcpu-1| MONITOR PANIC: ASSERT failed
Sep 13 19:58:46: vcpu-1| Core dump with build build-10104
Sep 13 19:58:46: vcpu-1| Writing monitor corefile "/root/vmware/vm1/vmware-core0.gz"

Verify that the guest operating system is fully certified for the ESXi host version.

If the guest operating system is not listed, the following steps may help to resolve the issue, but be aware that problems may be encountered in an uncertified guest operating system.

Verify that access to the storage hosting the virtual machine is available.

A virtual machine may fail if the LUN on which it is stored becomes unavailable.

To check this:
1. Connect to the ESXi host over SSH as root.
2. Navigate to the working directory of the virtual machine.
  
  Example:
  cd /vmfs/volumes/datastore/vm1/
3. If the files associated with the virtual machine (VMDK, VMX, NVRAM) are listed, there is working access to the storage hosting the virtual machine.
  
  If not, refer to Identifying Fibre Channel, iSCSI, and NFS storage issues on ESX/ESXi hosts.

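The file check in step 3 can also be sketched in code. This is a minimal illustration, assuming a Python interpreter is available wherever the virtual machine's directory is visible; the function name missing_vm_files is hypothetical, and the file types are the three named in step 3.

```python
from pathlib import Path

# File types the article names as evidence of working storage access.
EXPECTED_SUFFIXES = (".vmdk", ".vmx", ".nvram")


def missing_vm_files(vm_dir):
    """Return the expected suffixes that have no matching file in vm_dir.

    Raises OSError (for example FileNotFoundError) if the directory cannot
    be listed, which itself suggests a storage access problem.
    """
    present = {p.suffix.lower() for p in Path(vm_dir).iterdir() if p.is_file()}
    return [suffix for suffix in EXPECTED_SUFFIXES if suffix not in present]
```

For example, missing_vm_files("/vmfs/volumes/datastore/vm1/") returns an empty list when all three file types are present in the example directory used above.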
Verify that no software changes have been made that may have caused the failure. For more information, see Identifying critical Guest OS failures within virtual machines.

Verify that no hardware changes have been made that may have caused the failure. If the virtual machine's hardware configuration has changed recently, back the changes out temporarily for testing purposes. For more information, see Verifying the Virtual Hardware configuration of a virtual machine.